
Flash Moe Inference
Build and run Flash-MoE on Apple Silicon to stream a 397B MoE model from SSD for local chat and inference without a Python ML stack.
Overview
flash-moe-inference is an agent skill for the Build phase that guides building and running Flash-MoE to stream a 397B MoE LLM from SSD on Apple Silicon using C and Metal.
Install
npx skills add https://github.com/aradotso/trending-skills --skill flash-moe-inference
What is this skill?
- Runs Qwen3.5-397B-A17B (397B MoE) with SSD-streamed expert weights (~209GB)
- Pure C/Objective-C/Metal—no Python inference runtime; hand-tuned Metal shaders
- Documented 4.4+ tokens/second on 48GB unified memory (M3 Max class)
- `make` builds infer, chat, and main binaries from metal_infer
- 397B parameter MoE
- 209GB expert weights streamed
Adoption & trust: 1k installs on skills.sh; 31 GitHub stars; 0/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want to run a 200GB-class MoE model on an Apple Silicon laptop, but the weights far exceed your RAM and standard Python frameworks cannot meet your memory or latency goals.
Who is it for?
Advanced indie builders on high-RAM Macs who need offline or private MoE experiments and can spare SSD space for expert shards.
Skip if: you deploy only to cloud services such as Runpod or Railway, you run Windows or Linux CUDA setups, or you are a beginner without Xcode Command Line Tools or the capacity for very large model downloads.
When should I use this skill?
User wants to run a large MoE LLM on a MacBook, stream expert weights from SSD, or use the Flash-MoE Metal inference engine (including Qwen3.5-397B triggers).
What do I get? / Deliverables
You clone Flash-MoE, compile infer and chat binaries, prepare streamed weights, and run local MoE inference at documented token speeds without a full framework stack.
- Built infer, chat, and main binaries
- Prepared streamed weight layout on SSD
- Local MoE chat/inference session
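The deliverables above can be verified with a short script. The following is a minimal sketch, assuming the artifact names given in the skill's SKILL.md (`infer`, `chat`, `main`, `model_weights.bin`, `model_weights.json`, `vocab.bin`, `tokenizer.bin`, and a `packed_experts/` directory, all under `metal_infer/`). The function name `missing_artifacts` is hypothetical and not part of Flash-MoE.

```python
from pathlib import Path

# Artifact names taken from the skill's SKILL.md; adjust if the repo layout differs.
EXPECTED_FILES = [
    "infer", "chat", "main",                  # binaries built by `make`
    "model_weights.bin", "model_weights.json",
    "vocab.bin", "tokenizer.bin",             # produced by extract_weights.py
]
EXPECTED_DIRS = ["packed_experts"]            # 4-bit expert files


def missing_artifacts(metal_infer_dir):
    """Return the expected files and directories that are absent."""
    root = Path(metal_infer_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

Calling `missing_artifacts("flash-moe/metal_infer")` returns an empty list once the build and weight preparation have completed. It checks only that the files exist, not that their contents are valid.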
Journey fit
Weight prep, compile, and local inference setup are product-building work before you treat the stack as production ops. Agent-tooling fits because the skill enables local LLM runtime tooling for builders experimenting with large MoE models on Mac hardware.
How it compares
Flash-MoE is an on-device Metal MoE streaming engine, not a cloud GPU workflow like runpodctl or a generic one-click model pull like Ollama.
Common Questions / FAQ
Who is flash-moe-inference for?
Solo builders with Apple Silicon Macs, large SSDs, and ML curiosity who want agent-guided steps to compile and run Flash-MoE locally.
When should I use flash-moe-inference?
Use it in Build (agent-tooling) when triggers mention running Qwen3.5 397B on Mac, SSD expert streaming, or Metal MoE inference; revisit Operate only after you own process supervision for long local runs.
Is flash-moe-inference safe to install?
Review the Security Audits panel on this page; the workflow clones third-party repos, compiles native code, and downloads very large model weights from external hosts.
SKILL.md
# Flash-MoE Inference Engine

> Skill by [ara.so](https://ara.so), Daily 2026 Skills collection.

Flash-MoE is a pure C/Objective-C/Metal inference engine that runs **Qwen3.5-397B-A17B** (397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand: no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders.

## Requirements

- **Hardware**: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
- **OS**: macOS 26+ (Darwin 25+)
- **Tools**: Xcode Command Line Tools, Python 3.x (for weight extraction only)
- **Model**: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)

## Installation & Build

```bash
# Clone the repo
git clone https://github.com/danveloper/flash-moe
cd flash-moe/metal_infer

# Build everything
make

# Verify build artifacts
ls infer chat main
```

The Makefile compiles `infer.m`, `chat.m`, `main.m` with Metal shader compilation for `shaders.metal`.
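Before downloading the weights, it can help to check the hardware requirements listed above. The following is a minimal sketch, not part of Flash-MoE: the names `check_requirements` and `probe_host` are hypothetical, and the thresholds assume 48GB of memory means 48 GiB and ~210GB of disk means 210 decimal gigabytes. It does not check for the Xcode Command Line Tools or the macOS version.

```python
import platform
import shutil
import subprocess

GIB = 2**30  # assumption: unified memory is sized in binary units
GB = 10**9   # assumption: disk space is quoted in decimal units

MIN_MEMORY_BYTES = 48 * GIB      # "48GB+ unified memory"
MIN_FREE_DISK_BYTES = 210 * GB   # "~210GB free"


def check_requirements(machine, memory_bytes, free_disk_bytes):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if machine != "arm64":
        problems.append(f"expected Apple Silicon (arm64), found {machine}")
    if memory_bytes < MIN_MEMORY_BYTES:
        problems.append(f"unified memory {memory_bytes / GIB:.0f} GiB is below 48 GiB")
    if free_disk_bytes < MIN_FREE_DISK_BYTES:
        problems.append(f"free disk {free_disk_bytes / GB:.0f} GB is below 210 GB")
    return problems


def probe_host(path="."):
    """Collect the inputs on macOS; `sysctl hw.memsize` is macOS-specific."""
    memsize = subprocess.run(
        ["sysctl", "-n", "hw.memsize"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return {
        "machine": platform.machine(),  # "x86_64" if Python runs under Rosetta
        "memory_bytes": int(memsize),
        "free_disk_bytes": shutil.disk_usage(path).free,
    }
```

On the target Mac, `check_requirements(**probe_host("/path/to/flash-moe"))` returns an empty list when all three checks pass. The check logic is kept separate from the host probing so the thresholds can be adjusted or tested without macOS.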
## Weight Preparation

### Step 1: Extract non-expert weights

```bash
# From the metal_infer/ directory
# Point to your downloaded Qwen3.5-397B safetensors directory
python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/

# Produces:
#   model_weights.bin   (~5.5GB, mmap'd at runtime)
#   model_weights.json  (tensor manifest)
#   vocab.bin           (vocabulary)
#   tokenizer.bin       (BPE tokenizer data)
```

### Step 2: Pack expert weights (4-bit, production)

```bash
# From repo root
python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/

# Produces packed_experts/ directory (~209GB)
# Each expert is a separate file: layer_XX_expert_YYYY.bin
```

### Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)

```bash
# Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)
python3 metal_infer/repack_experts_2bit.py \
  metal_infer/packed_experts/ \
  metal_infer/packed_experts_2bit/
```

## Key Commands

### Basic inference

```bash
cd metal_infer

# 4-bit inference (production quality, tool calling works)
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster, breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
```

### Interactive chat with tool calling

```bash
./chat
# Opens TUI with full tool calling support
# Uses 4-bit experts by default
```

### MoE-only benchmark (measures expert throughput)

```bash
./main
# Runs pure expert forward-pass benchmark
# Reports tokens/sec without attention overhead
```

## Project Structure

```
flash-moe/
├── paper/
│   └── flash_moe.pdf           # Full technical paper
├── metal_infer/
│   ├── infer.m                 # Complete inference engine (~7000 lines)
│   ├── shaders.metal           # Metal compute kernels (~1200 lines)
│   ├── chat.m                  # Interactive chat TUI
│   ├── tokenizer.h             # Single-header C BPE tokenizer (449 lines)
│   ├── main.m                  # MoE-only benchmark
│   ├── Makefile
│   ├── extract_weights.py      # Safetensors → model_weights.bin
│   ├── repack_experts_2bit.py  # 4-bit → 2-bit requantization
│   ├── train_predictor.py      # Expert routing prediction analysis
│   ├── model_weights.bin       # Non-expert weights (mmap'd)
│   ├── model_weights.json      # Tensor manifest
│   ├── vocab.bin
│   ├── tokenizer.bin
│   ├── packed_experts/         # 4-bit expert files (209GB