Flash Moe Inference

Name: Flash Moe Inference
Author: aradotso

aradotso/trending-skills

1.1k installs
66 repo stars
Updated July 9, 2026
aradotso/trending-skills

flash-moe-inference provides documented workflows for Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming

About

The flash-moe-inference skill run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming # Flash-MoE Inference Engine > Skill by [ara.so](https://ara.so) - Daily 2026 Skills collection. Flash-MoE is a pure C/Objective-C/Metal inference engine that runs **Qwen3.5-397B-A17B** (397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand - no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders. ## Requirements - **Hardware**: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free - **OS**: macOS 26+ (Darwin 25+) - **Tools**: Xcode Command Line Tools, Python 3.x (for weight extraction only) - **Model**: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace) ## Installation & Build ```bash # Clone the repo git clone https://github.com/danveloper/flash-moe cd flash-moe/metal_infer # Build everything make # Verify build artifacts ls infer chat main ``` The Makefile compiles `infer.m`, `chat.m`, `main.m` with Metal shader compilation for `shaders.metal`.

**Hardware**: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
**OS**: macOS 26+ (Darwin 25+)
**Tools**: Xcode Command Line Tools, Python 3.x (for weight extraction only)
**Model**: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)
45 GatedDeltaNet (linear attention) layers

Flash Moe Inference by the numbers

1,110 all-time installs (skills.sh)
+7 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #267 of 2,742 Automation & Workflows skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

flash-moe-inference capabilities & compatibility

Capabilities: **hardware**: apple silicon mac (m3 max or simil · **os**: macos 26+ (darwin 25+) · **tools**: xcode command line tools, python 3.x · **model**: qwen3.5 397b a17b safetensors weights · 45 gateddeltanet (linear attention) layers
Use cases: documentation

From the docs

What flash-moe-inference says it does

Flash-MoE is a pure C/Objective-C/Metal inference engine that runs **Qwen3.5-397B-A17B** (397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second.

SKILL.md

npx skills add https://github.com/aradotso/trending-skills --skill flash-moe-inference

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/aradotso/trending-skills/flash-moe-inference.svg)](https://skillselion.com/skills/aradotso/trending-skills/flash-moe-inference)

Installs	1.1k
repo stars	★ 66
Security audit	0 / 3 scanners passed
Last updated	July 9, 2026
Repository	aradotso/trending-skills ↗

How do I use flash-moe-inference for the task described in its SKILL.md triggers?

Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming

Who is it for?

Teams invoking flash-moe-inference when the user request matches documented triggers and prerequisites.

Skip if: Skip when cached docs are missing, the request is a negative trigger, or another sibling skill owns the workflow.

When should I use this skill?

Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming

What you get

Step-by-step guidance grounded in flash-moe-inference documentation and reference files.

local inference session
streamed MoE weight loads

By the numbers

397-billion-parameter Qwen3.5-397B-A17B Mixture-of-Experts model
MacBook Pro 48GB RAM inference at about 4.4 tokens per second
Approximately 209GB model size with SSD expert weight streaming

Files

SKILL.mdMarkdownGitHub ↗

Flash-MoE Inference Engine

Skill by ara.so — Daily 2026 Skills collection.

Flash-MoE is a pure C/Objective-C/Metal inference engine that runs Qwen3.5-397B-A17B (397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand — no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders.

Requirements

Hardware: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
OS: macOS 26+ (Darwin 25+)
Tools: Xcode Command Line Tools, Python 3.x (for weight extraction only)
Model: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)

Installation & Build

# Clone the repo
git clone https://github.com/danveloper/flash-moe
cd flash-moe/metal_infer

# Build everything
make

# Verify build artifacts
ls infer chat main

The Makefile compiles infer.m, chat.m, main.m with Metal shader compilation for shaders.metal.

Weight Preparation

Step 1: Extract non-expert weights

# From the metal_infer/ directory
# Point to your downloaded Qwen3.5-397B safetensors directory
python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/

# Produces:
#   model_weights.bin   (~5.5GB, mmap'd at runtime)
#   model_weights.json  (tensor manifest)
#   vocab.bin           (vocabulary)
#   tokenizer.bin       (BPE tokenizer data)

Step 2: Pack expert weights (4-bit, production)

# From repo root
python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/

# Produces packed_experts/ directory (~209GB)
# Each expert is a separate file: layer_XX_expert_YYYY.bin

Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)

# Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)
python3 metal_infer/repack_experts_2bit.py \
    metal_infer/packed_experts/ \
    metal_infer/packed_experts_2bit/

Key Commands

Basic inference

cd metal_infer

# 4-bit inference (production quality, tool calling works)
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster, breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing

Interactive chat with tool calling

./chat
# Opens TUI with full tool calling support
# Uses 4-bit experts by default

MoE-only benchmark (measures expert throughput)

./main
# Runs pure expert forward-pass benchmark
# Reports tokens/sec without attention overhead

Project Structure

flash-moe/
├── paper/
│   └── flash_moe.pdf          # Full technical paper
├── metal_infer/
│   ├── infer.m                # Complete inference engine (~7000 lines)
│   ├── shaders.metal          # Metal compute kernels (~1200 lines)
│   ├── chat.m                 # Interactive chat TUI
│   ├── tokenizer.h            # Single-header C BPE tokenizer (449 lines)
│   ├── main.m                 # MoE-only benchmark
│   ├── Makefile
│   ├── extract_weights.py     # Safetensors → model_weights.bin
│   ├── repack_experts_2bit.py # 4-bit → 2-bit requantization
│   ├── train_predictor.py     # Expert routing prediction analysis
│   ├── model_weights.bin      # Non-expert weights (mmap'd)
│   ├── model_weights.json     # Tensor manifest
│   ├── vocab.bin
│   ├── tokenizer.bin
│   ├── packed_experts/        # 4-bit expert files (209GB)
│   └── packed_experts_2bit/   # 2-bit expert files (120GB, optional)
├── repack_experts.py          # 4-bit expert packing from safetensors
├── progress.py                # Results visualization
└── results.tsv                # Experiment log

Architecture Overview

The model has 60 transformer layers:

45 GatedDeltaNet (linear attention) layers
15 standard full attention layers
Each layer: 512 experts, K=4 activated per token + 1 shared expert
Hidden dimension: 4096

Per-layer pipeline (4.28ms average at 4-bit)

CMD3(prev) → CMD1: attention projections + delta-net  [1.22ms GPU]
           → CPU: flush results                       [0.01ms CPU]  
           → CMD2: o_proj + norm + routing + shared    [0.55ms GPU]
           → CPU: softmax + topK routing               [0.003ms]
           → I/O: parallel pread K=4 experts           [2.41ms SSD]
           → CMD3: expert forward + combine + norm     [0.04ms encode, DEFERRED]

Metal Shader Kernels

The shaders.metal file contains hand-written kernels. Key kernels:

// 4-bit dequantized matrix-vector multiply (FMA-optimized)
// Key insight: fma(nibble, scale*x, bias*x) instead of (nibble*scale + bias)*x
// Pre-compute scale*x and bias*x to fuse dequant+multiply in one FMA instruction

kernel void matvec_4bit_fma(
    device const uint8_t* weights [[buffer(0)]],
    device const float* scales    [[buffer(1)]],
    device const float* biases    [[buffer(2)]],
    device const float* x         [[buffer(3)]],
    device float* out             [[buffer(4)]],
    uint tid [[thread_position_in_threadgroup]],
    uint gid [[threadgroup_position_in_grid]])
{
    // ... tiled SIMD-reduced FMA kernel
    // 12% faster than naive (nibble * scale + bias) * x
}

// Fused SwiGLU activation
kernel void swiglu(device float* gate [[buffer(0)]],
                   device const float* up [[buffer(1)]],
                   uint gid [[thread_position_in_grid]])
{
    float g = gate[gid];
    gate[gid] = (g / (1.0f + exp(-g))) * up[gid];
}

// RMS normalization (two-pass)
kernel void rms_norm_pass1(...) // sum of squares reduction
kernel void rms_norm_pass2(...) // apply normalization

// GPU RoPE (fused with Q deinterleave and K normalization)
kernel void rope_qk(...)

// MoE combine + residual + sigmoid gate (fused)
kernel void moe_combine_residual(...)

SSD Expert Streaming Pattern

The core innovation — loading only K=4 active experts per layer from SSD:

// Parallel expert loading using GCD dispatch groups
// From infer.m (conceptual pattern)

dispatch_group_t group = dispatch_group_create();
dispatch_queue_t ioQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);

for (int k = 0; k < K_EXPERTS; k++) {
    int expert_id = top_k_indices[k];
    dispatch_group_async(group, ioQueue, ^{
        // Each expert: ~6.75MB at 4-bit
        char path[256];
        snprintf(path, sizeof(path), 
                 "packed_experts/layer_%02d_expert_%04d.bin",
                 layer, expert_id);
        
        int fd = open(path, O_RDONLY);
        // pread() — non-blocking, OS page cache handles LRU
        pread(fd, expert_buffer[k], expert_size, 0);
        close(fd);
    });
}

dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
// GPU compute follows — serial pipeline is hardware-optimal on Apple Silicon

Why `pread()` not `mmap()`: mmap incurs per-page fault overhead on cold data (~5x slower). Direct pread() with OS page cache achieves ~71% hit rate naturally.

GatedDeltaNet Linear Attention (BLAS)

The recurrence update uses Accelerate BLAS — 64% faster than scalar:

// GatedDeltaNet state update per head (conceptual pattern)
// state: 128×128 float matrix, 64 heads
// From infer.m

#import <Accelerate/Accelerate.h>

for (int h = 0; h < 64; h++) {
    float* S = state + h * 128 * 128;  // 128×128 state matrix
    float* q = Q + h * 128;
    float* k = K + h * 128;
    float* v = V + h * 128;
    
    // β·(k⊗v) outer product update
    // cblas_sger: S += beta * (k ⊗ v)
    cblas_sger(CblasRowMajor, 128, 128,
               beta[h], k, 1, v, 1, S, 128);
    
    // Decay: S = alpha * S
    cblas_sscal(128 * 128, alpha[h], S, 1);
    
    // Output: o = S @ q
    cblas_sgemv(CblasRowMajor, CblasNoTrans,
                128, 128, 1.0f, S, 128, q, 1, 0.0f,
                output + h * 128, 1);
}

Performance Configuration

4-bit (production default)

Quality: Excellent — full tool calling, correct JSON
Speed: 4.36 tok/s
Disk: 209GB

2-bit (speed testing only)

Quality: Good — but breaks JSON/tool calling (\name\ instead of "name")
Speed: 5.74 tok/s (7.05 peak single token with warm cache)
Disk: 120GB
Uses F_NOCACHE flag to avoid page cache thrashing

What NOT to Try (Learned from 58 Experiments)

Approach	Why it fails
`mmap()` expert files	Per-page fault overhead: 5x slower than `pread()`
`dispatch_io`	`dispatch_data` management overhead: -70%
`F_RDADVISE` prefetch	SSD DMA + GPU share memory controller — concurrent access: -73% GPU speed
Custom Metal LRU cache	GPU memory pressure: -38% vs OS page cache
LZ4 expert compression	Decompress overhead > warm cache savings: -13%
Temporal expert prediction	25% hit rate, wastes SSD bandwidth: -18%
Speculative early routing	Cache pollution: -38%
MTP speculative decoding	MoE I/O scales per-token (unlike dense models): break-even
Spin-poll GPU wait	CPU thermal throttle competes with GPU: -23%
Parallel SSD + GPU overlap	Unified memory controller arbitration: net negative

Key principle: On Apple Silicon, GPU DMA and SSD DMA share the same memory controller. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.

Troubleshooting

Build fails

# Ensure Xcode CLI tools are installed
xcode-select --install

# Check Metal compiler is available
xcrun -sdk macosx metal --version

Out of memory

The engine is designed to use ~6GB active:

5.5GB: model_weights.bin (mmap'd, read-only)
~200MB: Metal scratch buffers
Remaining ~42GB: OS page cache for expert data

If you see OOM, check for other processes consuming unified memory:

sudo memory_pressure
vm_stat

Slow performance

# Check SSD speed — needs ~17GB/s for target performance
# Run with timing to identify bottleneck
./infer --prompt "Hello" --tokens 5 --timing

# Verify packed_experts/ is on internal SSD, not external drive
diskutil info /

Wrong expert directory

# Default paths expected by infer.m:
# metal_infer/packed_experts/     (4-bit)
# metal_infer/packed_experts_2bit/ (2-bit)

# Ensure you're running from metal_infer/ directory
cd metal_infer
./infer --prompt "test"

Tool calling broken

Use 4-bit, not 2-bit. The 2-bit quantization corrupts quote characters in JSON output, making tool calling unreliable. Always use the default 4-bit configuration for agentic workloads.

Memory Safety

The engine explicitly manages all allocations:

No unbounded caches
Expert data never accumulates in GPU memory
model_weights.bin is mmap'd read-only — kernel manages pages
Expert files are opened/read/closed per inference step

Related skills

Agent BrowserGive their coding agent reliable, high-fidelity control over any website or Electron desktop app.577k39.1k

Lark ApprovalLet their AI coding agent create, read, update, and approve items in Lark (Feishu) approval workflows without leaving the coding environment.471k

Lark EventHandle Feishu/Lark bot events, webhooks, and subscription callbacks in agent-driven backend code.471k

Lark Workflow Meeting SummaryAutomatically generate structured meeting summaries and action items from Lark/Feishu calls without manual note-taking.470k

Lark Workflow Standup ReportAutomatically generate and post daily standup reports in Lark/Feishu from their workflow and activity data.470k

Lark Vc AgentGive their coding agent the ability to read, create, and update documents, tasks, and wiki pages inside Feishu (Lark).415k

How it compares

Use for native Metal MoE on Mac instead of cloud API calls or Python GPU stacks.

FAQ

What does flash-moe-inference do?

Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming

When should I use flash-moe-inference?

Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming

What are common prerequisites?

--- name: flash-moe-inference description: Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming triggers: - run a large language model on my laptop - stream expert weights from SS

Is Flash Moe Inference safe to install?

skills.sh reports 0 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Automation & Workflowsintegrations

About

Flash Moe Inference by the numbers

flash-moe-inference capabilities & compatibility

What flash-moe-inference says it does

Add your badge

How do I use flash-moe-inference for the task described in its SKILL.md triggers?

Who is it for?

When should I use this skill?

What you get

By the numbers

Files

Flash-MoE Inference Engine

Requirements

Installation & Build

Weight Preparation

Step 1: Extract non-expert weights

Step 2: Pack expert weights (4-bit, production)

Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)

Key Commands

Basic inference

Interactive chat with tool calling

MoE-only benchmark (measures expert throughput)

Project Structure

Architecture Overview

Per-layer pipeline (4.28ms average at 4-bit)

Metal Shader Kernels

SSD Expert Streaming Pattern

GatedDeltaNet Linear Attention (BLAS)

Performance Configuration

4-bit (production default)

2-bit (speed testing only)

What NOT to Try (Learned from 58 Experiments)

Troubleshooting

Build fails

Out of memory

Slow performance

Wrong expert directory

Tool calling broken

Memory Safety

Related skills

How it compares

FAQ

What does flash-moe-inference do?

When should I use flash-moe-inference?

What are common prerequisites?

Is Flash Moe Inference safe to install?

This week in AI coding