Mac Code Local Ai Agent

Name: Mac Code Local Ai Agent
Author: aradotso

aradotso/trending-skills

811 installs
66 repo stars
Updated July 9, 2026
aradotso/trending-skills

mac-code-local-ai-agent is a setup skill that runs a 35B-parameter coding agent on Apple Silicon Macs using local LLMs via llama.cpp or MLX with no cloud dependency or monthly fees.

About

mac-code-local-ai-agent is an ara.so trending-skills module for running mac-code, a CLI AI coding agent on Apple Silicon at $0/month. The skill guides loading a 35B reasoning model locally through llama.cpp or MLX with web search, shell, and file tools attached. Triggers include local LLM agent setup, Claude Code alternatives on Mac, and out-of-RAM inference tuning on 16GB machines. Developers reach for it when they want an on-device coding agent with tool use instead of cloud API billing. The skill covers setup, usage, and memory constraints for local inference.

Runs 35B MoE reasoning model locally at 30 tokens per second on 16 GB Apple Silicon
LLM-as-router automatically classifies prompts as search, shell, or chat and dispatches tools
Supports both llama.cpp (IQ2_M quantization) and MLX (64K context with persistent KV cache) backends
Includes web search, shell execution, file read/write, and context compression tools
Fits full 35B Q4 model in 16 GB RAM using custom Expert Sniper technique

Mac Code Local Ai Agent by the numbers

811 all-time installs (skills.sh)
+6 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #1,299 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/aradotso/trending-skills --skill mac-code-local-ai-agent

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/aradotso/trending-skills/mac-code-local-ai-agent.svg)](https://skillselion.com/skills/aradotso/trending-skills/mac-code-local-ai-agent)

Installs	811
repo stars	★ 66
Security audit	0 / 3 scanners passed
Last updated	July 9, 2026
Repository	aradotso/trending-skills ↗

How do you run a local coding agent on Mac?

Run a powerful 35B-parameter coding agent entirely on-device with no monthly fees or cloud dependency.

Who is it for?

Apple Silicon Mac developers who want a free on-device coding agent with shell and file tools instead of cloud LLM APIs.

Skip if: Linux or Windows local agent setups, or teams that require managed cloud model APIs and centralized billing.

When should I use this skill?

The user wants a local LLM coding agent on Mac, llama.cpp or MLX setup, or a free Claude Code alternative.

What you get

Configured mac-code CLI agent with local 35B model, tool access, and Apple Silicon inference setup

Configured mac-code CLI agent
Local LLM inference setup
Tool-enabled agent runtime

By the numbers

Runs a 35B reasoning model locally on Apple Silicon
$0/month local agent setup with no cloud dependency

Files

SKILL.mdMarkdownGitHub ↗

mac-code — Free Local AI Agent on Apple Silicon

Skill by ara.so — Daily 2026 Skills collection.

Run a 35B reasoning model locally on your Mac for $0/month. mac-code is a CLI AI coding agent (Claude Code alternative) that routes tasks — web search, shell commands, file edits, chat — through a local LLM. Supports llama.cpp (30 tok/s) and MLX (64K context, persistent KV cache) backends on Apple Silicon.

---

What It Does

LLM-as-router: The model classifies every prompt as search, shell, or chat and routes accordingly
35B MoE at 30 tok/s via llama.cpp + IQ2_M quantization (fits in 16 GB RAM)
35B full Q4 on 16 GB via custom MoE Expert Sniper (1.54 tok/s, only 1.42 GB RAM used)
9B at 64K context via quantized KV cache (q4_0 keys/values)
MLX backend adds persistent KV cache save/load, context compression, R2 sync
Tools: DuckDuckGo search, shell execution, file read/write

---

Installation

Prerequisites

brew install llama.cpp
pip3 install rich ddgs huggingface-hub mlx-lm --break-system-packages

Clone the repo

git clone https://github.com/walter-grace/mac-code
cd mac-code

Download models

35B MoE — fast daily driver (10.6 GB, fits in 16 GB RAM):

mkdir -p ~/models
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
    local_dir='$HOME/models/'
)
"

9B — 64K context, long documents (5.3 GB):

python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-9B-GGUF',
    'Qwen3.5-9B-Q4_K_M.gguf',
    local_dir='$HOME/models/'
)
"

---

Starting the Backend

Option A: llama.cpp + 35B MoE (recommended, 30 tok/s)

llama-server \
    --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --host 127.0.0.1 \
    --flash-attn on --ctx-size 12288 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 --reasoning off -np 1 -t 4

Option B: llama.cpp + 9B (64K context)

llama-server \
    --model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
    --port 8000 --host 127.0.0.1 \
    --flash-attn on --ctx-size 65536 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 --reasoning off -t 4

Option C: MLX backend (persistent context, 9B)

# Starts server on port 8000, downloads model on first run
python3 mlx/mlx_engine.py

Start the agent (all options)

python3 agent.py

---

Agent CLI Commands

Inside the agent REPL, type / for all commands:

Command	Action
`/agent`	Agent mode with tools (default)
`/raw`	Direct streaming, no tools
`/model 9b`	Switch to 9B model (64K context)
`/model 35b`	Switch to 35B MoE
`/search <query>`	Quick DuckDuckGo search
`/bench`	Run speed benchmark
`/stats`	Session statistics
`/cost`	Show cost savings vs cloud
`/good` / `/bad`	Grade the last response
`/improve`	View response grading stats
`/clear`	Reset conversation
`/quit`	Exit

Example prompts

> find all Python files modified in the last 7 days
→ routes to "shell", generates: find . -name "*.py" -mtime -7

> who won the NBA finals
→ routes to "search", queries DuckDuckGo, summarizes

> explain how attention works
→ routes to "chat", streams directly

---

MLX Backend — Persistent KV Cache API

The MLX engine exposes a REST API on localhost:8000.

Save context after processing a large codebase

curl -X POST localhost:8000/v1/context/save \
    -H "Content-Type: application/json" \
    -d '{"name": "my-project", "prompt": "$(cat README.md)"}'

Load saved context instantly (0.0003s)

curl -X POST localhost:8000/v1/context/load \
    -H "Content-Type: application/json" \
    -d '{"name": "my-project"}'

Download context from Cloudflare R2 (cross-Mac sync)

# Requires R2 credentials in environment
export R2_ACCOUNT_ID=your_account_id
export R2_ACCESS_KEY_ID=your_key_id
export R2_SECRET_ACCESS_KEY=your_secret
export R2_BUCKET=your_bucket_name

curl -X POST localhost:8000/v1/context/download \
    -H "Content-Type: application/json" \
    -d '{"name": "my-project"}'

Standard OpenAI-compatible chat

import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Write a Python quicksort"}],
    "stream": False
})
print(response.json()["choices"][0]["message"]["content"])

Streaming chat

import requests, json

with requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Explain transformers"}],
    "stream": True
}, stream=True) as r:
    for line in r.iter_lines():
        if line.startswith(b"data: "):
            chunk = json.loads(line[6:])
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)

---

KV Cache Compression (MLX)

Compress context 4x with 99.3% similarity:

from mlx.turboquant import compress_kv_cache
from mlx.kv_cache import save_kv_cache, load_kv_cache

# After building a KV cache from a long document
compressed = compress_kv_cache(kv_cache, bits=4)  # 26.6 MB → 6.7 MB
save_kv_cache(compressed, "my-project-compressed")

# Load later
kv = load_kv_cache("my-project-compressed")

---

Flash Streaming — Out-of-Core Inference

For models larger than your RAM (research mode):

cd research/flash-streaming

# Run 35B MoE Expert Sniper (22 GB model, 1.42 GB RAM)
python3 moe_expert_sniper.py

# Run 32B dense flash stream (18.4 GB model, 4.5 GB RAM)
python3 flash_stream_v2.py

How F_NOCACHE direct I/O works

import os, fcntl

# Open model file bypassing macOS Unified Buffer Cache
fd = os.open("model.bin", os.O_RDONLY)
fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)  # bypass page cache

# Aligned read (16KB boundary for DART IOMMU)
ALIGN = 16384
offset = (layer_offset // ALIGN) * ALIGN
data = os.pread(fd, layer_size + ALIGN, offset)
weights = data[layer_offset - offset : layer_offset - offset + layer_size]

MoE Expert Sniper pattern

# Router predicts which 8 of 256 experts activate per token
active_experts = router_forward(hidden_state)  # returns [8] indices

# Load only those experts from SSD (8 threads, parallel pread)
from concurrent.futures import ThreadPoolExecutor

def load_expert(expert_idx):
    offset = expert_offsets[expert_idx]
    return os.pread(fd, expert_size, offset)

with ThreadPoolExecutor(max_workers=8) as pool:
    expert_weights = list(pool.map(load_expert, active_experts))

# ~14 MB loaded per layer instead of 221 MB (dense)

---

Common Patterns

Use as a Python library (direct API calls)

import requests

BASE = "http://localhost:8000/v1"

def ask(prompt: str, system: str = "You are a helpful coding assistant.") -> str:
    r = requests.post(f"{BASE}/chat/completions", json={
        "model": "local",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ]
    })
    return r.json()["choices"][0]["message"]["content"]

# Examples
print(ask("Write a Python function to parse JSON safely"))
print(ask("Explain this error: AttributeError: NoneType has no attribute split"))

Process a large file with paged inference

from mlx.paged_inference import PagedInference

engine = PagedInference(model="mlx-community/Qwen3.5-9B-4bit")

with open("large_codebase.txt") as f:
    content = f.read()  # beyond single context window

# Automatically pages through content
result = engine.summarize(content, question="What does this codebase do?")
print(result)

Monitor server performance

python3 dashboard.py

---

Model Selection Guide

Your Mac RAM	Best Option	Command
8 GB	9B Q4_K_M	`--model ~/models/Qwen3.5-9B-Q4_K_M.gguf --ctx-size 4096`
16 GB	35B IQ2_M (30 tok/s)	Default Option A above
16 GB (quality)	35B Q4 Expert Sniper	`python3 research/flash-streaming/moe_expert_sniper.py`
48 GB	35B Q4_K_M native	Download full Q4, `--n-gpu-layers 99`
192 GB	397B frontier	Any large GGUF, full offload

---

Troubleshooting

Server not responding on port 8000

# Check if server is running
curl http://localhost:8000/health

# Check what's on port 8000
lsof -i :8000

# Restart llama-server with verbose logging
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --verbose

Model download fails / incomplete

# Resume interrupted download
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
    local_dir='$HOME/models/',
    resume_download=True
)
"

Slow inference / RAM pressure on 16 GB Mac

# Reduce context size to free RAM
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --ctx-size 4096 \   # reduced from 12288
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 -t 4

# Or switch to 9B for lower RAM usage
python3 agent.py
# Then: /model 9b

MLX engine crashes with memory error

# MLX uses unified memory — check pressure
vm_stat | grep "Pages free"

# Reduce batch size in mlx_engine.py
# Edit: max_batch_size = 512  →  max_batch_size = 128

F_NOCACHE not bypassing page cache (macOS Sonoma+)

# Verify F_NOCACHE is active
import fcntl, os
fd = os.open(model_path, os.O_RDONLY)
result = fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
assert result == 0, "F_NOCACHE failed — check macOS version and SIP status"

`ddgs` search fails

pip3 install --upgrade ddgs --break-system-packages
# ddgs uses DuckDuckGo — no API key required, but may rate-limit
# Retry after 60 seconds if you get a 202 response

Wrong reshape on GGUF dequantization

# GGUF tensors are column-major — correct reshape:
weights = dequantized_flat.reshape(ne[1], ne[0])   # CORRECT
# NOT: dequantized_flat.reshape(ne[0], ne[1]).T     # WRONG

---

Architecture Summary

agent.py
  ├── Intent classification → "search" | "shell" | "chat"
  ├── search → ddgs.DDGS().text() → summarize
  ├── shell  → generate command → subprocess.run()
  └── chat   → stream directly

Backends (both expose OpenAI-compatible API on :8000)
  ├── llama.cpp  → fast, standard, no persistence
  └── mlx/       → KV cache save/load/compress/sync

Flash Streaming (research/)
  ├── moe_expert_sniper.py  → 35B Q4, 1.42 GB RAM
  └── flash_stream_v2.py    → 32B dense, 4.5 GB RAM
      └── F_NOCACHE + pread + 16KB alignment

Related skills

Setup Matt Pocock SkillsScaffold the per-repo configuration that Matt Pocock’s engineering agent skills rely on so they understand the issue tracker, triage labels, and domain documentation la462k185k

Lark Skill MakerQuickly turn any Lark/Feishu OpenAPI call or multi-step workflow into a reusable agent skill with its own SKILL.md.379k15.8k

CavemanSlash token usage by roughly 75% while keeping every technical detail intact when working with Claude Code, Cursor or similar agents.378k92.5k

Lark AppsConnect Claude, Cursor or custom agents directly to Lark (Feishu) for messaging, document automation, approval workflows and enterprise data access.375k

Running Claude Code Via Litellm CopilotRun Claude Code at a fraction of the cost by routing requests through LiteLLM to the GitHub Copilot Chat API.270k72

Codex PetGenerate a complete Codex Pet spritesheet and metadata from one reference image without needing an OpenAI key or Codex Pro.246k8

FAQ

Which inference backends does mac-code-local-ai-agent support?

mac-code-local-ai-agent guides llama.cpp and MLX backends for running a 35B local reasoning model on Apple Silicon with mac-code CLI tool integrations.

What tools does mac-code attach to the local agent?

mac-code-local-ai-agent configures mac-code with web search, shell execution, and file operation tools so the local 35B coding agent can act on the developer machine.

Is Mac Code Local Ai Agent safe to install?

skills.sh reports 0 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

AI & Agent Buildingagentsautomation