
Nanochat LLM Training
Run Karpathy’s nanochat end-to-end on a GPU node to pretrain, finetune, evaluate, and chat with a GPT-2–class model for a fraction of historical cost.
Overview
nanochat-llm-training is an agent skill for the Build phase that runs Karpathy’s nanochat pipeline from tokenization through pretraining, finetuning, evaluation, inference, and a chat UI on a GPU node.
Install
npx skills add https://github.com/aradotso/trending-skills --skill nanochat-llm-training
What is this skill?
- End-to-end harness: tokenization, pretraining, SFT, RL, DCLM CORE eval, KV-cache inference, and a ChatGPT-like web UI.
- Single `--depth` complexity dial auto-sets width, heads, LR, horizon, and weight decay for compute-optimal runs.
- Documented GPT-2-class speedrun on 8×H100 (~2–3 hours, ~$48) versus ~$43k in 2019.
- uv-managed install (`uv sync`) and reference commands for full speedrun workflows.
- Triggers cover GPU node setup, hyperparameters, finetuning, leaderboard speedruns, and chat UI.
Adoption & trust: 1.2k installs on skills.sh; 31 GitHub stars; 1/3 security scanners passed (skills.sh audits).
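As a rough sanity check on the cost figures quoted above, the numbers imply an hourly rental rate. This is a back-of-the-envelope sketch, assuming the ~$48 total covers all eight GPUs for the whole run; the variable and function names are illustrative, and no actual provider price is implied.

```bash
#!/usr/bin/env bash
# Back-of-the-envelope check using only the figures quoted above.
# Assumption: the ~$48 total covers all 8 GPUs for the full run.
total_usd=48
gpus=8
historical_usd=43000

# Implied rates for the two ends of the quoted ~2-3 hour range.
implied_rates() {
  local hours
  for hours in 2 3; do
    awk -v t="$total_usd" -v g="$gpus" -v h="$hours" \
      'BEGIN { printf "%d h run: $%.2f per GPU-hour, $%.2f per node-hour\n", h, t / (g * h), t / h }'
  done
}

# Ratio between the cited 2019 cost and the speedrun cost.
cost_ratio() {
  awk -v old="$historical_usd" -v new="$total_usd" \
    'BEGIN { printf "cost ratio vs 2019: ~%.0fx\n", old / new }'
}

implied_rates
cost_ratio
```

Under these assumptions the quoted range works out to roughly $2 to $3 per GPU-hour, so a node priced well above that will not reproduce the ~$48 figure.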
What problem does it solve?
You want a GPT-2–level model you control, but full-stack LLM training scripts, hyperparameters, and eval harnesses feel too fragmented to run on one machine.
Who is it for?
Indie builders with access to a multi-GPU or strong single-GPU node who want a minimal, auditable training path before wiring models into agents.
Skip if: you need managed training, multi-tenant serving SLAs, or production MLOps and do not want to touch CUDA, uv, or long-running GPU jobs.
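Which command applies depends on the node you have. The sketch below picks between the three launch styles listed in the SKILL.md on this page (CPU script, single-process `python`, multi-GPU `torchrun`). `pretrain_cmd` and `gpu_count` are hypothetical helpers, not part of nanochat, and passing a GPU count other than 1 or 8 to `--nproc_per_node` is an assumption that the page does not document.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of nanochat): print the pretraining command
# for a given GPU count, reusing the command shapes from the SKILL.md.
pretrain_cmd() {
  local gpus="$1" depth="$2" run="$3"
  if [ "$gpus" -eq 0 ]; then
    # No GPU visible: fall back to the tiny CPU / Apple Silicon script.
    printf 'bash runs/runcpu.sh\n'
  elif [ "$gpus" -eq 1 ]; then
    printf 'python -m scripts.base_train -- --depth=%s --run="%s"\n' "$depth" "$run"
  else
    # Assumption: GPU counts other than 8 are passed straight to torchrun.
    printf 'OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=%s -m scripts.base_train -- --depth=%s --run="%s"\n' \
      "$gpus" "$depth" "$run"
  fi
}

# Count visible NVIDIA GPUs; report 0 when nvidia-smi is missing or lists none.
gpu_count() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L 2>/dev/null | grep -c '^GPU '
  else
    echo 0
  fi
}

# Print (do not run) the suggested command for this machine.
pretrain_cmd "$(gpu_count)" 12 "quick_test"
```

The helper only prints the command, so you can review it before starting a long GPU job.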
When should I use this skill?
Use it when a user asks to train with nanochat, run pretraining or finetuning, reproduce GPT-2, set up a GPU node, use the speedrun leaderboard, tune `--depth`, or use the chat UI.
What do I get? / Deliverables
You get a documented nanochat workflow, from `uv sync` through speedrun training and chat, so you can reproduce the reference run, tune `--depth`, and talk to your own checkpoint.
- Trained nanochat checkpoint after pretraining and optional SFT/RL
- Evaluation against DCLM CORE-style metrics
- Runnable inference and ChatGPT-like chat UI against the model
Journey fit
The canonical shelf is Build because the skill’s job is to execute a full training and inference pipeline, not ideation or go-to-market work. The agent-tooling category fits solo builders who ship custom models and chat harnesses alongside their agent stack, not only app CRUD.
How it compares
Use it when you want a self-hosted training harness on your own GPU node; it is not meant for workflows that only call hosted foundation-model APIs.
Common Questions / FAQ
Who is nanochat-llm-training for?
Solo and indie developers who already ship with coding agents and want to train or finetune a small LLM with nanochat instead of outsourcing the whole stack.
When should I use nanochat-llm-training?
During Build when you are standing up pretraining or finetuning, reproducing a GPT-2-class speedrun, configuring depth on an 8×H100 node, or opening the bundled chat UI against your checkpoint.
Is nanochat-llm-training safe to install?
Treat it like any repo you clone and run on GPU infrastructure: review the Security Audits panel on this Prism page and inspect nanochat’s code and shell commands before executing on production machines.
SKILL.md
# nanochat LLM Training

> Skill by [ara.so](https://ara.so) (Daily 2026 Skills collection).

nanochat is Karpathy's minimal, hackable harness for training LLMs end-to-end on a single GPU node. It covers tokenization, pretraining, SFT finetuning, RL, evaluation (DCLM CORE score), inference with KV cache, and a ChatGPT-like web UI.

A single complexity dial (`--depth`) auto-configures all other hyperparameters (width, heads, LR, training horizon, weight decay) for compute-optimal training. You can reproduce GPT-2 capability (~$43,000 in 2019) for ~$48 on an 8×H100 node (~2 hours).

## Installation

nanochat uses `uv` for dependency management:

```bash
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv and install deps
uv sync
source .venv/bin/activate
```

## Key Commands

### Full GPT-2 Speedrun (8×H100 node, ~2–3 hours, ~$48)

```bash
# Run the reference pipeline: data download, pretraining, SFT, eval, chat
bash runs/speedrun.sh
```

### Pretraining (distributed)

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=26 \
    --run="d26_run" \
    --model-tag="d26"
```

### Pretraining (single GPU)

```bash
python -m scripts.base_train -- \
    --depth=26 \
    --run="d26_single"
```

### Quick Research Iteration (~5 min, GPT-1 scale)

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12_exp" \
    --model-tag="d12" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1
```

### CPU / Apple Silicon (tiny model, ~minutes)

```bash
bash runs/runcpu.sh
```

### Serve Chat UI

```bash
# After training completes
source .venv/bin/activate
python -m scripts.chat_web
# Visit http://<your-server-ip>:8000/
```

### CLI Chat

```bash
python -m scripts.chat_cli -p "hello"
```

### Scaling Laws / Miniseries

```bash
bash runs/scaling_laws.sh   # sweep depths for scaling law data
bash runs/miniseries.sh     # train full compute-optimal miniseries
```

## The Depth Dial

The single most important parameter. Everything else is derived automatically:

| `--depth` | Approximate model scale | Notes |
|-----------|-------------------------|-------|
| 6–8 | Tiny (toy) | CPU/MPS feasible |
| 12 | GPT-1 size | ~5 min on 8×H100, great for research iteration |
| 16 | Medium | ~15 min on 8×H100 |
| 24–26 | GPT-2 size | ~2 hrs on 8×H100, ~$48 |

```bash
# Smaller/faster experiments
python -m scripts.base_train -- --depth=12 --run="quick_test"

# Full GPT-2 grade
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --run="gpt2_repro"
```

## Precision / dtype Configuration

nanochat uses explicit dtype management via `COMPUTE_DTYPE` in `nanochat/common.py`. No `torch.amp.autocast`.

| Hardware | Default | Override |
|----------|---------|----------|
| CUDA SM 80+ (A100, H100) | `bfloat16` | `NANOCHAT_DTYPE=float32` |
| CUDA SM < 80 (V100, T4) | `float32` | `NANOCHAT_DTYPE=float16` |
| CPU / MPS | `float32` | - |

```bash
# Force fp32 for inference
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"

# Force bf16 for training
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train

# float16 training (enables GradScaler automatically)
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train
```

**How it works:** Weights stored in