Nanochat Llm Training

Name: Nanochat Llm Training
Author: aradotso

aradotso/trending-skills

1.3k installs
66 repo stars
Updated July 9, 2026

nanochat-llm-training is an agent skill for training GPT-2-level LLMs with Karpathy's nanochat, covering pretraining, finetuning, evaluation, and the chat UI.

About

The nanochat-llm-training skill documents nanochat, Karpathy's minimal harness for end-to-end LLM training on a single GPU node. nanochat is cloned from github.com/karpathy/nanochat and installed with uv sync; its speedrun.sh script reproduces GPT-2 capability for roughly forty-eight dollars on eight H100 GPUs in about two hours. A single depth dial auto-configures width, heads, learning rate, training horizon, and weight decay for compute-optimal runs. The documented commands cover distributed torchrun pretraining, single-GPU base_train, quick depth-12 research iterations, the CPU script runcpu.sh, the chat_web UI on port 8000, and chat_cli prompts. Two additional scripts, scaling_laws.sh and miniseries.sh, run depth sweeps. Evaluation uses the DCLM CORE score, and inference uses a KV cache behind a ChatGPT-like web UI. Use this skill when developers train custom LLMs with nanochat, configure depth hyperparameters, or serve a trained chat model.

uv-based install and speedrun.sh pipeline reproducing GPT-2 for about forty-eight dollars on 8xH100.
Single --depth parameter auto-configures model width, LR, and training horizon.
Scripts for distributed pretrain, single GPU, CPU tiny models, chat_web, and chat_cli.
Covers tokenization, pretraining, SFT, RL, DCLM CORE evaluation, and KV cache inference.
scaling_laws.sh and miniseries.sh for depth sweeps and compute-optimal series.

Nanochat Llm Training by the numbers

1,304 all-time installs (skills.sh)
+10 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #865 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: CRITICAL risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

nanochat-llm-training capabilities & compatibility

GPU node required; speedrun ~$48 on 8xH100

Capabilities: uv install and speedrun pipeline · depth dial compute optimal hyperparameters · distributed and single gpu pretraining commands · chat web ui and cli inference serving · scaling laws and miniseries sweep scripts
Use cases: research · orchestration

From the docs

What nanochat-llm-training says it does

nanochat is Karpathy's minimal, hackable harness for training LLMs end-to-end on a single GPU node.

SKILL.md

A single complexity dial (`--depth`) auto-configures all other hyperparameters

SKILL.md

npx skills add https://github.com/aradotso/trending-skills --skill nanochat-llm-training


Security audit	1 / 3 scanners passed

How do I run nanochat pretraining, reproduce GPT-2 on a GPU node, or serve my trained chat model?

Train GPT-2-level LLMs with Karpathy's nanochat, covering tokenization, pretraining, finetuning, eval, and chat UI on GPU nodes.

Who is it for?

Developers running nanochat LLM experiments, speedruns, or finetuning on single or multi-GPU nodes.

Skip if: you need production LLM serving at scale and are not working with the nanochat training codebase.

When should I use this skill?

When the user trains with nanochat, runs speedrun.sh, configures depth hyperparameters, or opens chat_web after training.

What you get

Configured nanochat environment with depth-based training commands, speedrun pipeline, and chat web or CLI serving.

trained model checkpoint
evaluation results
chat UI session

By the numbers

Targets GPT-2-level training for under $100 on a single GPU node
Covers six nanochat stages from tokenization through chat UI

Files

SKILL.md

nanochat LLM Training

Skill by ara.so — Daily 2026 Skills collection.

nanochat is Karpathy's minimal, hackable harness for training LLMs end-to-end on a single GPU node. It covers tokenization, pretraining, SFT finetuning, RL, evaluation (DCLM CORE score), inference with KV cache, and a ChatGPT-like web UI. A single complexity dial (--depth) auto-configures all other hyperparameters (width, heads, LR, training horizon, weight decay) for compute-optimal training. You can reproduce GPT-2 capability (which cost ~$43,000 to train in 2019) for ~$48 on an 8×H100 node in ~2 hours.
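The cost figures above can be sanity-checked with a little arithmetic. The sketch below uses only the numbers quoted in this section; the per-GPU-hour rate it prints is an implied value derived from them, not a quoted cloud price.

```python
# Back-of-the-envelope check of the speedrun cost quoted above.
NUM_GPUS = 8                   # one 8xH100 node
WALL_CLOCK_HOURS = 2.0         # approximate speedrun duration
SPEEDRUN_COST_USD = 48.0       # approximate speedrun cost
GPT2_COST_2019_USD = 43_000.0  # approximate 2019 training cost

gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS
implied_rate = SPEEDRUN_COST_USD / gpu_hours
cost_reduction = GPT2_COST_2019_USD / SPEEDRUN_COST_USD

print(f"GPU-hours used: {gpu_hours:.0f}")
print(f"Implied price per GPU-hour: ${implied_rate:.2f}")
print(f"Cost reduction vs. 2019: ~{cost_reduction:.0f}x")
```

Because the inputs are rounded, the outputs should be read as order-of-magnitude estimates.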

Installation

nanochat uses uv for dependency management:

git clone https://github.com/karpathy/nanochat.git
cd nanochat
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create venv and install deps
uv sync
source .venv/bin/activate

Key Commands

Full GPT-2 Speedrun (8×H100 node, ~2 hours, ~$48)

# Run the reference pipeline: data download, pretraining, SFT, eval, chat
bash runs/speedrun.sh

Pretraining (distributed)

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=26 \
    --run="d26_run" \
    --model-tag="d26"

Pretraining (single GPU)

python -m scripts.base_train -- \
    --depth=26 \
    --run="d26_single"

Quick Research Iteration (~5 min, GPT-1 scale)

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12_exp" \
    --model-tag="d12" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1

CPU / Apple Silicon (tiny model, ~minutes)

bash runs/runcpu.sh

Serve Chat UI

# After training completes
source .venv/bin/activate
python -m scripts.chat_web
# Visit http://<your-server-ip>:8000/

CLI Chat

python -m scripts.chat_cli -p "hello"

Scaling Laws / Miniseries

bash runs/scaling_laws.sh   # sweep depths for scaling law data
bash runs/miniseries.sh     # train full compute-optimal miniseries

The Depth Dial

--depth is the single most important parameter. Everything else is derived from it automatically:

`--depth`	Approximate model scale	Notes
6–8	Tiny (toy)	CPU/MPS feasible
12	GPT-1 size	~5 min on 8×H100, great for research iteration
16	Medium	~15 min on 8×H100
24–26	GPT-2 size	~2 hrs on 8×H100, ~$48

# Smaller/faster experiments
python -m scripts.base_train -- --depth=12 --run="quick_test"

# Full GPT-2 grade
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --run="gpt2_repro"
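To make the idea of a single dial concrete, here is a minimal sketch of how width and head count could be derived from depth. The function name derive_shape, the aspect ratio of 64, and the head dimension of 128 are illustrative assumptions, not values confirmed by this document; check scripts/base_train.py for the rules nanochat actually applies, including how it sets LR, training horizon, and weight decay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    depth: int      # number of Transformer layers (the dial)
    width: int      # model (embedding) dimension
    num_heads: int  # attention heads


def derive_shape(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> ModelShape:
    """Hypothetical helper: derive width and heads from depth.

    aspect_ratio and head_dim are assumed constants for illustration only.
    """
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    width = depth * aspect_ratio
    # Ceiling division, with at least one head for very small models.
    # A real implementation must also ensure width is divisible by num_heads.
    num_heads = max(1, -(-width // head_dim))
    return ModelShape(depth=depth, width=width, num_heads=num_heads)


for d in (8, 12, 16, 26):
    print(derive_shape(d))
```

The point of the design is that a sweep only varies one integer, so runs at different scales stay comparable.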

Precision / dtype Configuration

nanochat manages dtypes explicitly through COMPUTE_DTYPE in nanochat/common.py; it does not use torch.amp.autocast.

Hardware	Default	Override
CUDA SM 80+ (A100, H100)	`bfloat16`	`NANOCHAT_DTYPE=float32`
CUDA SM < 80 (V100, T4)	`float32`	`NANOCHAT_DTYPE=float16`
CPU / MPS	`float32`	—

# Force fp32 for inference
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"

# Force bf16 for training
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train

# float16 training (enables GradScaler automatically)
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train

How it works: weights are stored in fp32 (for optimizer precision), a custom Linear layer casts them to COMPUTE_DTYPE in the forward pass, and embeddings are stored directly in COMPUTE_DTYPE to save memory.
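The default-and-override rules in the table above can be summarized as a small decision function. This is a sketch of the documented behavior, not the code in nanochat/common.py; the function name pick_compute_dtype and its parameters are hypothetical, and it returns dtype names as strings so it runs without PyTorch installed.

```python
import os
from typing import Mapping, Optional

VALID_DTYPES = {"float32", "bfloat16", "float16"}


def pick_compute_dtype(
    device_type: str,
    sm_major: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper mirroring the dtype table above.

    device_type: "cuda", "mps", or "cpu".
    sm_major: CUDA compute capability major version (8 for A100, 9 for H100).
    """
    override = env.get("NANOCHAT_DTYPE")
    if override:
        if override not in VALID_DTYPES:
            raise ValueError(f"unsupported NANOCHAT_DTYPE: {override!r}")
        return override
    # bfloat16 is the default only on CUDA GPUs with SM 80 or newer
    if device_type == "cuda" and sm_major is not None and sm_major >= 8:
        return "bfloat16"
    # Older CUDA GPUs, CPU, and MPS fall back to float32
    return "float32"


print(pick_compute_dtype("cuda", sm_major=9, env={}))
print(pick_compute_dtype("cuda", sm_major=7, env={"NANOCHAT_DTYPE": "float16"}))
print(pick_compute_dtype("mps", env={}))
```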

Key Python Modules

nanochat/
├── gpt.py              # GPT nn.Module Transformer
├── engine.py           # Inference with KV Cache
├── dataloader.py       # Tokenizing Distributed Data Loader
├── dataset.py          # Download/read utils for pretraining data
├── optim.py            # AdamW + Muon optimizer (1GPU and distributed)
├── core_eval.py        # DCLM CORE score evaluation
├── loss_eval.py        # Bits-per-byte evaluation
├── checkpoint_manager.py  # Save/Load checkpoints
├── common.py           # Utilities, COMPUTE_DTYPE
└── execution.py        # Python code execution tool for LLM

scripts/
├── base_train.py       # Pretraining entry point
├── chat_web.py         # Web chat UI server
└── chat_cli.py         # CLI chat interface

runs/
├── speedrun.sh         # Reference full pipeline (GPT-2 speedrun)
├── scaling_laws.sh     # Scaling law sweeps
├── miniseries.sh       # Full compute-optimal miniseries
└── runcpu.sh           # CPU/MPS example

Real Code Examples

Load and Run Inference on a Trained Model

import torch
from nanochat.gpt import GPT
from nanochat.engine import InferenceEngine
from nanochat.checkpoint_manager import CheckpointManager

# Load checkpoint
ckpt_manager = CheckpointManager("checkpoints/d26")
model, config = ckpt_manager.load()
model.eval()

# Run inference with KV cache
engine = InferenceEngine(model)
output = engine.generate(
    prompt="Once upon a time",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
)
print(output)
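For readers unfamiliar with the KV cache that the inference engine relies on, the following toy sketch shows the idea in pure Python. It is not nanochat's engine.py: the helpers project_kv, attend, decode_without_cache, and decode_with_cache are hypothetical, and the "projections" are simple stand-ins for learned weight matrices. It compares decoding that recomputes keys and values for the whole prefix at every step with decoding that caches them.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
stats: Dict[str, int] = {"projections": 0}  # counts key/value computations


def project_kv(x: Vector) -> Tuple[Vector, Vector]:
    """Toy stand-in for the learned key and value projections."""
    stats["projections"] += 1
    key = [2.0 * v for v in x]
    value = [v + 1.0 for v in x]
    return key, value


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [
        sum(w * value[i] for w, value in zip(weights, values)) / norm
        for i in range(len(values[0]))
    ]


def decode_without_cache(tokens: List[Vector]) -> List[Vector]:
    """Recompute keys and values for the whole prefix at every step."""
    outputs = []
    for t in range(len(tokens)):
        kv = [project_kv(x) for x in tokens[: t + 1]]
        outputs.append(attend(tokens[t], [k for k, _ in kv], [v for _, v in kv]))
    return outputs


def decode_with_cache(tokens: List[Vector]) -> List[Vector]:
    """Compute each token's key and value once and keep them in a cache."""
    keys: List[Vector] = []
    values: List[Vector] = []
    outputs = []
    for x in tokens:
        k, v = project_kv(x)
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs


tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]

stats["projections"] = 0
slow = decode_without_cache(tokens)
calls_without_cache = stats["projections"]

stats["projections"] = 0
fast = decode_with_cache(tokens)
calls_with_cache = stats["projections"]

print("projection calls without cache:", calls_without_cache)
print("projection calls with cache:", calls_with_cache)
```

Both functions produce the same attention outputs; the cached version does work proportional to the sequence length instead of its square, which is why generation is served through a KV-cache engine.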

Custom Training Script with Depth Dial

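import os
import subprocess

def train_model(depth: int, run_name: str, nproc: int = 8):
    """Launch a compute-optimal training run for the given depth."""
    cmd = [
        "torchrun",
        "--standalone",
        f"--nproc_per_node={nproc}",
        "-m", "scripts.base_train",
        "--",
        f"--depth={depth}",
        f"--run={run_name}",
        f"--model-tag={run_name}",
    ]
    # Copy the current environment, then force OMP_NUM_THREADS=1 so the
    # override wins even if the variable is already set in the shell.
    env = {**os.environ, "OMP_NUM_THREADS": "1"}
    # check=True raises CalledProcessError if training exits with an error
    subprocess.run(cmd, env=env, check=True)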

# Quick research iteration
train_model(depth=12, run_name="my_experiment_d12")

# Full GPT-2 grade
train_model(depth=26, run_name="my_gpt2_repro")

Adjust Device Batch Size for Lower VRAM

# Default device_batch_size=32 needs ~80GB VRAM per GPU
# Reduce for smaller GPUs (gradient accumulation handles the rest)
torchrun --standalone --nproc_per_node=4 -m scripts.base_train -- \
    --depth=12 \
    --device-batch-size=16 \
    --run="low_vram_run"

# Even smaller
python -m scripts.base_train -- \
    --depth=8 \
    --device-batch-size=4 \
    --run="single_gpu_small"

Monitoring Key Metrics in wandb

# nanochat logs to wandb automatically. Key metrics to watch:
# - val_bpb: validation loss in bits-per-byte (vocab-size-invariant),
#   plotted against step, total_training_time, and total_training_flops
# - core_metric: DCLM CORE score (target > 0.2565 to beat GPT-2)
# - train/mfu: model FLOPs utilization
# - train/tok_per_sec: training throughput

# Set the wandb project via an environment variable before training starts.
# Setting it in Python only affects this process and the children it launches.
import os
os.environ["WANDB_PROJECT"] = "my-nanochat-runs"
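The bits-per-byte metric above is vocab-size-invariant because it normalizes by the bytes of text rather than by the number of tokens. A minimal sketch of the conversion follows, assuming the cross-entropy loss is summed in nats over all evaluated tokens; the function name bits_per_byte and the example numbers are illustrative and are not taken from nanochat/loss_eval.py or from a real run.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) into bits per byte of text.

    total_loss_nats: sum of per-token cross-entropy losses, in nats.
    total_bytes: number of UTF-8 bytes covered by those tokens.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits
    return total_loss_nats / (math.log(2) * total_bytes)


# Illustrative numbers only: 1,000 tokens at a mean loss of 3.0 nats each,
# covering 4,500 bytes of text.
example = bits_per_byte(total_loss_nats=3.0 * 1000, total_bytes=4500)
print(f"{example:.4f} bits per byte")
```

Two tokenizers that split the same text differently produce different per-token losses, but their bits-per-byte values remain directly comparable.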

Synthetic Data for SFT Personality

# dev/gen_synthetic_data.py — generate identity/personality data
# Then mix into SFT stage per the guide:
# https://github.com/karpathy/nanochat/discussions/139

# Example: generate data and point SFT to it
python dev/gen_synthetic_data.py --output data/identity_sft.jsonl
# Then reference in your SFT script configuration

Common Patterns

Research Iteration Loop

# 1. Make a code change in nanochat/
# 2. Run quick d12 to validate
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 --run="test_my_change" \
    --core-metric-every=999999 --sample-every=-1 --save-every=-1
# 3. Check wandb: val_bpb vs step/time/flops
# 4. If promising, test at d16 or d26

FP8 Training (H100 only, for speedrun)

# FP8 is used in the speedrun for additional speedup
# See runs/speedrun.sh for the exact invocation
bash runs/speedrun.sh

Evaluate CORE Score Only

python -m nanochat.core_eval --checkpoint checkpoints/d26/latest

Serve on Lambda / Remote Machine

# On the remote machine, after training:
# Start a `screen` (or `tmux`) session first so the server survives SSH disconnects
screen -S nanochat
source .venv/bin/activate
python -m scripts.chat_web
# Ctrl+A, D to detach
# Access via: http://<PUBLIC_IP>:8000/

Troubleshooting

OOM / Out of VRAM

# Reduce --device-batch-size (default 32)
# Code uses gradient accumulation to maintain effective batch size
--device-batch-size=16   # Try 16, 8, 4, 2, 1

Single GPU is 8× Slower

This is expected. Omit torchrun and use python -m scripts.base_train directly. Gradient accumulation kicks in automatically to maintain equivalent total batch size.
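The relationship between GPU count, device batch size, and gradient accumulation can be sketched as follows. Only the default device batch size of 32 comes from this document; the total batch size of 524,288 tokens and the sequence length of 2,048 are assumed values chosen for illustration, and grad_accum_steps is a hypothetical helper rather than nanochat's own function.

```python
def grad_accum_steps(
    total_batch_tokens: int,
    device_batch_size: int,
    seq_len: int,
    world_size: int,
) -> int:
    """Number of micro-batches each GPU accumulates per optimizer step."""
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError(
            "total batch size must be a multiple of "
            "device_batch_size * seq_len * world_size"
        )
    return total_batch_tokens // tokens_per_micro_step


TOTAL_BATCH_TOKENS = 524_288  # assumed for illustration
SEQ_LEN = 2_048               # assumed for illustration

for gpus, dbs in [(8, 32), (1, 32), (4, 16), (1, 4)]:
    steps = grad_accum_steps(TOTAL_BATCH_TOKENS, dbs, SEQ_LEN, gpus)
    print(f"{gpus} GPU(s), device batch size {dbs}: {steps} accumulation step(s)")
```

Under these assumptions a single GPU runs eight micro-batches for every one that an eight-GPU node runs, which matches the 8× slowdown while keeping the effective batch size unchanged.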

Running on Non-CUDA Hardware

# MPS (Apple Silicon) or CPU — use runcpu.sh as template
bash runs/runcpu.sh
# Results will be weak; this is for development/debugging only

float16 Gradient Underflow

# nanochat auto-enables GradScaler when NANOCHAT_DTYPE=float16
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12
# Note: RL scripts do NOT support float16 (SFT and base_train do)

V100 / T4 (SM < 80) — No bf16

# Default falls back to float32; optionally use float16
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12

Chat UI Not Accessible

# Ensure the port (default 8000) is open in your cloud provider's firewall/security group
# Use the public IP, not localhost:
# http://<PUBLIC_IP>:8000/
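To tell a firewall problem apart from a server that is not running, a TCP reachability check from your own machine can help. This is a generic sketch using Python's standard socket module; port_is_open is a hypothetical helper and is not part of nanochat.

```python
import socket


def port_is_open(host: str, port: int = 8000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


# Run on the server itself: checks whether chat_web is listening locally
print("local server reachable:", port_is_open("127.0.0.1", 8000))
```

If the check succeeds on the server but fails from your laptop against the public IP, the port is most likely blocked by the firewall or security group.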

Resources

DeepWiki Q&A: https://deepwiki.com/karpathy/nanochat
Discussions: https://github.com/karpathy/nanochat/discussions
Discord: #nanochat channel on Karpathy's Discord
Leaderboard docs: dev/LEADERBOARD.md
Beating GPT-2 guide: https://github.com/karpathy/nanochat/discussions/481
Miniseries v1: https://github.com/karpathy/nanochat/discussions/420
Adding abilities guide: https://github.com/karpathy/nanochat/discussions/164


How it compares

Use nanochat-llm-training for hands-on GPT-2-scale training; use OpenViking or agent-memory skills when the goal is persistent agent context, not model weights.

FAQ

What does nanochat-llm-training cover?

Tokenization, pretraining, finetuning, evaluation, inference with KV cache, and a ChatGPT-like web UI via nanochat scripts.

When should I use nanochat-llm-training?

When setting up or extending Karpathy nanochat training, speedruns, or chat serving on GPU or CPU nodes.

Is nanochat-llm-training safe to install?

Review the Security Audits panel on this page before installing in production.
