
Hugging Face LLM Trainer
Convert TRL-trained Hugging Face Job checkpoints into quantized GGUF files for Ollama, LM Studio, and llama.cpp.
Overview
huggingface-llm-trainer is an agent skill, used mainly in the Build phase and also in Ship, that converts TRL-trained models to GGUF and quantizes them with llama.cpp for local inference tools.
Install
npx skills add https://github.com/huggingface/skills --skill huggingface-llm-trainer
What is this skill?
- Production-oriented GGUF pipeline after TRL training on Hugging Face Jobs
- Install build-essential and cmake before cloning llama.cpp
- Build quantization with CMake (not legacy Make-only flow)
- Supports 4-bit, 5-bit, and 8-bit quantization targets
- Targets Ollama, LM Studio, Jan, GPT4All, and llama.cpp consumers
- Typical ~2–8GB size for 7B models vs ~14GB unquantized (per guide)
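The size figures in the last bullet follow from simple bits-per-weight arithmetic. The Python sketch below is illustrative and not taken from the guide: `estimate_size_gb` is a hypothetical helper, it assumes exactly 7 billion parameters and nominal bit widths, and it ignores the per-block scale metadata that real GGUF quantization types add, so actual files will be somewhat larger.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS_7B = 7e9  # assumption: a "7B" model has exactly 7 billion weights

# Nominal bit widths only; real quantization formats carry extra metadata.
for label, bits in [("fp16 (unquantized)", 16), ("8-bit", 8), ("5-bit", 5), ("4-bit", 4)]:
    print(f"{label:>20}: ~{estimate_size_gb(N_PARAMS_7B, bits):.1f} GB")
```

Under these assumptions the 16-bit estimate matches the ~14GB unquantized figure, and the 4-bit to 8-bit estimates fall inside the quoted ~2–8GB range.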
Adoption & trust: 926 installs on skills.sh; 10.6k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You finished HF Jobs training but lack a reliable, dependency-correct path to GGUF for Ollama or LM Studio.
Who is it for?
Builders shipping fine-tuned LLMs who want local Ollama/LM Studio distribution after cloud training.
Skip if: your team only needs hosted Hugging Face Inference Endpoints and has no local GGUF requirement.
When should I use this skill?
After training with TRL on Hugging Face Jobs when you need GGUF for llama.cpp, Ollama, LM Studio, or edge deployment.
What do I get? / Deliverables
You produce quantized GGUF artifacts and can run them locally with llama.cpp-compatible runners.
- GGUF model files
- Quantized weights for local inference runtimes
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Model artifact conversion is core product engineering that happens after training and before local or edge deployment. It is backend ML pipeline work (building tools, quantizing weights), not frontend or marketing.
Where it fits
Compile llama.cpp and emit GGUF right after a TRL fine-tune completes on HF Jobs.
Ship a quantized bundle indie users can pull into Ollama without a GPU server.
Refresh quantization presets when serving smaller footprints on edge hardware.
How it compares
A skill-packaged conversion runbook rather than a click-through in a managed Hugging Face training UI.
Common Questions / FAQ
Who is huggingface-llm-trainer for?
Solo and indie ML builders who train with TRL on Hugging Face Jobs and need GGUF for local inference ecosystems.
When should I use huggingface-llm-trainer?
Use it in Build when converting checkpoints to GGUF, and in Ship when packaging quantized models for edge or offline deploy with Ollama or llama.cpp.
Is huggingface-llm-trainer safe to install?
Check the Security Audits panel on this page. The workflow implies shell package installs and cloning third-party repos, so review the scripts before running them on production machines.
SKILL.md
# GGUF Conversion Guide

After training models with TRL on Hugging Face Jobs, convert them to **GGUF format** for use with llama.cpp, Ollama, LM Studio, and other local inference tools.

**This guide provides production-ready, tested code based on successful conversions.** All critical dependencies and build steps are included.

## What is GGUF?

**GGUF** (GPT-Generated Unified Format):
- Optimized format for CPU/GPU inference with llama.cpp
- Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
- Compatible with: Ollama, LM Studio, Jan, GPT4All, llama.cpp
- Typically 2-8GB for 7B models (vs 14GB unquantized)

## When to Convert to GGUF

**Convert when:**
- Running models locally with Ollama or LM Studio
- Using CPU-optimized inference
- Reducing model size with quantization
- Deploying to edge devices
- Sharing models for local-first use

## Critical Success Factors

Based on production testing, these are **essential** for reliable conversion:

### 1. ✅ Install Build Tools FIRST

**Before cloning llama.cpp**, install build dependencies:

```python
import subprocess

# Refresh package lists, then install the compiler toolchain and CMake
subprocess.run(["apt-get", "update", "-qq"], check=True, capture_output=True)
subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True, capture_output=True)
```

**Why:** Building the quantization tool requires gcc and cmake. Installing after cloning doesn't help.

### 2. ✅ Use CMake (Not Make)

**Build the quantize tool with CMake:**

```python
import os
import subprocess

# Create build directory
os.makedirs("/tmp/llama.cpp/build", exist_ok=True)

# Configure
subprocess.run([
    "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
    "-DGGML_CUDA=OFF"  # Faster build, CUDA not needed for quantization
], check=True, capture_output=True, text=True)

# Build
subprocess.run([
    "cmake", "--build", "/tmp/llama.cpp/build",
    "--target", "llama-quantize", "-j", "4"
], check=True, capture_output=True, text=True)

# Binary path
quantize_bin = "/tmp/llama.cpp/build/bin/llama-quantize"
```

**Why:** CMake is more reliable than `make` and produces consistent binary paths.

### 3. ✅ Include All Dependencies

**PEP 723 header must include:**

```python
# /// script
# dependencies = [
#     "transformers>=4.36.0",
#     "peft>=0.7.0",
#     "torch>=2.0.0",
#     "accelerate>=0.24.0",
#     "huggingface_hub>=0.20.0",
#     "sentencepiece>=0.1.99",  # Required for tokenizer
#     "protobuf>=3.20.0",       # Required for tokenizer
#     "numpy",
#     "gguf",
# ]
# ///
```

**Why:** `sentencepiece` and `protobuf` are critical for tokenizer conversion. Missing them causes silent failures.

### 4. ✅ Verify Names Before Use

**Always verify repos exist:**

```python
# Before submitting job, verify:
hub_repo_details([ADAPTER_MODEL], repo_type="model")
hub_repo_details([BASE_MODEL], repo_type="model")
```

**Why:** Non-existent dataset/model names cause job failures that could be caught in seconds.

## Complete Conversion Script

See `scripts/convert_to_gguf.py` for the complete, production-ready script.

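As a rough illustration of environment-variable configuration for such a script, the sketch below reads the four variables that the Quick Conversion Job section passes under `env` (`ADAPTER_MODEL`, `BASE_MODEL`, `OUTPUT_REPO`, `HF_USERNAME`) and fails fast when a required one is missing. It is not the contents of `scripts/convert_to_gguf.py`: `ConversionConfig` and `load_config` are hypothetical names, and treating the first three variables as required is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ConversionConfig:
    """Hypothetical container for the job's environment settings."""
    adapter_model: str
    base_model: str
    output_repo: str
    hf_username: Optional[str] = None  # optional, only used for the README


def load_config(env: Mapping[str, str] = os.environ) -> ConversionConfig:
    """Read settings from `env`, raising early if a required name is unset."""
    required = ("ADAPTER_MODEL", "BASE_MODEL", "OUTPUT_REPO")  # assumed required
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required environment variables: {', '.join(missing)}")
    return ConversionConfig(
        adapter_model=env["ADAPTER_MODEL"],
        base_model=env["BASE_MODEL"],
        output_repo=env["OUTPUT_REPO"],
        hf_username=env.get("HF_USERNAME") or None,
    )


# Example with an explicit mapping instead of the real process environment
example = load_config({
    "ADAPTER_MODEL": "username/my-finetuned-model",
    "BASE_MODEL": "Qwen/Qwen2.5-0.5B",
    "OUTPUT_REPO": "username/my-model-gguf",
})
print(example)
```

Validating all names up front means a misconfigured job stops before any model download or build step starts.
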
**Key features:**
- ✅ All dependencies in PEP 723 header
- ✅ Build tools installed automatically
- ✅ CMake build process (reliable)
- ✅ Comprehensive error handling
- ✅ Environment variable configuration
- ✅ Automatic README generation

## Quick Conversion Job

```python
# Before submitting: VERIFY MODELS EXIST
hub_repo_details(["username/my-finetuned-model"], repo_type="model")
hub_repo_details(["Qwen/Qwen2.5-0.5B"], repo_type="model")

# Submit conversion job
hf_jobs("uv", {
    "script": open("trl/scripts/convert_to_gguf.py").read(),  # Or inline the script
    "flavor": "a10g-large",
    "timeout": "45m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
    "env": {
        "ADAPTER_MODEL": "username/my-finetuned-model",
        "BASE_MODEL": "Qwen/Qwen2.5-0.5B",
        "OUTPUT_REPO": "username/my-model-gguf",
        "HF_USERNAME": "username"  # Optional, for README
    }
})
```

## Conversion Process

The script performs these steps:

1. **Load and Merge** - Load base model and LoRA ad