Huggingface Llm Trainer

Converting model artifacts is core product engineering that sits after training and before local or edge deployment. It is backend ML pipeline work (installing build tools, quantizing weights), not frontend or marketing work.

Where it fits

Example uses

- Compile llama.cpp and emit GGUF right after a TRL fine-tune completes on HF Jobs.
- Ship a quantized bundle that indie users can pull into Ollama without a GPU server.
- Refresh quantization presets when serving smaller footprints on edge hardware.

How it compares

A conversion runbook packaged as a skill, rather than only a click-through of a managed Hugging Face training UI.

Common Questions / FAQ

Who is huggingface-llm-trainer for?

Solo and indie ML builders who train with TRL on Hugging Face Jobs and need GGUF for local inference ecosystems.

When should I use huggingface-llm-trainer?

Use it in Build when converting checkpoints to GGUF, and in Ship when packaging quantized models for edge or offline deployment with Ollama or llama.cpp.

Is huggingface-llm-trainer safe to install?

Check the Security Audits panel on this page. The workflow installs system packages from the shell and clones third-party repositories, so review the scripts before running them on production machines.

SKILL.md

# GGUF Conversion Guide

After training models with TRL on Hugging Face Jobs, convert them to **GGUF format** for use with llama.cpp, Ollama, LM Studio, and other local inference tools.

**This guide provides production-ready, tested code based on successful conversions.** All critical dependencies and build steps are included.

## What is GGUF?

**GGUF** (GPT-Generated Unified Format):
- Optimized format for CPU/GPU inference with llama.cpp
- Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
- Compatible with: Ollama, LM Studio, Jan, GPT4All, llama.cpp
- Typically 2-8GB for 7B models (vs ~14GB unquantized at 16-bit precision)
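
The size figures above follow from simple arithmetic: parameter count times bits per weight. A rough sketch, assuming nominal bit widths and ignoring file metadata and the per-block scale factors that real quantization types add (the function name `estimate_size_gb` is illustrative, not part of any library):

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Assumption: every weight uses exactly `bits_per_weight` bits; real GGUF
    files are somewhat larger because of metadata and quantization scales.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Nominal sizes for a 7B-parameter model at the bit widths listed above.
sizes = {bits: estimate_size_gb(7e9, bits) for bits in (16, 8, 5, 4)}
```

Under these assumptions a 7B model is about 14 GB at 16 bits and about 3.5 GB at 4 bits, which is consistent with the range quoted in the list.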

## When to Convert to GGUF

**Convert when:**
- Running models locally with Ollama or LM Studio
- Using CPU-optimized inference
- Reducing model size with quantization
- Deploying to edge devices
- Sharing models for local-first use

## Critical Success Factors

Based on production testing, these are **essential** for reliable conversion:

### 1. ✅ Install Build Tools FIRST
**Before cloning llama.cpp**, install build dependencies:
```python
import subprocess

subprocess.run(["apt-get", "update", "-qq"], check=True, capture_output=True)
subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True, capture_output=True)
```

**Why:** The quantization tool is compiled from source, which requires gcc and cmake. Install them up front so the build never starts without them.
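
One way to fail fast is to confirm the tools are on `PATH` before starting the build. A minimal sketch using the standard library's `shutil.which`; the helper name `missing_build_tools` and the default tool list are assumptions for illustration, not part of the guide's script:

```python
import shutil


def missing_build_tools(tools=("gcc", "g++", "cmake"), which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without a real toolchain.
    """
    return [tool for tool in tools if which(tool) is None]
```

A script could call `missing_build_tools()` right after the `apt-get` step and raise an error naming whatever is still missing, instead of discovering the problem partway through the CMake build.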

### 2. ✅ Use CMake (Not Make)
**Build the quantize tool with CMake:**
```python
import os
import subprocess

# Create build directory (assumes llama.cpp is already cloned to /tmp/llama.cpp)
os.makedirs("/tmp/llama.cpp/build", exist_ok=True)

# Configure
subprocess.run([
    "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
    "-DGGML_CUDA=OFF"  # Faster build, CUDA not needed for quantization
], check=True, capture_output=True, text=True)

# Build
subprocess.run([
    "cmake", "--build", "/tmp/llama.cpp/build",
    "--target", "llama-quantize", "-j", "4"
], check=True, capture_output=True, text=True)

# Binary path
quantize_bin = "/tmp/llama.cpp/build/bin/llama-quantize"
```

**Why:** CMake is more reliable than `make` and produces consistent binary paths.
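
Combining `check=True` with `capture_output=True` means a failed build raises `CalledProcessError` while the compiler output stays hidden in the exception's attributes. A hedged sketch of a small wrapper that prints the captured output before re-raising; `run_logged` is a hypothetical helper name, not something the guide's script is known to define:

```python
import subprocess
import sys


def run_logged(cmd):
    """Run `cmd`; on failure, print captured stdout/stderr, then re-raise."""
    try:
        return subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as err:
        # The captured streams live on the exception object, not in the traceback.
        print(err.stdout, file=sys.stderr)
        print(err.stderr, file=sys.stderr)
        raise
```

Swapping `subprocess.run([...], check=True, capture_output=True, text=True)` for `run_logged([...])` in the configure and build steps would surface CMake or compiler errors directly in the job logs.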

### 3. ✅ Include All Dependencies
**PEP 723 header must include:**
```python
# /// script
# dependencies = [
#     "transformers>=4.36.0",
#     "peft>=0.7.0",
#     "torch>=2.0.0",
#     "accelerate>=0.24.0",
#     "huggingface_hub>=0.20.0",
#     "sentencepiece>=0.1.99",  # Required for tokenizer
#     "protobuf>=3.20.0",        # Required for tokenizer
#     "numpy",
#     "gguf",
# ]
# ///
```

**Why:** `sentencepiece` and `protobuf` are critical for tokenizer conversion. Missing them causes silent failures.
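
Because the failure mode is silent, it can help to check that the tokenizer dependencies are importable before conversion starts. A minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_modules` is an assumption, and note that the `protobuf` package is imported as `google.protobuf`:

```python
import importlib.util


def missing_modules(module_names):
    """Return the import names from `module_names` that cannot be located."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package is absent, e.g. "google" for "google.protobuf".
            found = False
        if not found:
            missing.append(name)
    return missing
```

For example, `missing_modules(["sentencepiece", "google.protobuf"])` returns an empty list only when both packages are installed in the current environment.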

### 4. ✅ Verify Names Before Use
**Always verify repos exist:**
```python
# Before submitting job, verify:
hub_repo_details([ADAPTER_MODEL], repo_type="model")
hub_repo_details([BASE_MODEL], repo_type="model")
```

**Why:** Non-existent dataset/model names cause job failures that could be caught in seconds.
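
The same check can be expressed inside a plain Python script. The sketch below is deliberately offline: `find_missing_repos` and `fake_exists` are hypothetical names, and the lookup is injected as a callable so the example runs without network access. In a real script the callable could wrap a Hub lookup such as `HfApi().repo_exists` from `huggingface_hub`, assuming the installed version provides it.

```python
def find_missing_repos(repo_ids, exists):
    """Return the repo IDs for which the `exists` callable reports False."""
    return [repo_id for repo_id in repo_ids if not exists(repo_id)]


def fake_exists(repo_id):
    # Stand-in for a real Hub lookup so the sketch runs offline.
    return repo_id in {"Qwen/Qwen2.5-0.5B"}


missing = find_missing_repos(
    ["username/my-finetuned-model", "Qwen/Qwen2.5-0.5B"], fake_exists
)
```

Here `missing` contains only the placeholder adapter repo, which is the kind of result that should stop a job from being submitted.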

## Complete Conversion Script

See `scripts/convert_to_gguf.py` for the complete, production-ready script.

**Key features:**
- ✅ All dependencies in PEP 723 header
- ✅ Build tools installed automatically
- ✅ CMake build process (reliable)
- ✅ Comprehensive error handling
- ✅ Environment variable configuration
- ✅ Automatic README generation

## Quick Conversion Job

```python
# Before submitting: VERIFY MODELS EXIST
hub_repo_details(["username/my-finetuned-model"], repo_type="model")
hub_repo_details(["Qwen/Qwen2.5-0.5B"], repo_type="model")

# Submit conversion job
hf_jobs("uv", {
    "script": open("trl/scripts/convert_to_gguf.py").read(),  # Or inline the script
    "flavor": "a10g-large",
    "timeout": "45m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
    "env": {
        "ADAPTER_MODEL": "username/my-finetuned-model",
        "BASE_MODEL": "Qwen/Qwen2.5-0.5B",
        "OUTPUT_REPO": "username/my-model-gguf",
        "HF_USERNAME": "username"  # Optional, for README
    }
})
```
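
The `env` block above is how the job passes configuration to the script. How `convert_to_gguf.py` actually reads these values is not shown here, so the following is only a sketch of one reasonable approach: it assumes `ADAPTER_MODEL`, `BASE_MODEL`, and `OUTPUT_REPO` are required and `HF_USERNAME` is optional, and the name `load_config` is illustrative.

```python
import os

# Assumption: these three are required; HF_USERNAME is optional per the job example.
REQUIRED_VARS = ("ADAPTER_MODEL", "BASE_MODEL", "OUTPUT_REPO")


def load_config(env=None):
    """Read conversion settings from environment variables, failing early if any are unset."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_VARS if not env.get(key)]
    if missing:
        raise ValueError("Missing required environment variables: " + ", ".join(missing))
    config = {key: env[key] for key in REQUIRED_VARS}
    config["HF_USERNAME"] = env.get("HF_USERNAME", "")
    return config
```

Validating everything at the top of the script means a misconfigured job fails in seconds rather than after the model download and build steps.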

## Conversion Process

The script performs these steps:

1. **Load and Merge** - Load base model and LoRA ad

What is this skill?

Production-oriented GGUF pipeline after TRL training on Hugging Face Jobs

Install build-essential and cmake before cloning llama.cpp

Build quantization with CMake (not legacy Make-only flow)

Supports 4-bit, 5-bit, and 8-bit quantization targets

Targets Ollama, LM Studio, Jan, GPT4All, and llama.cpp consumers

Typical ~2–8GB size for 7B models vs ~14GB unquantized (per guide)

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 926 installs on skills.sh; 10.6k GitHub stars; 2/3 security scanners passed (skills.sh audits).
