
Knowledge Distillation
Distill a large teacher LLM into a smaller student using MiniLLM reverse-KL objectives instead of the mode-covering forward KL, which makes a small student overestimate regions the teacher rarely generates.
Overview
Knowledge Distillation is an agent skill most often used in Build (also Ship, Operate) that explains MiniLLM reverse KL distillation so student LLMs concentrate on the teacher's major generation modes instead of spreading probability over regions the teacher considers unlikely.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill knowledge-distillation
What is this skill?
- Contrasts forward KL (mode-covering) vs reverse KL (mode-seeking) for generative distillation
- MiniLLM formulation tied to arXiv 2306.08543 and Microsoft LMOps reference implementation
- Explains why standard forward KL distillation makes a small student overestimate low-probability regions of the teacher distribution
- Decision guide for when reverse KLD yields better generations from student models
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your distilled student model produces off-target or low-quality text because forward KL distillation pushed it to cover every teacher mode, including ones it lacks the capacity to model.
Who is it for?
Builders planning teacher–student LLM compression who care about output diversity and want the objective function choice explained before coding.
Skip if: You only need prompt-level shortening or routing to a smaller API model, with no custom distillation training.
When should I use this skill?
You are distilling an LLM teacher to a student and need to pick forward vs reverse KL—or implement MiniLLM-style reverse KLD—for generative quality.
What do I get? / Deliverables
You understand when and how to use reverse KLD (MiniLLM-style) in a distillation loss, so you can keep your smaller model on the teacher's main generation modes before you wire the trainer.
- Clear forward vs reverse KL decision rationale for your distillation run
- MiniLLM-aligned objective framing to implement in your trainer
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Model compression and student training are build-time engineering; the same framing also informs ship and operate cost decisions. Backend subphase fits training objectives, loss definitions, and distillation loops rather than frontend or go-to-market tasks.
Where it fits
Choose reverse KLD in your distillation script after forward KL produces student completions that drift into regions the teacher rates as unlikely on eval prompts.
Decide whether a smaller student can replace the teacher for an agent feature without unacceptable diversity loss.
Justify a distilled checkpoint for lower-latency deployment while documenting expected coverage vs the teacher.
Revisit distillation objectives when production logs show repetitive or safe-only generations from a compressed endpoint.
How it compares
Use for generative distillation objective design, not as a drop-in replacement for inference-only quantization or RAG retrieval tuning.
Common Questions / FAQ
Who is knowledge-distillation for?
Indie ML engineers and agent builders shrinking LLMs who already have a teacher model and training stack but want research-backed guidance on KL direction for generation quality.
When should I use knowledge-distillation?
In build while designing a distillation pipeline; in ship when evaluating whether a student meets quality bars; in operate when revisiting serving costs and noticing diversity regressions after compression.
Is knowledge-distillation safe to install?
Review the Security Audits panel on this Prism page and validate any linked training code or dependencies before running distillation jobs on proprietary data.
Workflow Chain
Then invoke: unsloth
SKILL.md - Knowledge Distillation
# MiniLLM: Reverse KL Divergence for LLM Distillation

Based on arXiv 2306.08543 (2024) - MiniLLM: Knowledge Distillation of Large Language Models

## Overview

**Source**: https://arxiv.org/abs/2306.08543
**GitHub**: https://github.com/microsoft/LMOps/tree/main/minillm

MiniLLM replaces forward KLD with reverse KLD for knowledge distillation, achieving better performance on generative language models.

## Problem with Standard KLD

### Forward KL Divergence (Standard)

**Formula**: `KL(Teacher || Student)`

**Minimization behavior**: Mode-covering (mean-seeking)

```
Student must place probability wherever the teacher does
→ A small student cannot fit every teacher mode exactly
→ Student spreads mass across modes and overestimates regions the teacher rarely generates
```

**Issue for generative models**: During free-running generation the student samples from those overestimated regions and produces text the teacher considers unlikely.

### Why Forward KL Fails for Generation

```python
# Teacher distribution (diverse)
teacher_probs = [0.3, 0.3, 0.2, 0.1, 0.1]  # Multiple valid options

# Forward KL minimization with a capacity-limited student (illustrative numbers)
# Student learns something flatter, e.g.: [0.25, 0.25, 0.2, 0.15, 0.15]
# Problem: options 4-5 get more mass than the teacher gives them (mode-covering)
```

## MiniLLM Solution: Reverse KLD

### Reverse KL Divergence

**Formula**: `KL(Student || Teacher)`

**Minimization behavior**: Mode-seeking

```
Student concentrates on the teacher's major modes
→ Student avoids regions the teacher rates as unlikely
→ Student may drop some minor teacher modes it cannot fit
```

### Mathematical Formulation

**Forward KL** (standard distillation):

```
L_forward = Σ p_teacher(x) log(p_teacher(x) / p_student(x))
          = E_{x~teacher} [log p_teacher(x) - log p_student(x)]
```

**Reverse KL** (MiniLLM):

```
L_reverse = Σ p_student(x) log(p_student(x) / p_teacher(x))
          = E_{x~student} [log p_student(x) - log p_teacher(x)]
```

**Key difference**: Forward KL takes the expectation over the teacher distribution; reverse KL takes it over the student distribution.
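To make the direction of each divergence concrete, here is a minimal numeric sketch. It follows the MiniLLM paper's convention, where forward KL is `KL(teacher || student)` and reverse KL is `KL(student || teacher)`. The three distributions and the helper names (`kl`, `forward_kl`, `reverse_kl`, `spread`, `peaked`) are hypothetical, chosen only for illustration and not taken from the reference implementation.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def forward_kl(teacher, student):
    # Expectation over the teacher distribution
    return kl(teacher, student)

def reverse_kl(teacher, student):
    # Expectation over the student distribution
    return kl(student, teacher)

# Hypothetical teacher with two strong modes and two unlikely tokens
teacher = [0.48, 0.02, 0.02, 0.48]

# Two candidate students a low-capacity model might end up with (illustrative)
spread = [0.25, 0.25, 0.25, 0.25]   # covers everything, overestimates the unlikely tokens
peaked = [0.94, 0.02, 0.02, 0.02]   # commits to one teacher mode

for name, student in [("spread", spread), ("peaked", peaked)]:
    print(f"{name}: forward={forward_kl(teacher, student):.3f} "
          f"reverse={reverse_kl(teacher, student):.3f}")
```

With these numbers, forward KL scores the spread student as closer to the teacher, while reverse KL prefers the peaked one. That is the mode-covering versus mode-seeking contrast in miniature.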
## Implementation

### Reverse KLD Loss

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """
    Token-level reverse KL divergence: KL(Student || Teacher).

    Args:
        student_logits: Model predictions (batch, seq_len, vocab_size)
        teacher_logits: Teacher predictions (batch, seq_len, vocab_size)
        temperature: Softening parameter

    Returns:
        Reverse KL divergence loss
    """
    # Teacher log-distribution (target, detached so no gradient reaches the teacher)
    log_p_teacher = F.log_softmax(teacher_logits.detach() / temperature, dim=-1)

    # Student distribution (learnable)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_student = log_p_student.exp()

    # Reverse KL: Σ p_student * (log p_student - log p_teacher)
    reverse_kl = (p_student * (log_p_student - log_p_teacher)).sum(dim=-1).mean()

    # Temperature correction
    return reverse_kl * (temperature ** 2)
```

### Policy Gradient Optimization

**Challenge**: Sequence-level reverse KL is an expectation over the student's own samples, and sampling is not differentiable.

**Solution**: Use policy gradient with student samples.

```python
def minillm_policy_gradient(student_model, teacher_model, prompt_batch):
    """
    Simplified policy-gradient step for reverse KL.

    This is a plain REINFORCE-style sketch; the MiniLLM paper adds further
    stabilization techniques that are omitted here.

    Steps:
    1. Sample responses from the student
    2. Score those samples under both student and teacher
    3. Weight the student's log-likelihood by the detached log-ratio
    """
    # prompt_batch is assumed to be a tensor of input_ids (batch, prompt_len)
    prompt_len = prompt_batch.shape[1]

    # 1. Generate from the student (no gradient flows through sampling)
    with torch.no_grad():
        sequences = student_model.generate(
            prompt_batch,
            max_new_tokens=256,
            do_sample=True,
            temperature=1.0,
        )

    # 2. Log-probabilities of the sampled tokens under both models
    #    (attention masks and padding handling omitted for brevity)
    targets = sequences[:, 1:].unsqueeze(-1)
    student_logits = student_model(input_ids=sequences).logits[:, :-1]
    student_logp = F.log_softmax(student_logits, dim=-1).gather(-1, targets).squeeze(-1)
    with torch.no_grad():
        teacher_logits = teacher_model(input_ids=sequences).logits[:, :-1]
        teacher_logp = F.log_softmax(teacher_logits, dim=-1).gather(-1, targets).squeeze(-1)

    # Keep only the generated tokens and sum into sequence log-probabilities
    student_seq = student_logp[:, prompt_len - 1:].sum(dim=-1)
    teacher_seq = teacher_logp[:, prompt_len - 1:].sum(dim=-1)

    # 3. Policy gradient loss
    # Gradient of E_{y~student}[log p_student(y) - log p_teacher(y)]
    log_ratio = (student_seq - teacher_seq).detach()
    loss = (log_ratio * student_seq).mean()

    return loss
```

## Training Procedure

### Two-Stage