
Fine Tuning With Trl
Align an open model on preference pairs with TRL DPOConfig and the right loss variant (sigmoid, IPO, hinge, robust, BCO) instead of guessing hyperparameters.
Overview
fine-tuning-with-trl is an agent skill for the Build phase that guides TRL DPO training across 10+ documented loss variants and ready-made DPOConfig patterns.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill fine-tuning-with-trl
What is this skill?
- Documents 10+ DPO loss variants in TRL with formulas and when to use each (sigmoid default, IPO, hinge/SLiC, robust with
- Copy-paste DPOConfig blocks per variant with beta, batch size, learning rate, and length limits where the guide specifie
- Maps scenarios to losses: general alignment, theoretical IPO, margin hinge, noisy labels via robust + label_smoothing
- Centers on chosen/rejected preference pairs and KL-style beta tuning common in RLHF-style workflows
- 10+ DPO loss variants documented in TRL
- 5 named loss sections with DPOConfig examples in the skill excerpt (sigmoid, IPO, hinge, robust, BCO pair)
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have preference pairs and a base model but do not know which TRL loss_type, beta, or batch settings fit your alignment goal or noisy labels.
Who is it for?
Indie builders fine-tuning open models with TRL on preference data who want variant-specific configs without reading the full TRL source.
Skip if: You have no GPU training setup, no chosen/rejected dataset, or only need prompt engineering without weight updates.
When should I use this skill?
User is implementing TRL DPO fine-tuning, comparing loss_type options, or tuning beta and batch settings on preference data.
What do I get? / Deliverables
You get a concrete DPOConfig and loss choice aligned to your scenario so your agent can implement a TRL training run with defensible hyperparameters.
- DPOConfig selection per scenario
- Training hyperparameter recommendations
- TRL training script scaffold
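As a rough illustration of the "DPOConfig selection per scenario" deliverable, the sketch below maps a scenario label to keyword arguments for `DPOConfig`. The helper name `pick_dpo_kwargs` and the scenario labels are hypothetical and not part of the skill or of TRL; the values are copied from the SKILL.md excerpt on this page and should be treated as starting points rather than tuned recommendations.

```python
# Hypothetical lookup table: scenario label -> DPOConfig keyword arguments.
# The labels are illustrative; the values come from the SKILL.md excerpt on this page.
SCENARIO_KWARGS = {
    "general": {
        "loss_type": "sigmoid",
        "beta": 0.1,
        "per_device_train_batch_size": 64,
        "learning_rate": 1e-6,
    },
    "overfitting": {
        "loss_type": "ipo",
        "beta": 0.1,
        "per_device_train_batch_size": 90,
        "learning_rate": 1e-2,
    },
    "margin": {
        "loss_type": "hinge",
        "beta": 0.1,
        "per_device_train_batch_size": 512,
        "learning_rate": 1e-4,
    },
    "noisy_labels": {
        "loss_type": "robust",
        "beta": 0.01,
        "label_smoothing": 0.1,
        "per_device_train_batch_size": 16,
        "learning_rate": 1e-3,
    },
}


def pick_dpo_kwargs(scenario, **overrides):
    """Return a copy of the kwargs for `scenario`, with optional overrides applied."""
    if scenario not in SCENARIO_KWARGS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    kwargs = dict(SCENARIO_KWARGS[scenario])  # copy so the table is never mutated
    kwargs.update(overrides)
    return kwargs


# With TRL installed, the result could be passed along like this
# (the output directory name is an assumption):
#   from trl import DPOConfig
#   config = DPOConfig(output_dir="dpo-out", **pick_dpo_kwargs("noisy_labels"))
```

Keeping the table as plain dictionaries makes it easy to lower the batch size or learning rate per run without editing the defaults.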
Journey fit
Model fine-tuning and preference optimization happen during product build when you ship custom LLM behavior, not during idea validation or launch distribution. TRL training scripts, batch sizes, and loss configs are backend/ML pipeline work tied to training jobs and model artifacts.
How it compares
Use for TRL DPO loss selection and configs, not as a general Hugging Face Trainer cheat sheet or an MCP inference server.
Common Questions / FAQ
Who is fine-tuning-with-trl for?
Solo and indie developers shipping custom LLM behavior who already use or plan to use Hugging Face TRL and preference-labeled data.
When should I use fine-tuning-with-trl?
During Build when you are wiring backend training jobs, tuning agent models on user feedback, or comparing DPO losses before a ship-phase eval pass.
Is fine-tuning-with-trl safe to install?
Review the Security Audits panel on this Prism page before running training commands; the skill describes ML configs and does not replace your sandbox and secret-handling policy.
SKILL.md
# DPO Variants

Complete guide to Direct Preference Optimization loss variants in TRL.

## Overview

DPO optimizes models using preference data (chosen/rejected pairs). TRL supports 10+ loss variants for different scenarios.

## Loss Types

### 1. Sigmoid (Standard DPO)

**Formula**: `-log(sigmoid(β * logits))`

**When to use**: Default choice, general preference alignment

**Config**:

```python
from trl import DPOConfig

DPOConfig(
    loss_type="sigmoid",
    beta=0.1,  # KL penalty
    per_device_train_batch_size=64,
    learning_rate=1e-6
)
```

### 2. IPO (Identity Preference Optimization)

**Formula**: `(logits - 1/(2β))²`

**When to use**: Better theoretical foundation, reduce overfitting

**Config**:

```python
DPOConfig(
    loss_type="ipo",
    beta=0.1,
    per_device_train_batch_size=90,
    learning_rate=1e-2
)
```

### 3. Hinge (SLiC)

**Formula**: `ReLU(1 - β * logits)`

**When to use**: Margin-based objective

**Config**:

```python
DPOConfig(
    loss_type="hinge",
    beta=0.1,
    per_device_train_batch_size=512,
    learning_rate=1e-4
)
```

### 4. Robust DPO

**Formula**: Sigmoid with label smoothing for noise robustness

**When to use**: Noisy preference labels

**Config**:

```python
DPOConfig(
    loss_type="robust",
    beta=0.01,
    label_smoothing=0.1,  # Noise probability
    per_device_train_batch_size=16,
    learning_rate=1e-3,
    max_prompt_length=128,
    max_length=512
)
```

### 5. BCO Pair (Binary Classification)

**Formula**: Train binary classifier (chosen=1, rejected=0)

**When to use**: Pairwise preference data

**Config**:

```python
DPOConfig(
    loss_type="bco_pair",
    beta=0.01,
    per_device_train_batch_size=128,
    learning_rate=5e-7,
    max_prompt_length=1536,
    max_completion_length=512
)
```

### 6. SPPO Hard

**Formula**: Push chosen→0.5, rejected→-0.5

**When to use**: Nash equilibrium, sparse data

**Config**:

```python
DPOConfig(
    loss_type="sppo_hard",
    beta=0.1
)
```

### 7. DiscoPOP

**Formula**: Log-Ratio Modulated Loss

**When to use**: Automated loss discovery

**Config**:

```python
DPOConfig(
    loss_type="discopop",
    beta=0.05,
    discopop_tau=0.05,
    per_device_train_batch_size=64,
    learning_rate=5e-7
)
```

### 8. APO Zero

**Formula**: Increase chosen, decrease rejected likelihood

**When to use**: Model worse than winning outputs

**Config**:

```python
DPOConfig(
    loss_type="apo_zero",
    beta=0.1,
    per_device_train_batch_size=64,
    learning_rate=2e-7,
    max_prompt_length=512,
    max_completion_length=512
)
```

### 9. APO Down

**Formula**: Decrease both, emphasize rejected reduction

**When to use**: Model better than winning outputs

**Config**:

```python
DPOConfig(
    loss_type="apo_down",
    beta=0.1,
    # Same hyperparameters as apo_zero
)
```

### 10. AOT & AOT Pair

**Formula**: Distributional alignment via stochastic dominance

**When to use**:

- `aot_pair`: Paired preference data
- `aot`: Unpaired data

**Config**:

```python
DPOConfig(
    loss_type="aot_pair",  # or "aot"
    beta=0.1,
    label_smoothing=0.0
)
```

## Multi-Loss Training

Combine multiple losses:

```python
DPOConfig(
    loss_type=["sigmoid", "ipo"],
    loss_weights=[0.7, 0.3],  # Weighted combination
    beta=0.1
)
```

## Key Parameters

### Beta (β)

Controls deviation from reference model:

- **Higher** (0.5): More conservative, stays close to reference
- **Lower** (0.01): More aggressive alignment
- **Default**: 0.1

### Label Smoothing

For robust DPO:

- **0.0**: No smoothing (default)
- **0.1-0.3**: Moderate noise robustness
- **0.5**: Maximum noise tolerance

### Max Lengths

- `max_prompt_length`: 128-1536
- `max_completion_length`: 128-512
- `max_length`: Total sequence (1024-2048)

## Comparison Table

| Loss | Speed | Stability | Best For |
|------|-------|-----------|----------|
| Sigmoid | Fast | Good | **General use** |
| IPO | Fast | Better | Overfitting issues |
| Hinge | Fast | Good | Margin objectives |
| Robust | Fast | Best | Noisy data |
| BCO | Medium | G
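
The closed-form formulas listed under Loss Types (sigmoid, IPO, hinge) can be sketched for a single preference pair in plain Python. This is an illustrative reimplementation and not TRL's code; it assumes `logits` means the policy's chosen-minus-rejected log-probability ratio minus the reference model's, which is the usual DPO definition. The function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_loss(logits, beta=0.1):
    # Standard DPO: -log(sigmoid(beta * logits))
    return -log_sigmoid(beta * logits)


def ipo_loss(logits, beta=0.1):
    # IPO: squared distance between logits and the target margin 1 / (2 * beta)
    return (logits - 1.0 / (2.0 * beta)) ** 2


def hinge_loss(logits, beta=0.1):
    # Hinge (SLiC): ReLU(1 - beta * logits)
    return max(0.0, 1.0 - beta * logits)


# Compare the three losses as the preference margin grows.
for margin in (0.0, 5.0, 10.0):
    print(margin, sigmoid_loss(margin), ipo_loss(margin), hinge_loss(margin))
```

With `beta=0.1`, the hinge loss reaches zero once the margin hits 10 and the IPO loss is smallest at a margin of 5, while the sigmoid loss keeps decreasing as the margin grows. That difference is one way to read the "reduce overfitting" note on IPO.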