Peft Fine Tuning

Name: Peft Fine Tuning
Author: orchestra-research

orchestra-research/ai-research-skills

399 installs
11.2k repo stars
Updated June 16, 2026
peft-fine-tuning is a Claude Code skill that helps developers configure DoRA, AdaLoRA, and LoRA+ PEFT adapters when fine-tuning causal language models without full-weight training on limited GPU memory.

About

peft-fine-tuning is an AI research skill for advanced parameter-efficient fine-tuning of causal LMs with Hugging Face PEFT. The guide configures LoraConfig with r=16, lora_alpha=32, and target_modules q_proj, v_proj, k_proj, and o_proj, enabling DoRA via use_dora=True for weight-decomposed adaptation that often beats standard LoRA on instruction-following tasks at roughly 10% higher memory from magnitude vectors. It also covers AdaLoRA adaptive rank allocation and LoRA+ learning-rate splits for quality-critical fine-tunes. Developers reach for peft-fine-tuning when GPU memory blocks full fine-tuning and they need documented adapter variant selection instead of default LoRA settings that underperform on instruction data.

DoRA via use_dora=True on LoraConfig for instruction-following quality over standard LoRA
AdaLoRA with init_r, target_r, and pruning schedule for adaptive per-layer rank
LoRA+ patterns with asymmetric learning rates on A and B low-rank matrices
Guidance on when DoRA trades ~10% extra memory for quality-critical runs

Peft Fine Tuning by the numbers

399 all-time installs (skills.sh)
+37 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #496 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill peft-fine-tuning

Security audit	2 / 3 scanners passed

How do you configure DoRA and AdaLoRA in PEFT?

Configure DoRA, AdaLoRA, and LoRA+ PEFT adapters when fine-tuning causal LMs without full-weight training on limited GPU memory.

Who is it for?

ML engineers fine-tuning causal LMs with Hugging Face PEFT who need DoRA, AdaLoRA, or LoRA+ configs beyond basic LoRA defaults.

Skip if: Full-parameter fine-tuning on large multi-GPU clusters or vision-only adapter training without causal LM PEFT adapters.

When should I use this skill?

User asks about DoRA, AdaLoRA, LoRA+, or advanced PEFT adapter configuration for causal LM fine-tuning.

What you get

PEFT adapter weights, LoraConfig with chosen variant, and get_peft_model-wrapped causal LM checkpoint.

PEFT adapter checkpoint
LoraConfig specification
Fine-tuned causal LM with adapters

By the numbers

Default LoRA rank r=16 with lora_alpha=32
DoRA adds roughly 10% memory overhead versus standard LoRA

Files

SKILL.md

PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

When to use PEFT

Use PEFT/LoRA when:

Fine-tuning 7B-70B models with limited GPU memory (e.g., RTX 4090, A100)
Need to train <1% of parameters (adapters of a few MB to tens of MB vs a ~16GB full model)
Want fast iteration with multiple task-specific adapters
Deploying multiple fine-tuned variants from one base model

Use QLoRA (PEFT + quantization) when:

Fine-tuning 70B models on a single 48GB GPU (4-bit weights for 70B parameters take roughly 35GB on their own)
Memory is the primary constraint
Can accept a small quality trade-off vs full fine-tuning

Use full fine-tuning instead when:

Training small models (<1B parameters)
Need maximum quality and have compute budget
Significant domain shift requires updating all weights

Quick start

Installation

# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets

LoRA fine-tuning (standard)

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank (8-64, higher = more capacity)
    lora_alpha=32,                 # Scaling factor (typically 2*r)
    lora_dropout=0.05,             # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none"                    # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints roughly: trainable params: 13,631,488 || all params: ~8.04B || trainable%: ~0.17

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,  # bf16 matches the checkpoint dtype; use fp16=True on GPUs without bf16 support
    logging_steps=10,
    save_strategy="epoch"
)

from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # Batches the tokenized lists into tensors and copies input_ids to labels (padding masked out)
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)

trainer.train()

# Save adapter only (tens of MB vs ~16GB for the full model)
model.save_pretrained("./lora-llama-adapter")

QLoRA fine-tuning (memory-efficient)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16
    bnb_4bit_use_double_quant=True       # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,                              # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
# 4-bit weights for 70B parameters take roughly 35GB, so plan for a 48GB GPU or several smaller ones

LoRA parameter selection

Rank (r) - capacity vs efficiency

Rank	Trainable Params	Memory	Quality	Use Case
4	~3M	Minimal	Lower	Simple tasks, prototyping
8	~7M	Low	Good	Recommended starting point
16	~14M	Medium	Better	General fine-tuning
32	~27M	Higher	High	Complex tasks
64	~54M	High	Highest	Domain adaptation, 70B models
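The parameter counts in the table follow from simple arithmetic: each adapted linear layer with shape (in, out) gains r × (in + out) trainable weights. The sketch below assumes Llama-3.1-8B-like shapes (hidden size 4096, key/value projection width 1024, 32 layers) and adapters on the four attention projections only. The shape values and the function name are illustrative assumptions, not values read from a model config.

```python
def lora_param_count(rank, layer_shapes, num_layers):
    """Trainable LoRA weights: A is (rank x in) and B is (out x rank) per adapted layer."""
    per_block = sum(rank * (in_f + out_f) for in_f, out_f in layer_shapes)
    return per_block * num_layers

# Assumed (in_features, out_features) for q_proj, k_proj, v_proj, o_proj
ATTENTION_SHAPES = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]

for rank in (4, 8, 16, 32, 64):
    total = lora_param_count(rank, ATTENTION_SHAPES, num_layers=32)
    print(f"r={rank:<2} -> {total:,} trainable parameters")
```

Adding the MLP projections to the shape list raises every row proportionally, which is why targeting more modules costs more memory at the same rank.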

Alpha (lora_alpha) - scaling factor

# Rule of thumb: alpha = 2 * rank
LoraConfig(r=16, lora_alpha=32)  # Standard
LoraConfig(r=16, lora_alpha=16)  # Conservative (lower learning rate effect)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher learning rate effect)
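The "learning rate effect" in the comments above comes from the multiplier that standard LoRA applies to the low-rank update, lora_alpha / r. A minimal sketch of that arithmetic (the helper name is hypothetical):

```python
def lora_scale(r, lora_alpha):
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r

for alpha in (16, 32, 64):
    print(f"r=16, lora_alpha={alpha}: update scaled by {lora_scale(16, alpha)}")
```

Doubling lora_alpha at a fixed rank doubles the size of the adapter's contribution, which behaves much like raising the learning rate for the adapter weights.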

Target modules by architecture

# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2 / GPT-Neo
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers
target_modules = "all-linear"  # PEFT 0.8.0+
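When target_modules is a list, each entry is compared against the end of every module's dotted name. The sketch below is a simplified stand-in for that rule, useful for checking a list against a model's module names before training; the function and the sample names are hypothetical, and the library's real matching logic handles more cases (for example regex strings).

```python
def matches_target(module_name, target_modules):
    """Simplified suffix rule: exact name, or the trailing path component(s) match."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )

# Hypothetical module names in the style of a Llama checkpoint
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.gate_proj",
    "lm_head",
]

targets = ["q_proj", "v_proj"]
selected = [name for name in module_names if matches_target(name, targets)]
print(selected)
```

An empty result from a check like this usually means the names in target_modules belong to a different architecture.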

Loading and merging adapters

Load trained adapter

from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-llama-adapter",
    device_map="auto"
)

Merge adapter into base model

# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")

Multi-adapter serving

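from peft import AutoPeftModelForCausalLM

# Load base with first adapter, naming it explicitly (it would otherwise be called "default")
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1", adapter_name="task1")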

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
model.set_adapter("task1")  # Use task1 adapter
output1 = model.generate(**inputs)

model.set_adapter("task2")  # Switch to task2
output2 = model.generate(**inputs)

# Disable adapters (use base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)

PEFT methods comparison

Method	Trainable %	Memory	Speed	Best For
LoRA	0.1-1%	Low	Fast	General fine-tuning
QLoRA	0.1-1%	Very Low	Medium	Memory-constrained
AdaLoRA	0.1-1%	Low	Medium	Automatic rank selection
IA3	0.01%	Minimal	Fastest	Few-shot adaptation
Prefix Tuning	0.1%	Low	Medium	Generation control
Prompt Tuning	0.001%	Minimal	Fast	Simple task adaptation
P-Tuning v2	0.1%	Low	Medium	NLU tasks

IA3 (minimal parameters)

from peft import IA3Config, get_peft_model

ia3_config = IA3Config(
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"]
)
model = get_peft_model(model, ia3_config)
# Trains only 0.01% of parameters!

Prefix Tuning

from peft import PrefixTuningConfig, get_peft_model

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,      # Prepended tokens
    prefix_projection=True       # Use MLP projection
)
model = get_peft_model(model, prefix_config)

Integration patterns

With TRL (SFTTrainer)

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=512),
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config directly
)
trainer.train()

With Axolotl (YAML config)

# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers

With vLLM (inference)

from vllm import LLM
from vllm.lora.request import LoRARequest

# Load base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with adapter
outputs = llm.generate(
    prompts,
    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
)

Performance benchmarks

Memory usage (Llama 3.1 8B)

Method	GPU Memory	Trainable Params
Full fine-tuning	60+ GB	8B (100%)
LoRA r=16	18 GB	14M (0.17%)
QLoRA r=16	6 GB	14M (0.17%)
IA3	16 GB	800K (0.01%)
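A rough way to sanity-check figures like these is to estimate the memory taken by the weights alone. The sketch below assumes a round 8 billion parameters and ignores activations, gradients, optimizer state, and quantization metadata, so real usage is higher than what it prints; the helper name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for model weights only, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8.0e9  # assumed round figure for an 8B model

for label, bits in [("fp32", 32), ("bf16/fp16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label:<9} {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

The gap between these weight-only numbers and the table reflects the training overhead that sits on top of the frozen base model.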

Training speed (A100 80GB)

Method	Tokens/sec	vs Full FT
Full FT	2,500	1x
LoRA	3,200	1.3x
QLoRA	2,100	0.84x

Quality (MMLU benchmark)

Model	Full FT	LoRA	QLoRA
Llama 2-7B	45.3	44.8	44.1
Llama 2-13B	54.8	54.2	53.5

Common issues

CUDA OOM during training

# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size + increase accumulation
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16
)

# Solution 3: Use QLoRA
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

Adapter not applying

# Verify adapter is active
print(model.active_adapters)  # Should show adapter name

# Check trainable parameters
model.print_trainable_parameters()

# Ensure model in training mode
model.train()

Quality degradation

# Increase rank
LoraConfig(r=32, lora_alpha=64)

# Target more modules
target_modules = "all-linear"

# Use more training data and epochs
TrainingArguments(num_train_epochs=5)

# Lower learning rate
TrainingArguments(learning_rate=1e-4)

Best practices

1. Start with r=8-16, increase if quality insufficient
2. Use alpha = 2 * rank as starting point
3. Target attention + MLP layers for best quality/efficiency
4. Enable gradient checkpointing for memory savings
5. Save adapters frequently (small files, easy rollback)
6. Evaluate on held-out data before merging
7. Use QLoRA for 70B+ models on consumer hardware

References

[Advanced Usage](references/advanced-usage.md) - DoRA, LoftQ, rank stabilization, custom modules
[Troubleshooting](references/troubleshooting.md) - Common errors, debugging, optimization

Resources

GitHub: https://github.com/huggingface/peft
Docs: https://huggingface.co/docs/peft
LoRA Paper: arXiv:2106.09685
QLoRA Paper: arXiv:2305.14314
Models: https://huggingface.co/models?library=peft

PEFT Advanced Usage Guide

Advanced LoRA Variants

DoRA (Weight-Decomposed Low-Rank Adaptation)

DoRA decomposes weights into magnitude and direction components, often achieving better results than standard LoRA:

from peft import LoraConfig, get_peft_model

dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    use_dora=True,  # Enable DoRA
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, dora_config)

When to use DoRA:

Often outperforms LoRA on instruction-following tasks
Slightly higher memory (~10%) due to magnitude vectors
Best for quality-critical fine-tuning
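The magnitude/direction split can be illustrated on a toy matrix. This is a sketch only: it assumes the norm is taken over each output row (the DoRA paper states it column-wise under its own weight layout), uses plain Python lists instead of tensors, and all names are hypothetical.

```python
import math

def row_norms(matrix):
    """L2 norm of each row."""
    return [math.sqrt(sum(x * x for x in row)) for row in matrix]

def dora_weight(base, delta, magnitude):
    """Rescale each row of (base + delta) to the learned magnitude."""
    combined = [[b + d for b, d in zip(b_row, d_row)] for b_row, d_row in zip(base, delta)]
    norms = row_norms(combined)
    return [[m * x / n for x in row] for row, m, n in zip(combined, magnitude, norms)]

base = [[3.0, 4.0], [6.0, 8.0]]      # toy 2x2 pretrained weight
magnitude = row_norms(base)          # magnitude vector initialised from the base weight
zero_delta = [[0.0, 0.0], [0.0, 0.0]]
delta = [[1.0, 0.0], [0.0, -2.0]]    # stands in for the low-rank product B @ A

# With no update, the decomposition reproduces the base weight
print(dora_weight(base, zero_delta, magnitude))

# With an update, the direction changes but each row keeps its magnitude
print(row_norms(dora_weight(base, delta, magnitude)))
```

The magnitude vector has one entry per output feature, which is where the extra trainable parameters and memory relative to plain LoRA come from.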

AdaLoRA (Adaptive Rank)

Automatically adjusts rank per layer based on importance:

from peft import AdaLoraConfig

adalora_config = AdaLoraConfig(
    init_r=64,              # Initial rank
    target_r=16,            # Target average rank
    tinit=200,              # Warmup steps before pruning starts
    tfinal=1000,            # Final fine-tuning steps at the target budget
    deltaT=10,              # Rank update frequency
    total_step=3000,        # Total training steps (example value; set to your run length)
    beta1=0.85,
    beta2=0.85,
    orth_reg_weight=0.5,    # Orthogonality regularization
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"
)

Benefits:

Allocates more rank to important layers
Can reduce total parameters while maintaining quality
Good for exploring optimal rank distribution
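The rank budget shrinks from init_r to target_r on a cubic schedule between the warmup and the final fine-tuning phase. The sketch below follows that schedule as I understand it from the AdaLoRA paper, expressed as an average rank; total_step=3000 is a hypothetical run length and the function name is made up for illustration.

```python
def adalora_rank_budget(step, init_r=64, target_r=16, tinit=200, tfinal=1000, total_step=3000):
    """Average rank budget at a given step under a cubic decay schedule."""
    if step <= tinit:
        return init_r                      # warmup: keep the full initial rank
    if step > total_step - tfinal:
        return target_r                    # final phase: budget fixed at the target
    progress = (step - tinit) / (total_step - tfinal - tinit)
    return int((init_r - target_r) * (1 - progress) ** 3 + target_r)

for step in (0, 200, 600, 1100, 2000, 3000):
    print(f"step {step:>4}: average rank budget {adalora_rank_budget(step)}")
```

Most of the pruning happens early in the decay window, so a short run with a large tinit may never reach the target rank.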

LoRA+ (Asymmetric Learning Rates)

Different learning rates for A and B matrices:

from peft import LoraConfig
from torch.optim import AdamW

# LoRA+ is applied through the optimizer, not through LoraConfig
lora_plus_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    use_rslora=True,  # Optional: rank-stabilized scaling (a separate technique, not part of LoRA+)
)

# Manual implementation of LoRA+ (run after get_peft_model)
# Group trainable parameters by LoRA matrix
lora_A_params = [p for n, p in model.named_parameters() if "lora_A" in n and p.requires_grad]
lora_B_params = [p for n, p in model.named_parameters() if "lora_B" in n and p.requires_grad]

optimizer = AdamW([
    {"params": lora_A_params, "lr": 1e-4},
    {"params": lora_B_params, "lr": 1e-3},  # 10x higher for B
])
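The grouping step can be tried without a model by treating parameters as placeholders. This is a sketch: the helper name and the parameter names are hypothetical, and strings stand in for tensors.

```python
def loraplus_param_groups(named_params, base_lr, lr_ratio):
    """Split LoRA parameters into A and B groups; the B group gets base_lr * lr_ratio."""
    group_a = [p for name, p in named_params if "lora_A" in name]
    group_b = [p for name, p in named_params if "lora_B" in name]
    return [
        {"params": group_a, "lr": base_lr},
        {"params": group_b, "lr": base_lr * lr_ratio},
    ]

# Hypothetical parameter names; strings stand in for tensors
named_params = [
    ("layers.0.q_proj.lora_A.default.weight", "A0"),
    ("layers.0.q_proj.lora_B.default.weight", "B0"),
    ("layers.0.q_proj.base_layer.weight", "W0"),
]

groups = loraplus_param_groups(named_params, base_lr=1e-4, lr_ratio=10)
for group in groups:
    print(group["lr"], group["params"])
```

The resulting list has the same shape as the parameter groups passed to AdamW above; the frozen base weight falls into neither group.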

rsLoRA (Rank-Stabilized LoRA)

Scales LoRA outputs to stabilize training with different ranks:

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    use_rslora=True,  # Enables rank-stabilized scaling
    target_modules="all-linear"
)

When to use:

When experimenting with different ranks
Helps maintain consistent behavior across rank values
Recommended for r > 32
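The difference between the two scaling rules is easiest to see side by side. Standard LoRA scales the update by lora_alpha / r, while rsLoRA uses lora_alpha / sqrt(r), so the multiplier shrinks more slowly as rank grows. A minimal sketch with hypothetical helper names:

```python
import math

def standard_scale(r, lora_alpha):
    """Standard LoRA multiplier."""
    return lora_alpha / r

def rslora_scale(r, lora_alpha):
    """Rank-stabilized multiplier."""
    return lora_alpha / math.sqrt(r)

LORA_ALPHA = 64
for r in (8, 16, 64, 256):
    print(f"r={r:<3} standard={standard_scale(r, LORA_ALPHA):6.2f}  rsLoRA={rslora_scale(r, LORA_ALPHA):6.2f}")
```

At fixed lora_alpha the standard multiplier falls off linearly with rank, which is one reason very high ranks can train sluggishly without rank-stabilized scaling.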

LoftQ (LoRA-Fine-Tuning-aware Quantization)

Initializes LoRA weights to compensate for quantization error:

from peft import LoftQConfig, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# LoftQ configuration
loftq_config = LoftQConfig(
    loftq_bits=4,              # Quantization bits
    loftq_iter=5,              # Alternating optimization iterations
)

# LoRA config with LoftQ initialization
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    task_type="CAUSAL_LM"
)

# Load the base model without quantization: LoftQ quantizes the weights itself
# while initializing the adapters, so no quantization_config is passed here
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

model = get_peft_model(model, lora_config)

Benefits over standard QLoRA:

Better initial quality after quantization
Faster convergence
~1-2% better final accuracy on benchmarks

Custom Module Targeting

Target specific layers

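# Target only the first and last transformer layers
# Option A: list the full module names explicitly
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["model.layers.0.self_attn.q_proj",
                    "model.layers.0.self_attn.v_proj",
                    "model.layers.31.self_attn.q_proj",
                    "model.layers.31.self_attn.v_proj"],
)

# Option B: name the module types and restrict them by layer index
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=[0, 31],
)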

Layer pattern matching

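# Target layers 0-10 only
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Must be a list when layers_to_transform is set
    layers_to_transform=list(range(11)),  # Layers 0-10
    layers_pattern="layers"  # Name of the layer container (model.layers.<index>)
)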

Exclude specific layers

lora_config = LoraConfig(
    r=16,
    target_modules="all-linear",  # Skips the output layer (lm_head)
    modules_to_save=["lm_head"],  # Train lm_head fully instead of through LoRA
)

Embedding and LM Head Training

Train embeddings with LoRA

from peft import LoraConfig

# Include embeddings
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "embed_tokens"],  # Include embeddings
    modules_to_save=["lm_head"],  # Train lm_head fully
)

Extending vocabulary with LoRA

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig

# Add new tokens
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
new_tokens = ["<custom_token_1>", "<custom_token_2>"]
tokenizer.add_tokens(new_tokens)

# Resize model embeddings
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model.resize_token_embeddings(len(tokenizer))

# Configure LoRA to train new embeddings
lora_config = LoraConfig(
    r=16,
    target_modules="all-linear",
    modules_to_save=["embed_tokens", "lm_head"],  # Train these fully
)

model = get_peft_model(model, lora_config)

Multi-Adapter Patterns

Adapter composition

from peft import AutoPeftModelForCausalLM

# Load model with multiple adapters
model = AutoPeftModelForCausalLM.from_pretrained("./base-adapter")
model.load_adapter("./style-adapter", adapter_name="style")
model.load_adapter("./task-adapter", adapter_name="task")

# Combine adapters (weighted sum)
model.add_weighted_adapter(
    adapters=["style", "task"],
    weights=[0.7, 0.3],
    adapter_name="combined",
    combination_type="linear"  # or "cat", "svd"
)

model.set_adapter("combined")

Adapter stacking

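# Concatenate adapters along the rank dimension. The combined update is the
# weighted sum of the individual updates, not a sequential application.
# Every name listed must belong to an adapter that is already loaded.
model.add_weighted_adapter(
    adapters=["base", "domain", "task"],
    weights=[1.0, 1.0, 1.0],
    adapter_name="stacked",
    combination_type="cat"  # Concatenate the low-rank matrices
)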
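One way to see what concatenation does to the combined update is to work it through on tiny matrices. The sketch below shows the algebraic identity only, using plain Python lists and two toy rank-1 adapters; how the library distributes the weights internally may differ, and all names are hypothetical.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)] for row in left]

def add(m1, m2):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def scale(matrix, factor):
    return [[factor * x for x in row] for row in matrix]

# Two toy rank-1 adapters on a 2x2 layer: B is (out x r), A is (r x in)
B1, A1 = [[1], [2]], [[3, 4]]
B2, A2 = [[5], [6]], [[7, 8]]
w1, w2 = 2, 3

# Weighted sum of the individual updates
summed = add(scale(matmul(B1, A1), w1), scale(matmul(B2, A2), w2))

# Concatenation along the rank dimension: stack weighted A rows, join B columns
A_cat = scale(A1, w1) + scale(A2, w2)          # shape (2 x in)
B_cat = [b1 + b2 for b1, b2 in zip(B1, B2)]    # shape (out x 2)
concatenated = matmul(B_cat, A_cat)

print(summed == concatenated)
```

The concatenated adapter has rank equal to the sum of the input ranks, so combining many adapters this way increases adapter size and compute.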

Dynamic adapter switching

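import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

class MultiAdapterModel:
    def __init__(self, base_model_path, adapter_paths):
        # adapter_paths: dict mapping adapter name -> adapter directory
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_path)
        names = list(adapter_paths)
        # The first adapter also loads the base model recorded in its config
        self.model = AutoPeftModelForCausalLM.from_pretrained(
            adapter_paths[names[0]], adapter_name=names[0]
        )
        for name in names[1:]:
            self.model.load_adapter(adapter_paths[name], adapter_name=name)

    def tokenize(self, prompt):
        return self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

    def generate(self, prompt, adapter_name):
        self.model.set_adapter(adapter_name)
        return self.model.generate(**self.tokenize(prompt))

    def generate_ensemble(self, prompt, adapters, weights):
        """Return the weighted sum of logits across adapters"""
        inputs = self.tokenize(prompt)
        outputs = []
        with torch.no_grad():
            for adapter, weight in zip(adapters, weights):
                self.model.set_adapter(adapter)
                outputs.append(weight * self.model(**inputs).logits)
        return torch.stack(outputs).sum(dim=0)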

Memory Optimization

Gradient checkpointing with LoRA

from peft import prepare_model_for_kbit_training

# Enable gradient checkpointing
model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False}
)

CPU offloading for training

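from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Optimizer-state offloading is configured through a DeepSpeed (or FSDP) plugin
# Requires the deepspeed package
ds_plugin = DeepSpeedPlugin(
    zero_stage=2,
    offload_optimizer_device="cpu"  # Offload optimizer states to CPU
)

accelerator = Accelerator(
    mixed_precision="bf16",
    gradient_accumulation_steps=8,
    deepspeed_plugin=ds_plugin
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)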

Memory-efficient attention with LoRA

import torch
from peft import get_peft_model
from transformers import AutoModelForCausalLM

# Combine Flash Attention 2 with LoRA
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)

# Apply LoRA
model = get_peft_model(model, lora_config)

Inference Optimization

Merge for deployment

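# Merge adapter weights into base model and save the result
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")

# Reload the merged model with 4-bit quantization for inference
from transformers import AutoModelForCausalLM, BitsAndBytesConfig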

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "./merged-model",
    quantization_config=bnb_config
)

Export to different formats

# Export to GGUF (llama.cpp)
# First merge, then convert
merged_model.save_pretrained("./merged-model")

# Use llama.cpp converter
# python convert_hf_to_gguf.py ./merged-model --outfile model.gguf
# (older llama.cpp checkouts name the script convert-hf-to-gguf.py)

# Export to ONNX
from optimum.onnxruntime import ORTModelForCausalLM

ort_model = ORTModelForCausalLM.from_pretrained(
    "./merged-model",
    export=True
)
ort_model.save_pretrained("./onnx-model")

Batch adapter inference

from vllm import LLM
from vllm.lora.request import LoRARequest

# Initialize with LoRA support
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    enable_lora=True,
    max_lora_rank=64,
    max_loras=4  # Max concurrent adapters
)

# Batch with different adapters
requests = [
    ("prompt1", LoRARequest("adapter1", 1, "./adapter1")),
    ("prompt2", LoRARequest("adapter2", 2, "./adapter2")),
    ("prompt3", LoRARequest("adapter1", 1, "./adapter1")),
]

outputs = llm.generate(
    [r[0] for r in requests],
    lora_request=[r[1] for r in requests]
)

Training Recipes

Instruction tuning recipe

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    bias="none",
    task_type="CAUSAL_LM"
)

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
)

Code generation recipe

lora_config = LoraConfig(
    r=32,              # Higher rank for code
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

training_args = TrainingArguments(
    output_dir="./output",
    learning_rate=1e-4,        # Lower LR for code
    num_train_epochs=2,
)
# Longer sequences: set max_length=2048 when tokenizing (TrainingArguments takes no sequence-length argument)

Conversational/Chat recipe

from trl import SFTTrainer, SFTConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,  # alpha = r for chat
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM"
)

# Use chat template; map() needs a dict, so the formatted text goes in a "text" column
def format_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]}
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

trainer = SFTTrainer(
    model=model,
    # max_seq_length is named max_length in newer TRL releases
    args=SFTConfig(output_dir="./output", max_seq_length=1024, dataset_text_field="text"),
    peft_config=lora_config,
    train_dataset=dataset.map(format_chat),
)

Debugging and Validation

Verify adapter application

# Check which modules have LoRA
for name, module in model.named_modules():
    if hasattr(module, "lora_A"):
        print(f"LoRA applied to: {name}")

# Print detailed config
print(model.peft_config)

# Check adapter state
print(f"Active adapters: {model.active_adapters}")
print(f"Trainable: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")

Compare with base model

# Generate with adapter
model.set_adapter("default")
adapter_output = model.generate(**inputs)

# Generate without adapter
with model.disable_adapter():
    base_output = model.generate(**inputs)

print(f"Adapter: {tokenizer.decode(adapter_output[0])}")
print(f"Base: {tokenizer.decode(base_output[0])}")

Monitor training metrics

from transformers import TrainerCallback

class LoRACallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            # Log adapter-specific metrics
            model = kwargs["model"]
            lora_params = sum(p.numel() for n, p in model.named_parameters()
                            if "lora" in n and p.requires_grad)
            print(f"Step {state.global_step}: loss={logs['loss']:.4f}, lora_params={lora_params}")

PEFT Troubleshooting Guide

Installation Issues

bitsandbytes CUDA Error

Error: CUDA Setup failed despite GPU being available

Fix:

# Check CUDA version
nvcc --version

# Install matching bitsandbytes
pip uninstall bitsandbytes
pip install bitsandbytes --no-cache-dir

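# Or compile from source for specific CUDA (Makefile build used by older releases;
# newer releases build with CMake, see the bitsandbytes installation docs)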
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=118 make cuda11x  # Adjust for your CUDA
pip install .

Triton Import Error

Error: ModuleNotFoundError: No module named 'triton'

Fix:

# Install triton (Linux only)
pip install triton

# Windows: official Triton wheels are not published, so avoid features that require it
# Note: CUDA_VISIBLE_DEVICES only selects which GPU is visible; it does not disable Triton
export CUDA_VISIBLE_DEVICES=0

PEFT Version Conflicts

Error: AttributeError: 'LoraConfig' object has no attribute 'use_dora'

Fix:

# Upgrade to latest PEFT
pip install --upgrade "peft>=0.13.0"  # Quotes stop the shell from treating >= as a redirect

# Check version
python -c "import peft; print(peft.__version__)"

Training Issues

CUDA Out of Memory

Error: torch.cuda.OutOfMemoryError: CUDA out of memory

Solutions:

1. Enable gradient checkpointing:

from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

2. Reduce batch size:

TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16  # Maintain effective batch size
)

3. Use QLoRA:

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

4. Lower LoRA rank:

LoraConfig(r=8)  # Instead of r=16 or higher

5. Target fewer modules:

target_modules=["q_proj", "v_proj"]  # Instead of all-linear
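Solution 2 above lowers the per-device batch size while raising gradient accumulation so that the effective batch size stays the same. A minimal sketch of that arithmetic (the helper name is hypothetical):

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * accumulation_steps * num_devices

original = effective_batch_size(per_device_batch=4, accumulation_steps=4)
reduced = effective_batch_size(per_device_batch=1, accumulation_steps=16)
print(original, reduced)
```

Keeping this product constant means the learning rate and schedule tuned for the original setup can usually be reused.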

Loss Not Decreasing

Problem: Training loss stays flat or increases.

Solutions:

1. Check learning rate:

# Start lower
TrainingArguments(learning_rate=1e-4)  # Not 2e-4 or higher

2. Verify adapter is active:

model.print_trainable_parameters()
# Should show >0 trainable params

# Check adapter applied
print(model.peft_config)

3. Check data formatting:

# Verify tokenization
sample = dataset[0]
decoded = tokenizer.decode(sample["input_ids"])
print(decoded)  # Should look correct

4. Increase rank:

LoraConfig(r=32, lora_alpha=64)  # More capacity

NaN Loss

Error: Loss is NaN

Fix:

# Use bf16 instead of fp16
TrainingArguments(bf16=True, fp16=False)

# If fp16 is required, keep its built-in dynamic loss scaling and clip gradients
TrainingArguments(fp16=True, max_grad_norm=0.3)

# Lower learning rate
TrainingArguments(learning_rate=5e-5)

# Check for data issues (a batch whose labels are all masked gives an undefined loss)
for batch in dataloader:
    if (batch["labels"] == -100).all():
        print("All labels masked in this batch!")

Adapter Not Training

Problem: trainable params: 0 or model not updating.

Fix:

# Verify LoRA applied to correct modules
for name, module in model.named_modules():
    if "lora" in name.lower():
        print(f"Found LoRA: {name}")

# Check target_modules match model architecture
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING
print(TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING.get(model.config.model_type))

# Ensure model in training mode
model.train()

# Check requires_grad
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"Trainable: {name}")

Loading Issues

Adapter Loading Fails

Error: ValueError: Can't find adapter weights

Fix:

# Check adapter files exist
import os
print(os.listdir("./adapter-path"))
# Should contain: adapter_config.json, adapter_model.safetensors

# Load with correct structure
from peft import PeftModel, PeftConfig

# Check config
config = PeftConfig.from_pretrained("./adapter-path")
print(config)

# Load base model first
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "./adapter-path")

Base Model Mismatch

Error: RuntimeError: size mismatch

Fix:

# Ensure base model matches adapter
from peft import PeftConfig

config = PeftConfig.from_pretrained("./adapter-path")
print(f"Base model: {config.base_model_name_or_path}")

# Load exact same base model
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)

Safetensors vs PyTorch Format

Error: ValueError: We couldn't connect to 'https://huggingface.co'

Fix:

# Force local loading
model = PeftModel.from_pretrained(
    base_model,
    "./adapter-path",
    local_files_only=True
)

# Or specify format
model.save_pretrained("./adapter", safe_serialization=True)  # safetensors
model.save_pretrained("./adapter", safe_serialization=False)  # pytorch

Inference Issues

Slow Generation

Problem: Inference much slower than expected.

Solutions:

1. Merge adapter for deployment:

merged_model = model.merge_and_unload()
# No adapter overhead during inference

2. Use optimized inference engine:

from vllm import LLM
llm = LLM(model="./merged-model", dtype="half")

3. Enable Flash Attention:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2"
)

Output Quality Issues

Problem: Fine-tuned model produces worse outputs.

Solutions:

1. Check evaluation without adapter:

with model.disable_adapter():
    base_output = model.generate(**inputs)
# Compare with adapter output

2. Use deterministic (greedy) decoding during eval:

model.generate(**inputs, do_sample=False)  # temperature has no effect when do_sample=False

3. Retrain with more data:

# Increase training samples
# Use higher quality data
# Train for more epochs

Wrong Adapter Active

Problem: Model using wrong adapter or no adapter.

Fix:

# Check active adapters
print(model.active_adapters)

# Explicitly set adapter
model.set_adapter("your-adapter-name")

# List all adapters
print(model.peft_config.keys())

QLoRA Specific Issues

Quantization Errors

Error: RuntimeError: mat1 and mat2 shapes cannot be multiplied

Fix:

# Ensure compute dtype matches
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Match model dtype
    bnb_4bit_quant_type="nf4"
)

# Load with correct dtype
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16
)

QLoRA OOM

Problem: Out-of-memory (OOM) errors even with 4-bit quantization.

Fix:

# Enable double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True  # Further memory reduction
)

# Use offloading
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "20GB", "cpu": "100GB"}
)
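
The effect of double quantization on weight memory can be estimated with simple arithmetic. The sketch below is an approximation under stated assumptions. Without double quantization it assumes one fp32 constant per 64-weight block. With double quantization it assumes 8-bit constants plus one fp32 constant per 256 blocks, following the scheme described for QLoRA. The function name and defaults are hypothetical, and the estimate covers weights only (no activations, gradients, optimizer state, or unquantized layers).

```python
def estimate_4bit_weight_gb(n_params, double_quant=False, block_size=64):
    """Rough size of 4-bit quantized weights in GB (1e9 bytes).

    Assumes every parameter is quantized; embeddings and norms usually
    stay in higher precision, so real usage will be somewhat higher.
    """
    if double_quant:
        # 8-bit constant per block, plus one fp32 constant per 256 blocks
        constant_bits = 8 / block_size + 32 / (block_size * 256)
    else:
        # One fp32 quantization constant per block
        constant_bits = 32 / block_size
    bits_per_param = 4 + constant_bits
    return n_params * bits_per_param / 8 / 1e9


print(f"7B, no double quant: {estimate_4bit_weight_gb(7e9):.2f} GB")
print(f"7B, double quant:    {estimate_4bit_weight_gb(7e9, double_quant=True):.2f} GB")
```

Under these assumptions a 7B-parameter model needs roughly 3.94 GB without and 3.61 GB with double quantization. The option saves a few hundred megabytes, so if OOM persists, batch size, sequence length, and activation memory are more likely the limiting factors.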

QLoRA Merge Fails

Error: RuntimeError: expected scalar type BFloat16 but found Float

Fix:

# Merge into an unquantized copy of the base model
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load in higher precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,  # Not quantized
    device_map="auto"
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")

# Now merge
merged = model.merge_and_unload()

Multi-Adapter Issues

Adapter Conflict

Error: ValueError: Adapter with name 'default' already exists

Fix:

# Use unique names
model.load_adapter("./adapter1", adapter_name="task1")
model.load_adapter("./adapter2", adapter_name="task2")

# Or delete existing
model.delete_adapter("default")
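
When adapters are loaded programmatically, name collisions can be avoided by deriving a free name from the ones already registered. This is a minimal sketch. `unique_adapter_name` is a hypothetical helper, and the `_2`, `_3` suffix scheme is an arbitrary choice rather than a PEFT convention.

```python
def unique_adapter_name(existing_names, base="adapter"):
    """Return `base`, or `base_2`, `base_3`, ... if the name is already taken."""
    taken = set(existing_names)
    if base not in taken:
        return base
    suffix = 2
    while f"{base}_{suffix}" in taken:
        suffix += 1
    return f"{base}_{suffix}"
```

For example, `unique_adapter_name(model.peft_config.keys(), "task1")` yields a name that can be passed as `adapter_name` to `model.load_adapter`.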

Mixed Precision Adapters

Problem: Adapters were trained with different dtypes.

Fix:

# Convert adapter precision
model = PeftModel.from_pretrained(base_model, "./adapter")
model = model.to(torch.bfloat16)

# Or load with specific dtype
model = PeftModel.from_pretrained(
    base_model,
    "./adapter",
    torch_dtype=torch.bfloat16
)

Performance Optimization

Memory Profiling

import torch

def print_memory():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB")

# Profile during training
print_memory()  # Before
model.train()
loss = model(**batch).loss
loss.backward()
print_memory()  # After

Speed Profiling

import time
import torch

def benchmark_generation(model, tokenizer, prompt, n_runs=5):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warmup
    model.generate(**inputs, max_new_tokens=10)
    torch.cuda.synchronize()

    # Benchmark
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        outputs = model.generate(**inputs, max_new_tokens=100)
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)

    tokens = outputs.shape[1] - inputs.input_ids.shape[1]
    avg_time = sum(times) / len(times)
    print(f"Speed: {tokens/avg_time:.2f} tokens/sec")

# Compare adapter vs merged
benchmark_generation(adapter_model, tokenizer, "Hello")
benchmark_generation(merged_model, tokenizer, "Hello")
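
With only a handful of runs, a single slow iteration can skew the mean, so it may help to report the median and the spread as well. This is a minimal sketch, assuming the per-run `times` list is returned from `benchmark_generation` rather than only printed. `summarize_throughput` is a hypothetical helper.

```python
import statistics


def summarize_throughput(times, new_tokens):
    """Summarize per-run generation times (seconds) as tokens/sec statistics."""
    if not times:
        raise ValueError("times must contain at least one measurement")
    rates = [new_tokens / t for t in times]
    return {
        "median_tok_s": statistics.median(rates),
        "min_tok_s": min(rates),
        "max_tok_s": max(rates),
    }
```

A wide gap between `min_tok_s` and `max_tok_s` suggests the measurement is noisy and that more runs or a longer warmup may be needed before comparing adapter and merged models.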

Getting Help

1. Check PEFT GitHub Issues: https://github.com/huggingface/peft/issues
2. HuggingFace Forums: https://discuss.huggingface.co/
3. PEFT Documentation: https://huggingface.co/docs/peft

Debugging Template

When reporting issues, include:

# System info
import peft
import transformers
import torch

print(f"PEFT: {peft.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.version.cuda}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")

# Config
print(model.peft_config)
model.print_trainable_parameters()
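
If one of the libraries fails to import, the version report can still be produced from package metadata. This is a minimal sketch using only the standard library. The package list is an assumption: `accelerate` and `bitsandbytes` are included because they commonly matter for QLoRA setups, and the list can be adjusted as needed.

```python
from importlib import metadata


def collect_versions(
    packages=("peft", "transformers", "torch", "accelerate", "bitsandbytes"),
):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in collect_versions().items():
    print(f"{name}: {version}")
```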

How it compares

Use peft-fine-tuning for Hugging Face adapter variant selection; use gptq when the goal is post-training weight quantization rather than adapter training.

FAQ

When should developers choose DoRA over LoRA in PEFT?

peft-fine-tuning recommends DoRA when instruction-following quality matters because weight-decomposed adaptation often outperforms standard LoRA, accepting roughly 10% higher memory from magnitude vectors via use_dora=True.

Which modules does peft-fine-tuning target by default?

peft-fine-tuning sets LoraConfig target_modules to q_proj, v_proj, k_proj, and o_proj with r=16 and lora_alpha=32, then applies adapters through get_peft_model on the causal LM.

Is Peft Fine Tuning safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
