Transformers

Name: Transformers
Author: k-dense-ai

k-dense-ai/scientific-agent-skills

908 installs
32k repo stars
Updated July 29, 2026

Transformers is a scientific agent skill that teaches developers to generate high-quality, controllable text from Hugging Face language models using model.generate() inside custom agents and automation workflows.

About

Transformers is a text-generation skill from k-dense-ai/scientific-agent-skills built around the Hugging Face Transformers library. It documents generating text with model.generate(), controlling output through generation strategies and parameters, and choosing between the Pipeline API for quick prototyping versus direct AutoModelForCausalLM and AutoTokenizer usage for custom preprocessing and decoding control. Examples use gpt2 with AutoModelForCausalLM.from_pretrained, tokenizer input handling, and max_new_tokens generation. Developers reach for Transformers when building scientific agents or Python automation that needs fine-grained control over LM decoding rather than opaque API calls. The skill bridges Pipeline convenience and low-level generate() customization.

Full control via model.generate() with custom tokenization and decoding
Three core generation strategies: Greedy Decoding, Sampling, and Beam Search
Supports temperature, top_k, top_p, and max_new_tokens for precise output tuning
Pipeline API for rapid prototyping versus direct model.generate() for advanced preprocessing
Deterministic vs creative output modes with documented use cases

Transformers by the numbers

908 all-time installs (skills.sh)
+41 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #1,157 of 16,570 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 29, 2026 (Skillselion catalog sync)

npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill transformers

Installs	908
repo stars	★ 32k
Security audit	2 / 3 scanners passed
Last updated	July 29, 2026
Repository	k-dense-ai/scientific-agent-skills ↗

How do you generate text with Hugging Face Transformers?

Generate high-quality, controllable text from language models inside custom agents and automation workflows.

Who is it for?

Python developers building scientific agents or ML automation who need direct control over Hugging Face model.generate() decoding parameters.

Skip if: You only need hosted LLM API calls and do not load models locally, tokenize inputs yourself, or write custom decoding logic.

When should I use this skill?

Use it when a developer asks to generate text with Transformers, configure model.generate() parameters, or choose between the Pipeline API and direct generation control.

What you get

Python generation scripts using model.generate(), tokenized inputs, and configured decoding parameters.

Text generation script
Configured decoding parameters

Files

SKILL.md (Markdown, GitHub ↗)

Transformers

Overview

The Hugging Face Transformers library provides access to thousands of pre-trained models for tasks across NLP, computer vision, audio, and multimodal domains. Use this skill to load models, perform inference, and fine-tune on custom data.

Installation

Tested against transformers 5.12.0 (current PyPI release; June 2026). Requires Python 3.10+; the torch extra currently requires PyTorch 2.4+.

uv pip install "transformers[torch]==5.12.0" huggingface_hub==1.19.0 datasets==5.0.0 evaluate==0.4.6 accelerate==1.14.0

For vision tasks, add:

uv pip install timm==1.0.27 pillow==12.2.0

For audio tasks, add:

uv pip install librosa==0.11.0 soundfile==0.14.0

These pins are for reproducible examples. For exploratory work, loosen them only after checking the Transformers and Hub release notes for API changes.

Check your version:

import transformers
print(transformers.__version__)

Authentication

Many models on the Hugging Face Hub are gated or private. Authenticate before loading them.

Recommended: CLI login (stores token in ~/.cache/huggingface/token):

hf auth login

Python:

from huggingface_hub import login
login()  # Interactive prompt; do not hardcode tokens in scripts

Servers / CI: set HF_TOKEN in the environment (never commit tokens to git or store them in shell profiles):

export HF_TOKEN="..."  # Read token from a secret manager, not source code

Get tokens at: https://huggingface.co/settings/tokens

Security: Never paste tokens into notebooks, repos, or shared configs. Prefer hf auth login over exporting tokens in .bashrc or .zshrc.

Use the narrowest token scope that works: read for private or gated model downloads, write only for uploads. If a long-running environment should not send the stored token on every Hub request, set HF_HUB_DISABLE_IMPLICIT_TOKEN=1 and pass a token only where authentication is required.
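The environment-variable approach above can be sketched in a few lines of plain Python. This is a minimal illustration, not part of the skill: the helper names get_hf_token and describe_token are hypothetical, and the example environments are made up. The returned value could then be passed wherever a token argument is accepted.

```python
import os


def get_hf_token(env=None):
    """Return the Hub token from the environment, or None if unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def describe_token(token):
    """Describe the token for logs without revealing its value."""
    if token is None:
        return "HF_TOKEN is not set"
    return f"HF_TOKEN is set ({len(token)} characters)"


# Hypothetical environments, used only to exercise the helpers.
print(describe_token(get_hf_token({"HF_TOKEN": "hf_example"})))
print(describe_token(get_hf_token({})))
```

Logging only the length keeps the secret out of CI output while still confirming that it was picked up.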

Transformers v5

Transformers v5 is PyTorch-only (TensorFlow and JAX backends were removed). For upgrades from v4, see the v5 migration guide. New projects should pair transformers 5.x with huggingface_hub 1.x.

Gated or custom architectures: accept the model license on the Hub, then load with trust_remote_code=True only when the model card requires custom code you have reviewed.

Cache location: set HF_HOME for all Hugging Face caches, or HF_HUB_CACHE just for Hub files. Use HF_HUB_OFFLINE=1 only after required model snapshots are already cached.
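To see how the two cache variables interact, the sketch below resolves the Hub cache directory from an environment mapping. It is a simplified illustration under stated assumptions: resolve_hub_cache is a hypothetical helper, it does not call the library, and it ignores other variables (such as XDG_CACHE_HOME) that can also influence the default.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None):
    """Pick the Hub cache directory: HF_HUB_CACHE wins, else $HF_HOME/hub."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Assumed default location when HF_HOME is not set.
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


print(resolve_hub_cache({"HF_HOME": "/data/hf"}))
print(resolve_hub_cache({"HF_HOME": "/data/hf", "HF_HUB_CACHE": "/scratch/hub"}))
```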

Quick Start

Use the Pipeline API for fast inference without manual configuration:

from transformers import pipeline

# Text generation (prefer max_new_tokens for causal LMs)
generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B")
result = generator("The future of AI is", max_new_tokens=50)

# Text classification
classifier = pipeline("text-classification")
result = classifier("This movie was excellent!")

# Question answering
qa = pipeline("question-answering")
result = qa(question="What is AI?", context="AI is artificial intelligence...")

Core Capabilities

1. Pipelines for Quick Inference

Use for simple, optimized inference across many tasks. Supports text generation, classification, NER, question answering, summarization, translation, image classification, object detection, audio classification, and more.

When to use: Quick prototyping, simple inference tasks, no custom preprocessing needed.

See references/pipelines.md for comprehensive task coverage and optimization.

2. Model Loading and Management

Load pre-trained models with fine-grained control over configuration, device placement, and precision.

When to use: Custom model initialization, advanced device management, model inspection.

See references/models.md for loading patterns and best practices.

3. Text Generation

Generate text with LLMs using various decoding strategies (greedy, beam search, sampling) and control parameters (temperature, top-k, top-p).

When to use: Creative text generation, code generation, conversational AI, text completion.

See references/generation.md for generation strategies and parameters.

4. Training and Fine-Tuning

Fine-tune pre-trained models on custom datasets using the Trainer API with automatic mixed precision, distributed training, and logging.

When to use: Task-specific model adaptation, domain adaptation, improving model performance.

See references/training.md for training workflows and best practices.

5. Tokenization

Convert text to tokens and token IDs for model input, with padding, truncation, and special token handling.

When to use: Custom preprocessing pipelines, understanding model inputs, batch processing.

See references/tokenizers.md for tokenization details.

Common Patterns

Pattern 1: Simple Inference

For straightforward tasks, use pipelines:

pipe = pipeline("task-name", model="model-id")
output = pipe(input_data)

Pattern 2: Custom Model Usage

For advanced control, load model and tokenizer separately:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForCausalLM.from_pretrained("model-id", device_map="auto")

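# Move inputs to the same device as the model (device_map="auto" may place it on a GPU)
inputs = tokenizer("text", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)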

Pattern 3: Fine-Tuning

For task adaptation, use Trainer:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()

Reference Documentation

For detailed information on specific components:

Pipelines: references/pipelines.md - All supported tasks and optimization
Models: references/models.md - Loading, saving, and configuration
Generation: references/generation.md - Text generation strategies and parameters
Training: references/training.md - Fine-tuning with Trainer API
Tokenizers: references/tokenizers.md - Tokenization and preprocessing

Text Generation

Overview

Generate text with language models using the generate() method. Control output quality and style through generation strategies and parameters.

For quick prototyping, the Pipeline API wraps tokenization and generate(); use model.generate() directly when you need custom preprocessing or decoding control.

Basic Generation

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize input
inputs = tokenizer("Once upon a time", return_tensors="pt")

# Generate
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Generation Strategies

Greedy Decoding

Select highest probability token at each step (deterministic):

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False  # Greedy decoding (default)
)

Use for: Factual text, translations, where determinism is needed.
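To make the "highest probability token at each step" rule concrete, here is a toy greedy decoder over a hand-written next-token table. Everything in it is hypothetical (the table, the tokens, the greedy_decode helper); it illustrates the selection rule only and is not how the library implements generation.

```python
# Hypothetical next-token probabilities for a tiny vocabulary.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def greedy_decode(start="<s>", max_new_tokens=5):
    """Append the single most likely next token until </s> or the length limit."""
    tokens = [start]
    for _ in range(max_new_tokens):
        options = NEXT_TOKEN_PROBS[tokens[-1]]
        best = max(options, key=options.get)
        tokens.append(best)
        if best == "</s>":
            break
    return tokens


print(greedy_decode())
```

Because there is no randomness, repeated calls always return the same sequence.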

Sampling

Randomly sample from probability distribution:

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)

Use for: Creative writing, diverse outputs, open-ended generation.

Beam Search

Explore multiple hypotheses in parallel:

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True
)

Use for: Translations, summarization, where quality is critical.
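The idea of keeping several hypotheses alive can be shown with a toy beam search over a hand-written next-token table. This is a sketch under simplifying assumptions: the table and the beam_search helper are hypothetical, scores are plain summed log-probabilities, and there is no length penalty as in the library.

```python
import math

# Hypothetical next-token probabilities for a tiny vocabulary.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams=2, max_new_tokens=5, start="<s>"):
    """Keep the num_beams best partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams


best_tokens, best_score = beam_search(num_beams=2)[0]
print(best_tokens, round(math.exp(best_score), 2))
```

With two beams the search keeps "a" alive after the first step and ends on a sequence with higher total probability than the one a single beam (greedy) would pick, which is the trade-off beam search makes for the extra compute.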

Contrastive Search

Balance quality and diversity:

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    penalty_alpha=0.6,
    top_k=4
)

Use for: Long-form generation, reducing repetition.

Key Parameters

Length Control

max_new_tokens: Maximum tokens to generate

max_new_tokens=100  # Generate up to 100 new tokens

max_length: Maximum total length (input + output)

max_length=512  # Total sequence length

min_new_tokens: Minimum tokens to generate

min_new_tokens=50  # Force at least 50 tokens

min_length: Minimum total length

min_length=100
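The difference between max_new_tokens and max_length is easiest to see as arithmetic on the prompt length. The helper below is a hypothetical sketch (new_token_budget is not a library function); it assumes that max_new_tokens takes precedence when both are given, which matches my understanding of the library's behavior.

```python
def new_token_budget(prompt_len, max_new_tokens=None, max_length=None):
    """Upper bound on generated tokens implied by the length settings."""
    if max_new_tokens is not None:
        # Counts only generated tokens, independent of the prompt length.
        return max_new_tokens
    if max_length is not None:
        # Counts prompt + generated tokens, so long prompts leave less room.
        return max(max_length - prompt_len, 0)
    raise ValueError("set max_new_tokens or max_length")


print(new_token_budget(prompt_len=12, max_length=50))
print(new_token_budget(prompt_len=12, max_new_tokens=100))
print(new_token_budget(prompt_len=600, max_length=512))
```

The last call shows why max_new_tokens is usually the safer choice: a prompt longer than max_length leaves no budget at all.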

Temperature

Controls randomness (only with sampling):

temperature=1.0   # Default, balanced
temperature=0.7   # More focused, less random
temperature=1.5   # More creative, more random

Lower temperature → more deterministic
Higher temperature → more random
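The effect of temperature can be checked numerically: logits are divided by the temperature before the softmax. The sketch below uses only the standard library; the logits are made-up values and softmax_with_temperature is a hypothetical helper, not the library's code.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for temperature in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token; higher temperatures flatten the distribution.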

Top-K Sampling

Consider only top K most likely tokens:

do_sample=True
top_k=50  # Sample from top 50 tokens

Common values: 40-100 for balanced output, 10-20 for focused output.
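As a small worked example of top-k filtering, the sketch below keeps the k most likely tokens and renormalizes them. The probabilities are made up and top_k_filter is a hypothetical helper operating on a plain list rather than model logits.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; the rest get probability 0."""
    if k <= 0:
        raise ValueError("k must be positive")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
print([round(p, 3) for p in top_k_filter(probs, k=2)])
```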

Top-P (Nucleus) Sampling

Consider only the smallest set of most likely tokens whose cumulative probability reaches P:

do_sample=True
top_p=0.95  # Sample from smallest set with 95% cumulative probability

Common values: 0.9-0.95 for balanced, 0.7-0.85 for focused.
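Nucleus filtering can be sketched the same way: sort tokens by probability, keep adding them until the running total reaches top_p, then renormalize. The distribution is made up and top_p_filter is a hypothetical helper, not the library's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_set = set(kept)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept_set else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
print([round(p, 3) for p in top_p_filter(probs, top_p=0.9)])
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: a peaked distribution keeps few tokens, a flat one keeps many.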

Repetition Penalty

Discourage repetition:

repetition_penalty=1.2  # Penalize repeated tokens

Values: 1.0 = no penalty, 1.2-1.5 = moderate, 2.0+ = strong penalty.
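The sketch below shows one common form of the repetition penalty: for every token that has already been generated, a positive logit is divided by the penalty and a negative logit is multiplied by it, so the token becomes less likely either way. The helper name and the values are hypothetical; the library's processor may differ in details.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the score of every token id that already appears in generated_ids."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both reduce it.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


logits = [2.0, -1.0, 0.5]  # hypothetical scores for token ids 0, 1, 2
print(apply_repetition_penalty(logits, generated_ids=[0, 1, 1], penalty=2.0))
```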

Beam Search Parameters

num_beams: Number of beams

num_beams=5  # Keep 5 hypotheses

early_stopping: Stop the beam search as soon as num_beams finished candidates exist

early_stopping=True

no_repeat_ngram_size: Prevent n-gram repetition

no_repeat_ngram_size=3  # Don't repeat any 3-gram
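To illustrate what no_repeat_ngram_size blocks, the sketch below computes which next tokens would complete an n-gram that already occurs in the generated sequence. It works on plain Python lists; banned_next_tokens is a hypothetical helper, not a library function.

```python
def banned_next_tokens(generated, n):
    """Return tokens that would repeat an n-gram already present in generated."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = generated[i:i + n]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["the", "cat", "sat", "on", "the", "cat"]
print(banned_next_tokens(tokens, n=3))
```

Here the sequence ends in "the cat", and "the cat sat" has already occurred, so "sat" is the one token that may not come next.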

Output Control

num_return_sequences: Generate multiple outputs

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    num_return_sequences=3  # Return 3 different sequences
)

pad_token_id: Specify padding token

pad_token_id=tokenizer.eos_token_id

eos_token_id: Stop generation at specific token

eos_token_id=tokenizer.eos_token_id

Advanced Features

Batch Generation

Generate for multiple prompts:

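# GPT-2 has no pad token; reuse EOS and pad on the left for decoder-only models
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = ["Hello, my name is", "Once upon a time"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)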

outputs = model.generate(**inputs, max_new_tokens=50)

for i, output in enumerate(outputs):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Prompt {i}: {text}\n")

Streaming Generation

Stream tokens as generated:

from transformers import TextIteratorStreamer
from threading import Thread

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)

generation_kwargs = dict(
    inputs,
    streamer=streamer,
    max_new_tokens=100
)

thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for text in streamer:
    print(text, end="", flush=True)

thread.join()
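The reason generation runs in a separate thread is that the caller needs to consume tokens while they are still being produced. The standard-library sketch below shows the same producer/consumer shape with a queue; produce_tokens stands in for model.generate and all names are hypothetical, so this says nothing about the streamer's actual internals.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def produce_tokens(tokens, out_queue):
    """Stand-in for generation: push each piece of text, then the sentinel."""
    for token in tokens:
        out_queue.put(token)
    out_queue.put(_DONE)


def stream(tokens):
    """Yield pieces of text as the background thread produces them."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=produce_tokens, args=(tokens, out_queue))
    worker.start()
    while True:
        item = out_queue.get()  # blocks until the producer has something
        if item is _DONE:
            break
        yield item
    worker.join()


pieces = list(stream(["Once", " upon", " a", " time"]))
print("".join(pieces))
```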

Constrained Generation

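Force specific words to appear in the output (this uses constrained beam search, so num_beams must be greater than 1):

# Each listed word must appear somewhere in the generated text, not necessarily at the start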
force_words = ["Paris", "France"]
force_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in force_words]

outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,
    num_beams=5
)

Guidance and Control

Prevent bad words:

bad_words = ["offensive", "inappropriate"]
bad_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in bad_words]

outputs = model.generate(
    **inputs,
    bad_words_ids=bad_words_ids
)

Generation Config

Save and reuse generation parameters:

from transformers import GenerationConfig

# Create config
generation_config = GenerationConfig(
    max_new_tokens=100,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True
)

# Save
generation_config.save_pretrained("./my_generation_config")

# Load and use
generation_config = GenerationConfig.from_pretrained("./my_generation_config")
outputs = model.generate(**inputs, generation_config=generation_config)

Model-Specific Generation

Chat Models

Use chat templates:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

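inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # Return input_ids and attention_mask together
    return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, not the prompt
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)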

Encoder-Decoder Models

For T5, BART, etc.:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# T5 uses task prefixes
input_text = "translate English to French: Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

Optimization

Caching

Enable KV cache for faster generation:

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True  # Default, faster generation
)

Static Cache

For fixed sequence lengths:

from transformers import StaticCache

cache = StaticCache(model.config, max_batch_size=1, max_cache_len=1024, device="cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    past_key_values=cache
)

Attention Implementation

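Use FlashAttention-2 for speed (requires the flash-attn package, a supported GPU, and half precision):

import torch

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    dtype=torch.bfloat16,  # FlashAttention-2 only supports float16 and bfloat16
    attn_implementation="flash_attention_2"
)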

Generation Recipes

Creative Writing

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2
)

Factual Generation

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=False,  # Greedy
    repetition_penalty=1.1
)

Diverse Outputs

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=5,
    num_return_sequences=5,
    temperature=1.5,
    do_sample=True
)

Long-Form Generation

outputs = model.generate(
    **inputs,
    max_new_tokens=1000,
    penalty_alpha=0.6,  # Contrastive search
    top_k=4,
    repetition_penalty=1.2
)

Translation/Summarization

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=3
)

Common Issues

Repetitive output:

Increase repetition_penalty (1.2-1.5)
Use no_repeat_ngram_size (2-3)
Try contrastive search
Lower temperature

Poor quality:

Use beam search (num_beams=5)
Lower temperature
Adjust top_k/top_p

Too deterministic:

Enable sampling (do_sample=True)
Increase temperature (0.7-1.0)
Adjust top_k/top_p

Slow generation:

Reduce batch size
Enable use_cache=True
Use Flash Attention
Reduce max_new_tokens

Best Practices

1. Start with defaults: Then tune based on output
2. Use appropriate strategy: Greedy for factual, sampling for creative
3. Set max_new_tokens: Avoid unnecessarily long generation
4. Enable caching: For faster sequential generation
5. Tune temperature: Most impactful parameter for sampling
6. Use beam search carefully: Slower but often higher quality
7. Fix the random seed: For reproducibility with sampling (for example with transformers.set_seed); vary it to explore different outputs
8. Monitor memory: Large beams use significant memory

Model Loading and Management

Overview

The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.

Loading Models

AutoModel Classes

Use AutoModel classes for automatic architecture selection:

from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM

# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")

# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

Common AutoModel Classes

NLP Tasks:

AutoModelForSequenceClassification: Text classification, sentiment analysis
AutoModelForTokenClassification: NER, POS tagging
AutoModelForQuestionAnswering: Extractive QA
AutoModelForCausalLM: Text generation (GPT, Llama)
AutoModelForMaskedLM: Masked language modeling (BERT)
AutoModelForSeq2SeqLM: Translation, summarization (T5, BART)

Vision Tasks:

AutoModelForImageClassification: Image classification
AutoModelForObjectDetection: Object detection
AutoModelForImageSegmentation: Image segmentation

Audio Tasks:

AutoModelForAudioClassification: Audio classification
AutoModelForSpeechSeq2Seq: Speech recognition

Multimodal:

AutoModelForVision2Seq: Image captioning, VQA

Loading Parameters

Basic Parameters

pretrained_model_name_or_path: Model identifier or local path

model = AutoModel.from_pretrained("bert-base-uncased")  # From Hub
model = AutoModel.from_pretrained("./local/model/path")  # From disk

num_labels: Number of output labels for classification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)

cache_dir: Custom cache location

model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")

Device Management

device_map: Automatic device allocation for large models

# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)

# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    device_map="sequential"
)

# Custom device map
device_map = {
    "transformer.layers.0": 0,      # GPU 0
    "transformer.layers.1": 1,      # GPU 1
    "transformer.layers.2": "cpu",  # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)

Manual device placement:

import torch
model = AutoModel.from_pretrained("model-id")
model.to("cuda:0")  # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

Precision Control

dtype: Set model precision (preferred in v5; torch_dtype still works but is deprecated)

import torch

# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", dtype=torch.float16)

# BFloat16 (better range than float16)
model = AutoModel.from_pretrained("model-id", dtype=torch.bfloat16)

# Auto (use original dtype)
model = AutoModel.from_pretrained("model-id", dtype="auto")
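The memory impact of the dtype choice can be estimated with simple arithmetic: parameter count times bytes per parameter. The sketch below covers weights only; it ignores activations, the KV cache, and framework overhead, and weight_memory_gib is a hypothetical helper.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# A hypothetical 7-billion-parameter model.
for dtype in ("float32", "float16", "int8"):
    print(dtype, round(weight_memory_gib(7_000_000_000, dtype), 1), "GiB")
```

Halving the bytes per parameter halves the weight footprint, which is why float16 or bfloat16 is often the first thing to try when a model does not fit.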

Attention Implementation

attn_implementation: Choose attention mechanism

# Scaled Dot Product Attention (PyTorch-native; selected by default when the model supports it)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package and float16/bfloat16 weights)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")

# Eager (plain PyTorch implementation, most compatible)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")

Memory Optimization

low_cpu_mem_usage: Reduce CPU memory during loading

model = AutoModelForCausalLM.from_pretrained(
    "large-model-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

BitsAndBytesConfig: 8-bit and 4-bit quantization (requires optional bitsandbytes; uv pip install bitsandbytes==0.49.2)

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    device_map="auto",
    quantization_config=quantization_config
)

4-bit QLoRA-style loading: use BitsAndBytesConfig instead of direct load_in_4bit arguments

import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    quantization_config=quantization_config,
    device_map="auto"
)

Model Configuration

Loading with Custom Config

from transformers import AutoConfig, AutoModel

# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2

# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)

Initializing from Config Only

config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config)  # Random weights

Model Modes

Training vs Evaluation Mode

Models load in evaluation mode by default:

model = AutoModel.from_pretrained("model-id")
print(model.training)  # False

# Switch to training mode
model.train(True)

# Switch back to evaluation mode (equivalent to eval mode on nn.Module)
model.train(False)

Evaluation mode disables dropout and makes batch normalization layers use their running statistics. model.train(False) is equivalent to model.eval() in PyTorch.

Saving Models

Save Locally

model.save_pretrained("./my_model")

This creates:

config.json: Model configuration
pytorch_model.bin or model.safetensors: Model weights

Save to Hugging Face Hub

model.push_to_hub("username/model-name")

# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")

# Private repository
model.push_to_hub("username/model-name", private=True)

Model Inspection

Parameter Count

# Total parameters
total_params = model.num_parameters()

# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)

print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")

Memory Footprint

memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")

Model Architecture

print(model)  # Print full architecture

# Access specific components
print(model.config)
print(model.base_model)

Forward Pass

Basic inference:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")

inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)

logits = outputs.logits
predictions = logits.argmax(dim=-1)
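The argmax above returns a class index; turning logits into a readable label and a confidence score takes a softmax and a lookup table. The sketch below uses plain Python lists with made-up logits and a hypothetical label mapping of the kind a model config stores as id2label.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, id2label):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits and labels for a two-class sentiment model.
label, score = predict([-1.2, 2.3], {0: "NEGATIVE", 1: "POSITIVE"})
print(label, round(score, 3))
```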

Model Formats

SafeTensors vs PyTorch

SafeTensors is faster and safer:

# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)

# Load either format automatically
model = AutoModel.from_pretrained("./model")

ONNX Export

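Export for optimized inference. The legacy transformers.onnx module is deprecated; the example below assumes the separate Optimum package (uv pip install "optimum[onnxruntime]"), whose version is not pinned here:

from optimum.onnxruntime import ORTModelForSequenceClassification

# export=True converts the PyTorch checkpoint to ONNX while loading
ort_model = ORTModelForSequenceClassification.from_pretrained("model-id", export=True)

# Write the exported ONNX model and its config to the output directory
ort_model.save_pretrained("./onnx_model")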

Best Practices

1. Use AutoModel classes: Automatic architecture detection
2. Specify `dtype` explicitly: Control precision and memory (avoid deprecated torch_dtype in new code)
3. Use device_map="auto": For large models
4. Enable low_cpu_mem_usage: When loading large models
5. Use safetensors format: Faster and safer serialization
6. Check model.training: Ensure correct mode for task
7. Consider quantization: For deployment on resource-constrained devices
8. Cache models locally: Set HF_HOME (Hub cache at $HF_HOME/hub)

Common Issues

CUDA out of memory:

import torch
from transformers import BitsAndBytesConfig

# Use smaller precision
model = AutoModel.from_pretrained("model-id", dtype=torch.float16)

# Or use quantization
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained("model-id", quantization_config=quantization_config)

# Or use CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")

Slow loading:

# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)

Model not found:

# Verify the model ID on huggingface.co
# Check authentication for private models
from huggingface_hub import login
login()

Pipeline API Reference

Overview

Pipelines provide the simplest way to use pre-trained models for inference. They abstract away tokenization, model loading, and post-processing, offering a unified interface for dozens of tasks.

Basic Usage

Create a pipeline by specifying a task:

from transformers import pipeline

# Auto-select default model for task
pipe = pipeline("text-classification")
result = pipe("This is great!")

Or specify a model:

pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

Supported Tasks

Natural Language Processing

text-generation: Generate text continuations

generator = pipeline("text-generation", model="gpt2")
output = generator("Once upon a time", max_new_tokens=50, num_return_sequences=2)

text-classification: Classify text into categories

classifier = pipeline("text-classification")
result = classifier("I love this product!")  # Returns label and score

token-classification: Label individual tokens (NER, POS tagging)

ner = pipeline("token-classification", model="dslim/bert-base-NER")
entities = ner("Hugging Face is based in New York City")

question-answering: Extract answers from context

qa = pipeline("question-answering")
result = qa(question="What is the capital?", context="Paris is the capital of France.")

fill-mask: Predict masked tokens

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("Paris is the [MASK] of France")

summarization: Summarize long texts

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer("Long article text...", max_length=130, min_length=30)

translation: Translate between languages

translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you?")

zero-shot-classification: Classify without training data

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "This is a course about Python programming",
    candidate_labels=["education", "politics", "business"]
)

sentiment-analysis: Alias for text-classification focused on sentiment

sentiment = pipeline("sentiment-analysis")
result = sentiment("This product exceeded my expectations!")

Computer Vision

image-classification: Classify images

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("path/to/image.jpg")
# Or use PIL Image or URL
from PIL import Image
result = classifier(Image.open("image.jpg"))

object-detection: Detect objects in images

detector = pipeline("object-detection", model="facebook/detr-resnet-50")
results = detector("image.jpg")  # Returns bounding boxes and labels

image-segmentation: Segment images

segmenter = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic")
segments = segmenter("image.jpg")

depth-estimation: Estimate depth from images

depth = pipeline("depth-estimation", model="Intel/dpt-large")
result = depth("image.jpg")

zero-shot-image-classification: Classify images without training

classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
result = classifier("image.jpg", candidate_labels=["cat", "dog", "bird"])

Audio

automatic-speech-recognition: Transcribe speech

asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
text = asr("audio.mp3")

audio-classification: Classify audio

classifier = pipeline("audio-classification", model="MIT/ast-finetuned-audioset-10-10-0.4593")
result = classifier("audio.wav")

text-to-speech: Generate speech from text (with specific models)

tts = pipeline("text-to-speech", model="microsoft/speecht5_tts")
audio = tts("Hello, this is a test")

Multimodal

visual-question-answering: Answer questions about images

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
result = vqa(image="image.jpg", question="What color is the car?")

document-question-answering: Answer questions about documents

doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
result = doc_qa(image="document.png", question="What is the invoice number?")

image-to-text: Generate captions for images

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("image.jpg")

Pipeline Parameters

Common Parameters

model: Model identifier or path

pipe = pipeline("task", model="model-id")

device: GPU device index (-1 for CPU, 0+ for GPU)

pipe = pipeline("task", device=0)  # Use first GPU

device_map: Automatic device allocation for large models

pipe = pipeline("task", model="large-model", device_map="auto")

dtype: Model precision (reduces memory; torch_dtype is deprecated but still accepted)

import torch
pipe = pipeline("task", dtype=torch.float16)

batch_size: Process multiple inputs at once

pipe = pipeline("task", batch_size=8)
results = pipe(["text1", "text2", "text3"])

Backend: Transformers v5 pipelines use PyTorch only (TensorFlow/JAX backends were removed in v5).

Batch Processing

Process multiple inputs efficiently:

classifier = pipeline("text-classification")
texts = ["Great product!", "Terrible experience", "Just okay"]
results = classifier(texts)

For large datasets, use generators or KeyDataset:

from transformers.pipelines.pt_utils import KeyDataset
import datasets

dataset = datasets.load_dataset("dataset-name", split="test")
pipe = pipeline("task", device=0)

for output in pipe(KeyDataset(dataset, "text")):
    print(output)

Performance Optimization

GPU Acceleration

Always specify device for GPU usage:

pipe = pipeline("task", device=0)

Mixed Precision

Use float16 to roughly halve model memory; on GPUs with fast half-precision support this often speeds up inference as well:

import torch
pipe = pipeline("task", dtype=torch.float16, device=0)

Batching Guidelines

CPU: Usually skip batching
GPU with variable lengths: May reduce efficiency
GPU with similar lengths: Significant speedup
Real-time applications: Skip batching (increases latency)

# Good for throughput
pipe = pipeline("task", batch_size=32, device=0)
results = pipe(list_of_texts)
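
The guideline about similar lengths can be approximated by sorting inputs by length before batching, so each batch pads to a nearby maximum. The following is a minimal sketch in plain Python; `bucket_by_length`, `padded_slots`, and the whitespace word count used as a length proxy are illustrative assumptions, not part of the Transformers API.

```python
def bucket_by_length(texts, batch_size):
    """Group texts into batches of similar length (hypothetical helper)."""
    # Whitespace word count stands in for token count; a real tokenizer would differ.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i].split()))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Keep the original index with each text so results can be restored to input order.
    return [[(i, texts[i]) for i in batch] for batch in batches]


def padded_slots(batches):
    """Count word slots if every batch is padded to its longest member."""
    return sum(len(batch) * max(len(text.split()) for _, text in batch) for batch in batches)


texts = ["a b c d e f g h", "a", "a b c d e f g", "a b"]
indexed = list(enumerate(texts))
naive = [indexed[j:j + 2] for j in range(0, len(indexed), 2)]
bucketed = bucket_by_length(texts, batch_size=2)
print(padded_slots(naive), padded_slots(bucketed))
```

In this toy input the bucketed batches need fewer padded slots than batching in arrival order, which is the effect the guideline describes.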

Streaming Output

For text generation, stream tokens as they're generated:

from transformers import AutoTokenizer, TextStreamer, pipeline

tokenizer = AutoTokenizer.from_pretrained("gpt2")
streamer = TextStreamer(tokenizer)
generator = pipeline("text-generation", model="gpt2", streamer=streamer)
generator("The future of AI", max_new_tokens=100)

Custom Pipeline Configuration

Specify tokenizer and model separately:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

Use custom pipeline classes:

from transformers import TextClassificationPipeline

class CustomPipeline(TextClassificationPipeline):
    def postprocess(self, model_outputs, **kwargs):
        # Custom post-processing
        return super().postprocess(model_outputs, **kwargs)

pipe = pipeline("text-classification", model="model-id", pipeline_class=CustomPipeline)

Input Formats

Pipelines accept various input types:

Text tasks: Strings or lists of strings

pipe("single text")
pipe(["text1", "text2"])

Image tasks: URLs, file paths, PIL Images, or numpy arrays

pipe("https://example.com/image.jpg")
pipe("local/path/image.png")
pipe(PIL.Image.open("image.jpg"))
pipe(numpy_array)

Audio tasks: File paths, numpy arrays, or raw waveforms

pipe("audio.mp3")
pipe(audio_array)

Error Handling

Handle common issues:

try:
    result = pipe(input_data)
except Exception as e:
    if "CUDA out of memory" in str(e):
        # Reduce batch size or use CPU
        pipe = pipeline("task", device=-1)
    elif "does not appear to have a file named" in str(e):
        # Model not found
        print("Check model identifier")
    else:
        raise
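
One way to act on the "reduce batch size" comment is to retry with progressively smaller batches. This is a minimal sketch, assuming the out-of-memory condition surfaces as a `RuntimeError` whose message contains "out of memory"; `run_with_backoff` and `fake_pipe` are hypothetical names, and `fake_pipe` stands in for a real pipeline so the example runs without a GPU.

```python
def run_with_backoff(pipe, inputs, batch_size=32, min_batch_size=1):
    """Call pipe(inputs, batch_size=...) and halve batch_size on OOM (hypothetical helper)."""
    while True:
        try:
            return pipe(inputs, batch_size=batch_size), batch_size
        except RuntimeError as err:
            # Only handle out-of-memory errors; re-raise everything else.
            if "out of memory" not in str(err).lower() or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


def fake_pipe(inputs, batch_size):
    """Stand-in for a pipeline: fails whenever batch_size exceeds 8 (assumption for the demo)."""
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory")
    return [text.upper() for text in inputs]


results, used = run_with_backoff(fake_pipe, ["a", "b"], batch_size=32)
print(results, used)
```

Unrelated errors and failures at the minimum batch size are re-raised rather than swallowed, mirroring the `else: raise` branch above.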

Best Practices

1. Use pipelines for prototyping: Fast iteration without boilerplate
2. Specify models explicitly: Default models may change
3. Enable GPU when available: Significant speedup
4. Use batching for throughput: When processing many inputs
5. Consider memory usage: Use float16 or smaller models for large batches
6. Cache models locally: Avoid repeated downloads

Tokenizers

Overview

Tokenizers convert text into numerical representations (tokens) that models can process. They handle special tokens, padding, truncation, and attention masks.
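
To make padding and attention masks concrete, here is a minimal sketch in plain Python of what happens when a batch of token ID lists is padded. The function `pad_batch` and the pad ID of 0 are assumptions for illustration; real tokenizers use their own `pad_token_id` and may pad on the left.

```python
def pad_batch(sequences, pad_id=0, pad_to_multiple_of=None):
    """Right-pad ID lists to a common length and build attention masks (hypothetical helper)."""
    target = max(len(seq) for seq in sequences)
    if pad_to_multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7592, 102], [101, 2129, 2024, 2017, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```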

Loading Tokenizers

AutoTokenizer

Automatically load the correct tokenizer for a model:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Load from local path:

tokenizer = AutoTokenizer.from_pretrained("./local/tokenizer/path")

Basic Tokenization

Encode Text

# Simple encoding
text = "Hello, how are you?"
tokens = tokenizer.encode(text)
print(tokens)  # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

# With text tokenization
tokens = tokenizer.tokenize(text)
print(tokens)  # ['hello', ',', 'how', 'are', 'you', '?']

Decode Tokens

token_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
text = tokenizer.decode(token_ids)
print(text)  # "[CLS] hello, how are you? [SEP]"

# Skip special tokens
text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(text)  # "hello, how are you?"

The `__call__` Method

Primary tokenization interface:

# Single text
inputs = tokenizer("Hello, how are you?")

# Returns dictionary with input_ids, attention_mask
print(inputs)
# {
#   'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
#   'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]
# }

Multiple texts:

texts = ["Hello", "How are you?"]
inputs = tokenizer(texts, padding=True, truncation=True)

Key Parameters

Return Tensors

return_tensors: Output format ("pt" for PyTorch, "np" for NumPy)

# PyTorch tensors (the usual choice for Transformers v5 workflows; without return_tensors you get Python lists)
inputs = tokenizer("text", return_tensors="pt")

# NumPy arrays
inputs = tokenizer("text", return_tensors="np")

Padding

padding: Pad sequences to same length

# Pad to longest sequence in batch
inputs = tokenizer(texts, padding=True)

# Pad to specific length
inputs = tokenizer(texts, padding="max_length", max_length=128)

# No padding
inputs = tokenizer(texts, padding=False)

pad_to_multiple_of: Pad to multiple of specified value

inputs = tokenizer(texts, padding=True, pad_to_multiple_of=8)

Truncation

truncation: Limit sequence length

# Truncate to max_length
inputs = tokenizer(text, truncation=True, max_length=512)

# Truncate first sequence in pairs
inputs = tokenizer(text1, text2, truncation="only_first")

# Truncate second sequence
inputs = tokenizer(text1, text2, truncation="only_second")

# Truncate longest first (default for pairs)
inputs = tokenizer(text1, text2, truncation="longest_first", max_length=512)
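
The `longest_first` strategy can be pictured as removing one token at a time from whichever sequence is currently longer until the pair fits. This is a minimal sketch in plain Python that ignores special tokens; `truncate_longest_first` is a hypothetical helper, and the tie-breaking rule (trim the second sequence on ties) is an assumption that may not match the library in every version.

```python
def truncate_longest_first(first, second, max_length):
    """Trim tokens from the end of the longer list until the pair fits (hypothetical helper)."""
    first, second = list(first), list(second)
    while len(first) + len(second) > max_length:
        if len(first) > len(second):
            first.pop()
        else:
            second.pop()  # ties trim the second sequence (assumption)
    return first, second


a, b = truncate_longest_first(list("abcdefgh"), list("xyz"), max_length=6)
print(a, b)
```

Here only the longer first sequence is shortened, because the second never becomes the longer of the two.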

Max Length

max_length: Maximum sequence length

inputs = tokenizer(text, max_length=512, truncation=True)

Additional Outputs

return_attention_mask: Include attention mask (default True)

inputs = tokenizer(text, return_attention_mask=True)

return_token_type_ids: Segment IDs for sentence pairs

inputs = tokenizer(text1, text2, return_token_type_ids=True)

return_offsets_mapping: Character position mapping (Fast tokenizers only)

inputs = tokenizer(text, return_offsets_mapping=True)

return_length: Include sequence lengths

inputs = tokenizer(texts, padding=True, return_length=True)

Special Tokens

Predefined Special Tokens

Access special tokens:

print(tokenizer.cls_token)      # [CLS] or <s>
print(tokenizer.sep_token)      # [SEP] or </s>
print(tokenizer.pad_token)      # [PAD]
print(tokenizer.unk_token)      # [UNK]
print(tokenizer.mask_token)     # [MASK]
print(tokenizer.eos_token)      # End of sequence
print(tokenizer.bos_token)      # Beginning of sequence

# Get IDs
print(tokenizer.cls_token_id)
print(tokenizer.sep_token_id)

Add Special Tokens

Manual control:

# Automatically add special tokens (default True)
inputs = tokenizer(text, add_special_tokens=True)

# Skip special tokens
inputs = tokenizer(text, add_special_tokens=False)

Custom Special Tokens

special_tokens_dict = {
    "additional_special_tokens": ["<CUSTOM>", "<SPECIAL>"]
}

num_added = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added} tokens")

# Resize model embeddings after adding tokens
model.resize_token_embeddings(len(tokenizer))

Sentence Pairs

Tokenize text pairs:

text1 = "What is the capital of France?"
text2 = "Paris is the capital of France."

# Automatically handles separation
inputs = tokenizer(text1, text2, padding=True, truncation=True)

# Results in: [CLS] text1 [SEP] text2 [SEP]

Batch Encoding

Process multiple texts:

texts = ["First text", "Second text", "Third text"]

# Basic batch encoding
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Access individual encodings
for i in range(len(texts)):
    input_ids = batch["input_ids"][i]
    attention_mask = batch["attention_mask"][i]

Fast Tokenizers

Use Rust-based tokenizers for speed:

from transformers import AutoTokenizer

# Automatically loads Fast version if available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if Fast
print(tokenizer.is_fast)  # True

# Force Fast tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Force slow (Python) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

Fast Tokenizer Features

Offset mapping (character positions):

inputs = tokenizer("Hello world", return_offsets_mapping=True)
print(inputs["offset_mapping"])
# [(0, 0), (0, 5), (6, 11), (0, 0)]  # [CLS], "Hello", "world", [SEP]

Token to word mapping:

encoding = tokenizer("Hello world")
word_ids = encoding.word_ids()
print(word_ids)  # [None, 0, 1, None]  # [CLS]=None, "Hello"=0, "world"=1, [SEP]=None

Saving Tokenizers

Save locally:

tokenizer.save_pretrained("./my_tokenizer")

Push to Hub:

tokenizer.push_to_hub("username/my-tokenizer")

Advanced Usage

Vocabulary

Access vocabulary:

vocab = tokenizer.get_vocab()
vocab_size = len(vocab)

# Get token for ID
token = tokenizer.convert_ids_to_tokens(100)

# Get ID for token
token_id = tokenizer.convert_tokens_to_ids("hello")

Encoding Details

Get detailed encoding information:

encoding = tokenizer("Hello world", return_tensors="pt")

# Original methods still available
tokens = encoding.tokens()
word_ids = encoding.word_ids()
sequence_ids = encoding.sequence_ids()

Custom Preprocessing

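Wrap the tokenizer for custom behavior (AutoTokenizer is a factory class and is not meant to be subclassed or instantiated directly):

def preprocess_and_tokenize(tokenizer, text, **kwargs):
    # Custom preprocessing before the normal tokenizer call
    text = text.lower().strip()
    return tokenizer(text, **kwargs)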

Chat Templates

For conversational models:

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"}
]

# Format for display or preprocessing
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)

# Tokenize directly for generation
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)

Common Patterns

Pattern 1: Simple Text Classification

texts = ["I love this!", "I hate this!"]
labels = [1, 0]

inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

# Use with model (requires: import torch)
outputs = model(**inputs, labels=torch.tensor(labels))

Pattern 2: Question Answering

question = "What is the capital?"
context = "Paris is the capital of France."

inputs = tokenizer(
    question,
    context,
    padding=True,
    truncation=True,
    max_length=384,
    return_tensors="pt"
)

Pattern 3: Text Generation

prompt = "Once upon a time"

inputs = tokenizer(prompt, return_tensors="pt")

# Generate
outputs = model.generate(
    **inputs,  # passes both input_ids and attention_mask
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id
)

# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

Pattern 4: Dataset Tokenization

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

# Apply to dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

Best Practices

1. Always specify return_tensors: For model input
2. Use padding and truncation: For batch processing
3. Set max_length explicitly: Prevent memory issues
4. Use Fast tokenizers: When available for speed
5. Handle pad_token: Set to eos_token if None for generation
6. Add special tokens: Leave enabled (default) unless specific reason
7. Resize embeddings: After adding custom tokens
8. Decode with skip_special_tokens: For cleaner output
9. Use batched processing: For efficiency with datasets
10. Save tokenizer with model: Ensure compatibility

Common Issues

Padding token not set:

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Sequence too long:

# Enable truncation
inputs = tokenizer(text, truncation=True, max_length=512)

Mismatched vocabulary:

# Always load tokenizer and model from same checkpoint
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModel.from_pretrained("model-id")

Attention mask issues:

# Ensure attention_mask is passed
outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"]
)

Training and Fine-Tuning

Overview

Fine-tune pre-trained models on custom datasets using the Trainer API. The Trainer handles training loops, gradient accumulation, mixed precision, logging, and checkpointing.

Metrics: use evaluate.load("metric_name") — the old datasets.load_metric API was removed.

Hub uploads: trainer.push_to_hub() requires authentication (hf auth login or HF_TOKEN).

Basic Fine-Tuning Workflow

Step 1: Load and Preprocess Data

from datasets import load_dataset

# Load dataset
dataset = load_dataset("yelp_review_full")
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

# Tokenize
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

Step 2: Load Model

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=5  # Number of classes
)

Step 3: Define Metrics

import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Step 4: Configure Training

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

Step 5: Create Trainer and Train

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()

# Evaluate
results = trainer.evaluate()
print(results)

Step 6: Save Model

trainer.save_model("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

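# Or push to Hub (the first argument of push_to_hub is the commit message, not the repo;
# set the target repo with hub_model_id="username/my-finetuned-model" in TrainingArguments)
trainer.push_to_hub()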

TrainingArguments Parameters

Essential Parameters

output_dir: Directory for checkpoints and logs

output_dir="./results"

num_train_epochs: Number of training epochs

num_train_epochs=3

per_device_train_batch_size: Batch size per GPU/CPU

per_device_train_batch_size=8

learning_rate: Optimizer learning rate

learning_rate=2e-5  # Common for BERT-style models
learning_rate=5e-5  # Common for smaller models

weight_decay: Weight decay coefficient (regularization applied by the AdamW optimizer)

weight_decay=0.01

Evaluation and Saving

eval_strategy: When to evaluate ("no", "steps", "epoch")

eval_strategy="epoch"  # Evaluate after each epoch
eval_strategy="steps"  # Evaluate every eval_steps

save_strategy: When to save checkpoints

save_strategy="epoch"
save_strategy="steps"
save_steps=500

load_best_model_at_end: Load best checkpoint after training

load_best_model_at_end=True
metric_for_best_model="accuracy"  # Metric to compare

Optimization

gradient_accumulation_steps: Accumulate gradients over multiple steps

gradient_accumulation_steps=4  # Effective batch size = batch_size * 4
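
As a quick check of the arithmetic, the effective batch size is the per-device batch size times the accumulation steps times the number of devices. A minimal sketch; `effective_batch_size` is a hypothetical helper and the device count is an assumed input.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps=1, num_devices=1):
    """Samples contributing to each optimizer step (hypothetical helper)."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


single_gpu = effective_batch_size(8, gradient_accumulation_steps=4)
two_gpus = effective_batch_size(8, gradient_accumulation_steps=4, num_devices=2)
print(single_gpu, two_gpus)
```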

fp16: Enable mixed precision (NVIDIA GPUs without native bfloat16)

fp16=True

bf16: Enable bfloat16 (preferred on Ampere+ and newer GPUs when supported)

bf16=True

gradient_checkpointing: Trade compute for memory

gradient_checkpointing=True  # Slower but uses less memory

optim: Optimizer choice

optim="adamw_torch"  # Default
optim="adamw_8bit"    # 8-bit Adam (requires bitsandbytes)
optim="adafactor"     # Memory-efficient alternative

Learning Rate Scheduling

lr_scheduler_type: Learning rate schedule

lr_scheduler_type="linear"       # Linear decay
lr_scheduler_type="cosine"       # Cosine annealing
lr_scheduler_type="constant"     # No decay
lr_scheduler_type="constant_with_warmup"

warmup_steps or warmup_ratio: Warmup period

warmup_steps=500
# Or
warmup_ratio=0.1  # 10% of total steps
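
To illustrate how warmup interacts with a linear schedule, the sketch below computes the learning rate at a given step: it rises linearly from 0 during warmup and then decays linearly to 0. The function `linear_schedule_lr` is a hypothetical helper written for illustration; the library's own scheduler may differ in rounding details.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr, warmup_steps):
    """Learning rate under linear warmup followed by linear decay (hypothetical helper)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 1000 total steps with 100 warmup steps corresponds to warmup_ratio=0.1.
schedule = {s: linear_schedule_lr(s, total_steps=1000, base_lr=2e-5, warmup_steps=100)
            for s in (0, 50, 100, 550, 1000)}
for step, lr in schedule.items():
    print(step, lr)
```

The peak learning rate is reached exactly when warmup ends, so a longer warmup shortens the decay phase for a fixed number of total steps.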

Logging

logging_dir: TensorBoard logs directory

logging_dir="./logs"

logging_steps: Log every N steps

logging_steps=10

report_to: Logging integrations

report_to=["tensorboard"]
report_to=["wandb"]
report_to=["tensorboard", "wandb"]

Distributed Training

ddp_backend: Distributed backend

ddp_backend="nccl"  # For multi-GPU

deepspeed: DeepSpeed config file

deepspeed="ds_config.json"

Data Collators

Handle dynamic padding and special preprocessing:

DataCollatorWithPadding

Pad sequences to longest in batch:

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

DataCollatorForLanguageModeling

For masked language modeling:

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

DataCollatorForSeq2Seq

For sequence-to-sequence tasks:

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

Custom Training

Custom Trainer

Override methods for custom behavior:

import torch
from transformers import Trainer

class CustomTrainer(Trainer):
    # class_weights is assumed to be a 1-D float tensor with one entry per label, defined elsewhere
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

        # Custom loss computation (weights must be on the same device as the logits)
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights.to(logits.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))

        return (loss, outputs) if return_outputs else loss
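
The `class_weights` value used in the custom loss is not defined in the snippet. One common way to derive it is inverse class frequency, sketched here in plain Python. The helper name `inverse_frequency_weights` and the weighting formula N / (num_labels * count) are assumptions for illustration, not something the skill prescribes.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_labels):
    """Weight each class by N / (num_labels * count) (hypothetical helper)."""
    counts = Counter(labels)
    total = len(labels)
    # Classes absent from the data get weight 0.0 to avoid division by zero.
    return [total / (num_labels * counts[c]) if counts[c] else 0.0 for c in range(num_labels)]


weights = inverse_frequency_weights([0, 0, 0, 1], num_labels=2)
print(weights)
```

The resulting list could then be converted with `torch.tensor(weights)` before being passed to `CrossEntropyLoss`; rare classes receive larger weights.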

Custom Callbacks

Monitor and control training:

from transformers import TrainerCallback

class CustomCallback(TrainerCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        print(f"Epoch {state.epoch} completed")
        # Custom logic here
        return control

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[CustomCallback],
)

Advanced Training Techniques

Parameter-Efficient Fine-Tuning (PEFT)

Use LoRA for efficient fine-tuning:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query", "value"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Shows reduced parameter count

# Train normally with Trainer
trainer = Trainer(model=model, args=training_args, ...)
trainer.train()

Gradient Checkpointing

Reduce memory at cost of speed:

model.gradient_checkpointing_enable()

training_args = TrainingArguments(
    gradient_checkpointing=True,
    ...
)

Mixed Precision Training

training_args = TrainingArguments(
    fp16=True,    # For NVIDIA GPUs with Tensor Cores
    # bf16=True,  # Alternative for newer GPUs (A100, H100); enable only one of fp16/bf16
    ...
)

DeepSpeed Integration

For very large models:

# ds_config.json
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-5
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  }
}

training_args = TrainingArguments(
    deepspeed="ds_config.json",
    ...
)

Training Tips

Hyperparameter Tuning

Common starting points:

Learning rate: 2e-5 to 5e-5 for BERT-like models, 1e-4 to 1e-3 for smaller models
Batch size: 8-32 depending on GPU memory
Epochs: 2-4 for fine-tuning, more for domain adaptation
Warmup: 10% of total steps

Use Optuna for hyperparameter search:

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=5
    )

def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
    }

trainer = Trainer(model_init=model_init, args=training_args, ...)
best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    hp_space=optuna_hp_space,
    n_trials=10,
)

Monitoring Training

Use TensorBoard:

tensorboard --logdir ./logs

Or Weights & Biases:

import wandb
wandb.init(project="my-project")

training_args = TrainingArguments(
    report_to=["wandb"],
    ...
)

Resume Training

Resume from checkpoint:

trainer.train(resume_from_checkpoint="./results/checkpoint-1000")

Common Issues

CUDA out of memory:

Reduce batch size
Enable gradient checkpointing
Use gradient accumulation
Use 8-bit optimizers

Overfitting:

Increase weight_decay
Add dropout
Use early stopping
Reduce model size or training epochs
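
The early-stopping idea can be sketched as a patience counter over the evaluation metric: stop once the metric has failed to improve for a set number of consecutive evaluations. The function `should_stop` is a hypothetical helper in plain Python that illustrates the logic only; with Trainer, the library's `EarlyStoppingCallback` is the usual route.

```python
def should_stop(metric_history, patience=2, greater_is_better=True):
    """Return True once the metric fails to improve for `patience` evaluations in a row."""
    best = None
    bad_evals = 0
    for value in metric_history:
        improved = best is None or (value > best if greater_is_better else value < best)
        if improved:
            best, bad_evals = value, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return True
    return False


stalled = should_stop([0.70, 0.74, 0.73, 0.72])
recovered = should_stop([0.70, 0.74, 0.73, 0.75])
print(stalled, recovered)
```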

Slow training:

Increase batch size
Enable mixed precision (fp16/bf16)
Use multiple GPUs
Optimize data loading

Best Practices

1. Start small: Test on small dataset subset first
2. Use evaluation: Monitor validation metrics
3. Save checkpoints: Enable save_strategy
4. Log extensively: Use TensorBoard or W&B
5. Try different learning rates: Start with 2e-5
6. Use warmup: Helps training stability
7. Enable mixed precision: Faster training
8. Consider PEFT: For large models with limited resources

How it compares

Choose this skill for local Hugging Face generate() control in Python agents rather than hosted LLM API integration guides.

FAQ

What is the difference between Pipeline API and model.generate()?

The Transformers skill notes that the Pipeline API wraps tokenization and generate() for quick prototyping. Direct model.generate() with AutoModelForCausalLM and AutoTokenizer gives developers custom preprocessing and decoding control in agent workflows.

Which Hugging Face classes does the Transformers skill use?

The Transformers skill uses AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained, tokenizes inputs with return_tensors pt, and calls model.generate with parameters like max_new_tokens for controllable output.

Is Transformers safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
