
Sentencepiece
Train and choose BPE vs Unigram SentencePiece tokenizers with correct tradeoffs before fine-tuning or serving a custom vocabulary.
Overview
sentencepiece is an agent skill for the Build phase that explains BPE vs Unigram tokenization and how to train SentencePiece models for LLM pipelines.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill sentencepiece
What is this skill?
- Compares BPE (merge-by-frequency) vs Unigram (probabilistic pruning) with a concrete corpus walkthrough
- Includes a SentencePieceTrainer snippet for BPE with vocab_size control
- Explains deterministic BPE splits vs Unigram sampling and subword regularization behavior
- Covers when compression and training speed favor BPE vs when probabilistic tokenization helps
- Grounds choices in implementation via the sentencepiece Python API
- Example BPE trainer snippet uses vocab_size=16000
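The merge-by-frequency idea in the first bullet can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SentencePiece implementation: the helper names (`count_pairs`, `merge_pair`, `learn_bpe`) are hypothetical, the toy corpus is the one from the SKILL.md walkthrough, and ties are broken by whichever pair is seen first.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn `num_merges` merges from a {word: frequency} corpus."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Assumption: ties go to the pair encountered first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = learn_bpe(corpus, num_merges=2)
print(merges)
```

Raising `num_merges` is the toy equivalent of raising `vocab_size`: each merge adds one symbol to the vocabulary.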
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are about to fine-tune or train a model but do not know which tokenizer algorithm, vocabulary size, or regularization strategy will match your corpus and inference constraints.
Who is it for?
Indie builders preparing custom or domain corpora before Hugging Face–style training runs that depend on .model tokenizer files.
Skip if: you only consume a fixed tokenizer from an upstream checkpoint and never retrain the vocabulary; use the hub tokenizer as-is.
When should I use this skill?
You need to train a tokenizer, compare BPE and Unigram, or explain subword regularization before model training.
What do I get? / Deliverables
You pick BPE or Unigram with clear tradeoffs and train a SentencePiece model file your training and serving stack can load consistently.
- Trained SentencePiece model artifact
- Documented BPE vs Unigram choice for your corpus
- Tokenizer settings aligned with downstream training code
Journey fit
Tokenization is core model and data plumbing: builders implement it while constructing training and inference backends, not during distribution or ops triage. The Backend subphase fits because vocabulary training, subword regularization, and serialized tokenizer files are prerequisites for model code paths and dataset pipelines.
How it compares
Algorithm and training reference—not a hosted tokenization API or prompt-engineering skill.
Common Questions / FAQ
Who is sentencepiece for?
Solo ML builders who train or adapt language models and must own vocabulary design, not only call a pretrained tokenizer.
When should I use sentencepiece?
Use it during Build when creating training data pipelines, choosing BPE vs Unigram, or running SentencePieceTrainer before fine-tuning.
Is sentencepiece safe to install?
The skill is documentation-heavy; review Security Audits on this page and limit agent shell access when pointing trainers at private corpora on disk.
SKILL.md
# Tokenization Algorithms

BPE vs Unigram comparison and subword regularization.

## BPE (Byte-Pair Encoding)

### Algorithm

1. Initialize vocabulary with characters
2. Count frequency of adjacent token pairs
3. Merge most frequent pair
4. Repeat until vocabulary size reached

### Example

**Corpus**:
```
low: 5
lower: 2
newest: 6
widest: 3
```

**Iteration 1**:
- Most frequent pair: 'e' + 's' (9 times: 6 in `newest`, 3 in `widest`; 's' + 't' ties at 9)
- Merge → 'es'
- Vocabulary: [chars] + ['es']

**Iteration 2**:
- Most frequent: 'es' + 't' (9 times)
- Merge → 'est'
- Vocabulary: [chars] + ['es', 'est']

**Result** (after further merges): `newest` → `new|est`, `widest` → `wid|est`

### Implementation

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',   # writes m.model and m.vocab
    model_type='bpe',
    vocab_size=16000
)
```

### Advantages

- Simple algorithm
- Fast training
- Good compression ratio

### Disadvantages

- Deterministic by default (sampling is available only through BPE-dropout)
- May split common words unexpectedly

## Unigram

### Algorithm

1. Start with large vocabulary (all substrings)
2. Compute probability of each token
3. Remove tokens with minimal loss impact
4. Repeat until vocabulary size reached

### Probabilistic tokenization

Given vocabulary with probabilities:
```
P('low') = 0.02
P('est') = 0.03
P('l') = 0.01
P('o') = 0.015
...
```

Tokenize "lowest":
```
Option 1: ['low', 'est']
          P = 0.02 × 0.03 = 0.0006  ← highest

Option 2: ['l', 'o', 'w', 'est']
          P = 0.01 × 0.015 × 0.01 × 0.03 = 0.000000045

Choose option 1 (highest probability)
```

### Implementation

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',   # writes m.model and m.vocab
    model_type='unigram',
    vocab_size=8000
)
```

### Advantages

- Probabilistic (can sample)
- Better for morphologically rich languages
- Supports subword regularization

### Disadvantages

- Slower training
- More complex algorithm

## Comparison

| Feature | BPE | Unigram |
|---------|-----|---------|
| Training speed | Fast | Slow |
| Tokenization | Deterministic by default | Probabilistic |
| Sampling | Only via BPE-dropout | Yes |
| Typical vocab size | 16k-32k | 8k-32k |
| Used by | mBART | T5, ALBERT, XLNet |

## Subword regularization

Sample different tokenizations during training for robustness.

### Enable sampling

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Sample different tokenizations
for _ in range(5):
    # nbest_size=-1 samples from all segmentation hypotheses (Unigram models)
    pieces = sp.encode('tokenization', out_type=str,
                       enable_sampling=True, alpha=0.1, nbest_size=-1)
    print(pieces)

# Illustrative output (different each time):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
# ['▁token', 'iz', 'ation']
# ['▁to', 'ken', 'ization']
# ['▁token', 'ization']
```

### Parameters

- `alpha`: Regularization strength; its meaning depends on the model type
  - Unigram: smoothing exponent applied to segmentation probabilities. Values near 0.0 flatten the distribution (most variation); larger values concentrate samples on the most probable segmentation.
  - BPE: merge dropout probability (BPE-dropout). 0.0 = deterministic (no sampling), 0.1 = slight variation, 0.5 = high variation, 1.0 = maximum variation.

### Benefits

1. **Robustness**: Model learns multiple tokenizations
2. **Data augmentation**: More diverse training data
3. **Better generalization**: Less overfitting to specific tokenization

### Use case

```python
# Training loop with regularization (`dataloader` is assumed to yield dicts with a 'text' field)
for batch in dataloader:
    # Sample different tokenizations each epoch
    tokens = sp.encode(batch['text'], enable_sampling=True, alpha=0.1, nbest_size=-1)
    # Train model...
```

**Used by**: mT5, XLM-RoBERTa

## NBest encoding

Get multiple tokenization candidates with scores.
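To make the idea concrete without a trained model, here is a minimal sketch that ranks every segmentation of `lowest` by log probability. It is an illustration only: the helper names (`segmentations`, `nbest`) are hypothetical, the probabilities are the toy values from the Unigram example, and `P('w') = 0.01` is taken from the Option 2 product there. It enumerates exhaustively, which is only practical for toy inputs.

```python
import math

# Toy vocabulary from the Unigram example; not the output of a real trainer.
VOCAB = {"low": 0.02, "est": 0.03, "l": 0.01, "o": 0.015, "w": 0.01}


def segmentations(text, vocab):
    """Yield every way to split `text` into pieces that exist in `vocab`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                yield [piece] + rest


def nbest(text, vocab, n):
    """Return the top-n segmentations with their summed log probabilities."""
    scored = [
        (pieces, sum(math.log(vocab[p]) for p in pieces))
        for pieces in segmentations(text, vocab)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


results = nbest("lowest", VOCAB, n=5)
for pieces, score in results:
    print(f"{pieces} (log prob: {score:.4f})")
```

Only two segmentations exist under this toy vocabulary, so fewer than five candidates come back. On a trained model, the library exposes the same kind of ranking directly: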
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Get top-5 tokenizations.
# out_type=str returns pieces only; scores come from the proto output type
# (out_type='immutable_proto', available in recent sentencepiece releases).
nbest = sp.nbest_encode('tokenization', nbest_size=5, out_type='immutable_proto')

for candidate in nbest.nbests:
    pieces = [p.piece for p in candidate.pieces]
    print(f"{pieces} (log prob: {candidate.score:.4f})")

# Illustrative output:
# ['▁token', 'ization'] (log prob: -2.34)
# ['▁tok', 'en', 'ization'] (log prob: -2.41)
# ['▁token', 'iz', 'ation'] (log prob: -2.57)
```

### Use cases

1. **Ensemble tokenization**: Average over multiple tokenizations
2. **Uncertainty estimation**: Check variance in scores
3. **Debugging**: Understand tokenizer behavior

## Best practices

1. **Use Unigram for multilingual** - Better for diverse languages
2. **Use BPE for speed** - Faster training and inference
3. **