
Nemo Curator
Deduplicate large text corpora before fine-tuning or RAG so solo builders do not train on or index redundant documents.
Overview
nemo-curator is an agent skill, most often used in the Build phase (and also in Operate), that teaches exact, fuzzy, and semantic text deduplication with NeMo Curator Python modules.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill nemo-curator
What is this skill?
- Exact deduplication on id/text fields with MD5 or SHA256 hashing
- Fuzzy near-duplicate removal via MinHash + LSH with tunable Jaccard threshold (default 0.8)
- Semantic deduplication using sentence-transformers embeddings and cosine similarity
- Documented GPU speedups (~16× exact dedup vs CPU; fuzzy pass 120h → 7.5h on 8TB-scale workloads)
- Configurable fuzzy parameters: num_hashes 128–512 (default 260), num_buckets 10–50 (default 20)
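To make the `num_hashes` / `num_buckets` settings above more concrete, here is a minimal sketch of the standard MinHash + LSH banding formula. It assumes the hashes are split evenly across buckets (so the defaults give 260 / 20 = 13 hashes per bucket); `candidate_probability` is a hypothetical helper for illustration, not part of NeMo Curator.

```python
def candidate_probability(jaccard: float, num_hashes: int = 260, num_buckets: int = 20) -> float:
    """Chance that two documents with the given Jaccard similarity share
    at least one LSH bucket and are therefore compared as candidates.

    Assumption: hashes are divided evenly, rows = num_hashes // num_buckets.
    """
    rows = num_hashes // num_buckets
    # A single bucket matches only if all of its `rows` hashes agree.
    bucket_match = jaccard ** rows
    # The pair is a candidate if any one of the buckets matches.
    return 1.0 - (1.0 - bucket_match) ** num_buckets


# Print the S-curve around the default 0.8 threshold.
for s in (0.5, 0.7, 0.8, 0.9):
    print(f"jaccard={s:.1f} -> candidate probability {candidate_probability(s):.3f}")
```

Under this assumption, pairs well below the threshold are almost never compared, while pairs near 0.9 are almost always caught. Keeping `num_hashes` fixed and raising `num_buckets` shortens each bucket, which raises the candidate probability at the cost of more pairs to verify.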
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a massive text dataset full of identical, near-duplicate, and paraphrased documents that will waste training time and pollute RAG answers.
Who is it for?
Solo builders curating LLM training data or RAG knowledge bases on GPU-backed machines who want library-native dedup stages.
Skip if: you only need a few hundred pages in a vector DB without batch hygiene, or your team has no Python data environment.
When should I use this skill?
You are cleaning or merging large text datasets and need exact, fuzzy, or semantic duplicate removal with NeMo Curator.
What do I get? / Deliverables
You get a deduplicated dataset with chosen hash or embedding thresholds and a clear module setup to plug into your curation pipeline.
- Deduplicated dataset object from ExactDuplicates, FuzzyDuplicates, or SemanticDuplicates
- Tuned jaccard or cosine thresholds and hash/bucket settings
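As an illustration of what a cosine threshold does, the following sketch greedily keeps a document only if it is not too similar to any document already kept. It uses hand-written toy vectors instead of real sentence-transformers embeddings, and `cosine` and `dedup_by_cosine` are hypothetical helpers rather than NeMo Curator functions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dedup_by_cosine(embeddings: dict[str, list[float]], threshold: float = 0.85) -> list[str]:
    """Return ids to keep; a document is dropped when its similarity to
    any already-kept document reaches the threshold."""
    kept: list[str] = []
    for doc_id, vec in embeddings.items():
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(doc_id)
    return kept


# Toy embeddings (assumed values): "b" is a near-paraphrase of "a".
toy = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.98, 0.2, 0.0],
    "c": [0.0, 1.0, 0.0],
}
kept = dedup_by_cosine(toy, threshold=0.85)
print(kept)  # ['a', 'c']
```

This pairwise loop is quadratic in the number of documents, so it only serves to show the effect of the threshold. Raising the threshold keeps more borderline pairs; lowering it removes more aggressively.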
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Dataset curation sits in Build when assembling ML/RAG backends, with repeat runs in Operate as corpora grow. Backend subphase covers batch data processing pipelines rather than UI or agent prompt design.
Where it fits
Run fuzzy dedup on scraped docs before embedding them into a RAG store.
Wire ExactDuplicates into an ingestion job that lands JSONL from multiple sources.
Re-run semantic dedup after weekly content pulls to keep retrieval quality stable.
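The ingestion scenario above can be sketched without any GPU library to show the idea behind exact deduplication: hash each record's text and keep the first record per digest. This is a standard-library illustration only; `exact_dedup_jsonl` and the sample records are hypothetical and do not reflect how NeMo Curator implements `ExactDuplicates`.

```python
import hashlib
import json
from typing import Iterable


def exact_dedup_jsonl(lines: Iterable[str], text_field: str = "text",
                      hash_method: str = "md5") -> list[dict]:
    """Keep the first record for each distinct text hash."""
    seen: set[str] = set()
    kept: list[dict] = []
    for line in lines:
        record = json.loads(line)
        digest = hashlib.new(hash_method, record[text_field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept


# Records as they might arrive from two sources (made-up sample data).
sample_lines = [
    '{"id": 1, "text": "hello world"}',
    '{"id": 2, "text": "hello world"}',
    '{"id": 3, "text": "Hello world"}',
]
unique_records = exact_dedup_jsonl(sample_lines)
print([r["id"] for r in unique_records])  # [1, 3]
```

Note that exact hashing is case- and whitespace-sensitive, so record 3 survives. Catching that kind of near-duplicate is what the fuzzy and semantic passes are for.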
How it compares
Use for governed corpus cleaning instead of ad-hoc regex dedupe in notebook cells.
Common Questions / FAQ
Who is nemo-curator for?
Indie ML and agent builders preprocessing text corpora with NeMo Curator who need exact, fuzzy, and semantic duplicate removal.
When should I use nemo-curator?
During Build when assembling backend datasets for fine-tuning or RAG, and in Operate when refreshing production corpora after new crawls.
Is nemo-curator safe to install?
Review the Security Audits panel on this Prism page and inspect the skill source before running GPU jobs on sensitive corpora.
SKILL.md
# Deduplication Guide

Complete guide to exact, fuzzy, and semantic deduplication.

## Exact deduplication

Remove documents with identical content.

```python
from nemo_curator.modules import ExactDuplicates

# Exact deduplication
exact_dedup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5"  # or "sha256"
)

deduped = exact_dedup(dataset)
```

**Performance**: ~16× faster on GPU vs CPU

## Fuzzy deduplication

Remove near-duplicate documents using MinHash + LSH.

```python
from nemo_curator.modules import FuzzyDuplicates

fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,        # MinHash permutations (more = more accurate)
    num_buckets=20,        # LSH buckets (bands); with num_hashes fixed, more buckets = higher recall, more candidate pairs
    hash_method="md5",
    jaccard_threshold=0.8  # Similarity threshold
)

deduped = fuzzy_dedup(dataset)
```

**Parameters**:

- `num_hashes`: 128-512 (default 260)
- `num_buckets`: 10-50 (default 20)
- `jaccard_threshold`: 0.7-0.9 (default 0.8)

**Performance**: 16× faster on 8TB dataset (120h → 7.5h)

## Semantic deduplication

Remove semantically similar documents using embeddings.

```python
from nemo_curator.modules import SemanticDuplicates

semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_batch_size=256,
    threshold=0.85,  # Cosine similarity threshold
    device="cuda"
)

deduped = semantic_dedup(dataset)
```

**Models**:

- `all-MiniLM-L6-v2`: Fast, 384 dims
- `all-mpnet-base-v2`: Better quality, 768 dims
- Custom models supported

## Comparison

| Method | Speed | Recall | Use Case |
|--------|-------|--------|----------|
| Exact | Fastest | 100% | Exact matches only |
| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
| Semantic | Slow | ~90% | Paraphrases, rewrites |

## Best practices

1. **Start with exact dedup** - Remove obvious duplicates
2. **Use fuzzy for large datasets** - Best speed/quality trade-off
3. **Semantic for high-value data** - Expensive but thorough
4. **GPU acceleration required** - 10-16× speedup

# Quality Filtering Guide

Complete guide to NeMo Curator's 30+ quality filters.

## Text-based filters

### Word count

```python
from nemo_curator.filters import WordCountFilter

# Filter by word count
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))
```

### Repeated content

```python
from nemo_curator.filters import RepeatedLinesFilter

# Remove documents with >30% repeated lines
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))
```

### Symbol ratio

```python
from nemo_curator.filters import SymbolToWordRatioFilter

# Remove documents with too many symbols
dataset = dataset.filter(SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3))
```

### URL ratio

```python
from nemo_curator.filters import UrlRatioFilter

# Remove documents with many URLs
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
```

## Language filtering

```python
from nemo_curator.filters import LanguageIdentificationFilter

# Keep only English documents
dataset = dataset.filter(LanguageIdentificationFilter(target_languages=["en"]))

# Multiple languages
dataset = dataset.filter(LanguageIdentificationFilter(target_languages=["en", "es", "fr"]))
```

## Classifier-based filtering

### Quality classifier

```python
from nemo_curator.classifiers import QualityClassifier

quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda"
)

# Keep high-quality documents (score > 0.5 = high quality)
dataset = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```

### NSFW classifier

```python
from nemo_curator.classifiers import NSFWClassifier

nsfw_clf = NSFWClassifier(threshold=0.9, device="cuda")

# Remove NSFW content
dataset = dataset.filter(lambda doc: nsfw_clf(doc["text"]) < 0.9)
```

## Heuristic filters

Full list of 30+ filters