
Kreuzberg
Extend Kreuzberg’s document extraction pipeline with custom post-processors, validators, and OCR plugins for RAG or agent ingestion workflows.
Overview
kreuzberg is an agent skill for the Build phase (also useful in Ship review contexts). It documents the Kreuzberg plugin APIs for custom post-processors, validators, and OCR extensions in document extraction pipelines.
Install
npx skills add https://github.com/kreuzberg-dev/kreuzberg --skill kreuzberg
What is this skill?
- Plugin system for post-processors, validators, and OCR backends inside the extraction pipeline
- Non-destructive post-processors that log failures without breaking successful extractions
- Configurable processing_stage values: early, middle, and late
- Python register_post_processor flow with initialize and shutdown lifecycle hooks
- Metadata enrichment patterns such as char_count and processed_by on ExtractionResult
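The stage ordering and non-destructive failure handling listed above can be modelled in plain Python. This is a toy sketch, not Kreuzberg's implementation: `run_post_processors`, `STAGE_ORDER`, and the dict-based result are hypothetical names invented for illustration, and it assumes that "early" runs before "middle", which runs before "late". Only the method names `name`, `process`, and `processing_stage` follow the interface shown in SKILL.md.

```python
import logging

logger = logging.getLogger("pipeline_sketch")

# Assumed ordering: "early" runs before "middle", which runs before "late".
STAGE_ORDER = {"early": 0, "middle": 1, "late": 2}


def run_post_processors(result: dict, processors: list) -> dict:
    """Apply processors in stage order; log failures instead of raising."""
    ordered = sorted(processors, key=lambda p: STAGE_ORDER[p.processing_stage()])
    for processor in ordered:
        try:
            result = processor.process(result)
        except Exception:
            # Non-destructive: a failing processor is logged and skipped.
            logger.exception("post-processor %s failed", processor.name())
    return result


class CharCounter:
    """Hypothetical late-stage processor that records the content length."""

    def name(self) -> str:
        return "char_counter"

    def processing_stage(self) -> str:
        return "late"

    def process(self, result: dict) -> dict:
        result["metadata"]["char_count"] = len(result["content"])
        return result


class Broken:
    """Hypothetical early-stage processor that always fails."""

    def name(self) -> str:
        return "broken"

    def processing_stage(self) -> str:
        return "early"

    def process(self, result: dict) -> dict:
        raise RuntimeError("simulated failure")


enriched = run_post_processors(
    {"content": "hello world", "metadata": {}},
    [CharCounter(), Broken()],
)
print(enriched["metadata"])  # {'char_count': 11}
```

The failing processor runs first and is logged, yet the later processor still enriches the result, which mirrors the "extraction succeeds anyway" behaviour the skill describes.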
Adoption & trust: 764 installs on skills.sh; 8.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agent can extract text from files but cannot enforce custom metadata, validation, or OCR backends your product requires.
Who is it for?
Solo builders shipping document Q&A, compliance scanning, or knowledge-base ingestion on top of Kreuzberg.
Skip if: your project only needs a single generic extract_file call with zero customization, or your team does not use Kreuzberg at all.
When should I use this skill?
You are customizing Kreuzberg with plugins, post-processors, validators, OCR backends, or semantic integration hooks.
What do I get? / Deliverables
After applying the skill, your agent can register Kreuzberg post-processors and run synchronous extractions that return enriched, validator-aware ExtractionResult objects.
- Registered post-processor classes wired into extraction
- Enriched ExtractionResult metadata and validation behavior
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Build is the primary home because the skill documents how to register and run customization inside Kreuzberg’s extraction pipeline for product features. Agent-tooling matches plugin registration, processing stages, and enrichment hooks that power document-aware agents.
Where it fits
Register a metadata enricher that stamps char_count before chunks land in your vector store.
Add a late-stage validator plugin to catch empty OCR output before release.
Enrich help-center PDF extractions with processed_by tags for support analytics.
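A validator that rejects empty OCR output, as in the second scenario above, might look roughly like this. To keep the sketch runnable without Kreuzberg installed, `FakeResult` and the local `ValidationError` are stand-ins (assumptions) for the library's `ExtractionResult` and `ValidationError`, and `check` is a hypothetical helper that imitates how a pipeline might call the validator. The method names follow the validator interface shown in SKILL.md; the image-only rule in `should_validate` is an illustrative choice, and registration is omitted.

```python
from dataclasses import dataclass, field


class ValidationError(Exception):
    """Stand-in for kreuzberg's ValidationError (assumption for this sketch)."""


@dataclass
class FakeResult:
    """Stand-in for ExtractionResult with only the fields used here."""
    content: str
    mime_type: str
    metadata: dict = field(default_factory=dict)


class NonEmptyOcrValidator:
    def name(self) -> str:
        return "non_empty_ocr_validator"

    def should_validate(self, result: FakeResult) -> bool:
        # Hypothetical rule: only check image inputs, which go through OCR.
        return result.mime_type.lower().startswith("image/")

    def validate(self, result: FakeResult) -> None:
        # Whitespace-only output counts as empty.
        if not result.content.strip():
            raise ValidationError("OCR produced no text")

    def priority(self) -> int:
        return 100


def check(validator, result) -> bool:
    """Return True if the result passes or is skipped, False if rejected."""
    if not validator.should_validate(result):
        return True
    try:
        validator.validate(result)
    except ValidationError:
        return False
    return True


validator = NonEmptyOcrValidator()
blank_scan = FakeResult(content="  \n", mime_type="image/png")
good_scan = FakeResult(content="Invoice 42", mime_type="image/png")
print(check(validator, blank_scan), check(validator, good_scan))  # False True
```

With the real library, the same class would be passed to `register_validator`, and a raised `ValidationError` would fail the extraction instead of returning `False`.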
How it compares
A skill package for extending an open-source extractor, not a hosted OCR SaaS or a raw LangChain loader tutorial.
Common Questions / FAQ
Who is kreuzberg for?
Indie developers and small teams building AI products that ingest PDFs and office documents via Kreuzberg and need agent-guided plugin customization.
When should I use kreuzberg?
Use it in Build when wiring agent-tooling ingestion, in Ship when hardening extraction quality before launch, or in Grow when enriching support or content pipelines with semantic metadata.
Is kreuzberg safe to install?
Check the Security Audits panel on this Prism page for hash and audit signals before letting agents register plugins that read local or uploaded documents.
SKILL.md
# Advanced Features Reference

Kreuzberg provides powerful advanced features for customization, semantic processing, and integration with external systems.

## Plugin System

The plugin system allows you to extend Kreuzberg's extraction pipeline with custom post-processors, validators, and OCR backends. Plugins run within the extraction pipeline and have direct access to extraction results.

### Custom Post-Processors

Post-processors enrich extraction results after document parsing. They run non-destructively: if a post-processor fails, the extraction still succeeds (errors are logged).

=== "Python"

    ```python
    from kreuzberg import ExtractionResult, extract_file_sync, register_post_processor


    class MetadataEnricher:
        def name(self) -> str:
            return "metadata_enricher"

        def process(self, result: ExtractionResult) -> ExtractionResult:
            # Stamp provenance and size metadata onto the result.
            result.metadata["processed_by"] = "metadata_enricher"
            result.metadata["char_count"] = len(result.content)
            return result

        def processing_stage(self) -> str:
            # "early", "middle", or "late"
            return "middle"

        def initialize(self) -> None:
            print("Initializing metadata enricher")

        def shutdown(self) -> None:
            print("Shutting down metadata enricher")


    register_post_processor(MetadataEnricher())

    # Now use extraction with the registered processor
    result = extract_file_sync("document.pdf")
    print(result.metadata["char_count"])
    ```

=== "TypeScript"

    ```typescript
    import { extractFile, registerPostProcessor, ExtractionResult } from '@kreuzberg/node';

    const enricher = {
      name(): string {
        return "metadata_enricher";
      },
      async process(result: ExtractionResult): Promise<ExtractionResult> {
        result.metadata.processed_by = "metadata_enricher";
        result.metadata.char_count = result.content.length;
        return result;
      },
      // Optional: "early", "middle", or "late"
      processingStage(): "early" | "middle" | "late" {
        return "middle";
      },
      // Optional lifecycle hooks
      async initialize(): Promise<void> {
        console.log("Initializing metadata enricher");
      },
      async shutdown(): Promise<void> {
        console.log("Shutting down metadata enricher");
      }
    };

    registerPostProcessor(enricher);

    // Now use extraction with the registered processor
    const result = await extractFile("document.pdf");
    console.log(result.metadata.char_count);
    ```

### Custom Validators

Validators perform quality checks on extraction results. Unlike post-processors, validator failures cause the entire extraction to fail. Use validators to enforce quality standards.

=== "Python"

    ```python
    from kreuzberg import ExtractionResult, ValidationError, extract_file_sync, register_validator


    class MinimumContentValidator:
        def name(self) -> str:
            return "min_content_validator"

        def validate(self, result: ExtractionResult) -> None:
            if len(result.content) < 100:
                raise ValidationError("Extracted content too short (< 100 chars)")

        def priority(self) -> int:
            # Higher priority runs first (0-1000, default 50)
            return 100

        def should_validate(self, result: ExtractionResult) -> bool:
            # Only validate PDFs
            return "pdf" in result.mime_type.lower()

        def initialize(self) -> None:
            pass

        def shutdown(self) -> None:
            pass


    register_validator(MinimumContentValidator())

    # Extraction will fail if content < 100 chars
    result = extract_file_sync("document.pdf")
    ```

=== "TypeScript"

    ```typescript
    import { registerValidator, ExtractionResult } from '@kreuzberg/node';

    const validator = {
      name(): string {
        return "min_content_validator";
      },
      async validate(result: ExtractionResult): Promise<void> {