Kreuzberg

Build is the primary phase for this skill because it documents how to register and run custom plugins inside Kreuzberg’s extraction pipeline for product features. The agent-tooling category fits its coverage of plugin registration, processing stages, and enrichment hooks that power document-aware agents.

Where it fits

Example use

Add a late-stage validator plugin to catch empty OCR output before release.

Example use

Enrich help-center PDF extractions with processed_by tags for support analytics.

How it compares

A skill package for extending an open-source extractor, not a hosted OCR SaaS or a raw LangChain loader tutorial.

Common Questions / FAQ

Who is kreuzberg for?

Indie developers and small teams building AI products that ingest PDFs and office documents via Kreuzberg and need agent-guided plugin customization.

When should I use kreuzberg?

Use it in Build when wiring agent-tooling ingestion, in Ship when hardening extraction quality before launch, or in Grow when enriching support or content pipelines with semantic metadata.

Is kreuzberg safe to install?

Check the Security Audits panel on this Prism page for hash and audit signals before letting agents register plugins that read local or uploaded documents.

SKILL.md

# Advanced Features Reference

Kreuzberg provides powerful advanced features for customization, semantic processing, and integration with external systems.

## Plugin System

The plugin system allows you to extend Kreuzberg's extraction pipeline with custom post-processors, validators, and OCR backends. Plugins run within the extraction pipeline and have direct access to extraction results.

### Custom Post-Processors

Post-processors enrich extraction results after document parsing. They run non-destructively: if a post-processor fails, the error is logged and the extraction still succeeds.

=== "Python"

    ```python
    from kreuzberg import register_post_processor, ExtractionResult

    class MetadataEnricher:
        def name(self) -> str:
            return "metadata_enricher"

        def process(self, result: ExtractionResult) -> ExtractionResult:
            result.metadata["processed_by"] = "metadata_enricher"
            result.metadata["char_count"] = len(result.content)
            return result

        def processing_stage(self) -> str:
            # "early", "middle", or "late"
            return "middle"

        def initialize(self) -> None:
            print("Initializing metadata enricher")

        def shutdown(self) -> None:
            print("Shutting down metadata enricher")

    register_post_processor(MetadataEnricher())

    # Now use extraction with the registered processor
    from kreuzberg import extract_file_sync
    result = extract_file_sync("document.pdf")
    print(result.metadata["char_count"])
    ```

=== "TypeScript"

    ```typescript
    import { registerPostProcessor, extractFile, ExtractionResult } from '@kreuzberg/node';

    const enricher = {
        name(): string {
            return "metadata_enricher";
        },

        async process(result: ExtractionResult): Promise<ExtractionResult> {
            result.metadata.processed_by = "metadata_enricher";
            result.metadata.char_count = result.content.length;
            return result;
        },

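        // Optional hook: object literals cannot mark members with `?`, so it is simply defined here
        processingStage(): "early" | "middle" | "late" {
            return "middle";
        },

        // Optional lifecycle hook
        async initialize(): Promise<void> {
            console.log("Initializing metadata enricher");
        },

        // Optional lifecycle hook
        async shutdown(): Promise<void> {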
            console.log("Shutting down metadata enricher");
        }
    };

    registerPostProcessor(enricher);

    // Now use extraction with the registered processor
    const result = await extractFile("document.pdf");
    console.log(result.metadata.char_count);
    ```
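The fail-soft behaviour and the three processing stages can be illustrated without Kreuzberg installed. The following is a minimal sketch, not Kreuzberg's implementation: `FakeResult`, `run_post_processors`, and the three processor classes are hypothetical stand-ins, and the assumption that stages run in the order early, middle, late is inferred from their names rather than confirmed by the text above.

```python
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline_sketch")

# Hypothetical stand-in for an extraction result (not kreuzberg's class).
@dataclass
class FakeResult:
    content: str
    metadata: dict = field(default_factory=dict)

# Assumption: stages run in this order.
STAGE_ORDER = {"early": 0, "middle": 1, "late": 2}

class CharCounter:
    def name(self): return "char_counter"
    def processing_stage(self): return "early"
    def process(self, result):
        result.metadata["char_count"] = len(result.content)
        return result

class Broken:
    def name(self): return "broken"
    def processing_stage(self): return "middle"
    def process(self, result):
        raise RuntimeError("simulated failure")

class Tagger:
    def name(self): return "tagger"
    def processing_stage(self): return "late"
    def process(self, result):
        result.metadata["processed_by"] = "tagger"
        return result

def run_post_processors(result, processors):
    """Run processors in stage order; log failures instead of raising."""
    failed = []
    ordered = sorted(processors, key=lambda p: STAGE_ORDER[p.processing_stage()])
    for proc in ordered:
        try:
            result = proc.process(result)
        except Exception as exc:
            log.warning("post-processor %s failed: %s", proc.name(), exc)
            failed.append(proc.name())
    return result, failed

# Registration order is deliberately shuffled; stage order decides execution.
result, failed = run_post_processors(
    FakeResult(content="hello world"),
    [Tagger(), Broken(), CharCounter()],
)
print(result.metadata)
print(failed)
```

In this sketch the failing middle-stage processor is logged and skipped, while the early and late processors still contribute their metadata, which is the behaviour the section describes for post-processors.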

### Custom Validators

Validators perform quality checks on extraction results. Unlike post-processors, validator failures cause the entire extraction to fail. Use validators to enforce quality standards.
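To make the contrast with post-processors concrete, here is a small standalone sketch of fail-hard validation. It does not use Kreuzberg itself: `FakeResult`, `SketchValidationError`, `run_validators`, and both validator classes are hypothetical, and the runner assumes that validators are sorted by descending `priority()`, skipped when `should_validate()` returns false, and that the first raised error aborts the run.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for illustration; not kreuzberg's real classes.
class SketchValidationError(Exception):
    pass

@dataclass
class FakeResult:
    content: str
    mime_type: str

class MinContent:
    def name(self): return "min_content"
    def priority(self): return 100
    def should_validate(self, result):
        # Only applies to PDFs
        return "pdf" in result.mime_type.lower()
    def validate(self, result):
        if len(result.content) < 100:
            raise SketchValidationError("Extracted content too short (< 100 chars)")

class NonEmpty:
    def name(self): return "non_empty"
    def priority(self): return 50
    def should_validate(self, result):
        return True
    def validate(self, result):
        if not result.content.strip():
            raise SketchValidationError("Extracted content is empty")

def run_validators(result, validators):
    """Run applicable validators, highest priority first; errors propagate."""
    ran = []
    for v in sorted(validators, key=lambda v: v.priority(), reverse=True):
        if not v.should_validate(result):
            continue
        ran.append(v.name())
        v.validate(result)  # no try/except: a failure fails the whole run
    return ran

validators = [NonEmpty(), MinContent()]

# A plain-text result skips the PDF-only validator and passes.
text_ran = run_validators(FakeResult("short note", "text/plain"), validators)

# The same content as a PDF trips the minimum-length check.
try:
    run_validators(FakeResult("short note", "application/pdf"), validators)
    pdf_error = None
except SketchValidationError as exc:
    pdf_error = str(exc)

print(text_ran)
print(pdf_error)
```

The only structural difference from a fail-soft runner is the missing `try`/`except` around `validate`, which is why a validator is the right place for checks that must block a bad extraction.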

=== "Python"

    ```python
    from kreuzberg import register_validator, extract_file_sync, ExtractionResult, ValidationError

    class MinimumContentValidator:
        def name(self) -> str:
            return "min_content_validator"

        def validate(self, result: ExtractionResult) -> None:
            if len(result.content) < 100:
                raise ValidationError("Extracted content too short (< 100 chars)")

        def priority(self) -> int:
            # Higher priority runs first (0-1000, default 50)
            return 100

        def should_validate(self, result: ExtractionResult) -> bool:
            # Only validate PDFs
            return "pdf" in result.mime_type.lower()

        def initialize(self) -> None:
            pass

        def shutdown(self) -> None:
            pass

    register_validator(MinimumContentValidator())

    # Extraction will fail if content < 100 chars
    result = extract_file_sync("document.pdf")
    ```

=== "TypeScript"

    ```typescript
    import { registerValidator, ExtractionResult } from '@kreuzberg/node';

    const validator = {
        name(): string {
            return "min_content_validator";
        },

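        async validate(result: ExtractionResult): Promise<void> {
            // Assumed to mirror the Python validator above; the original listing is cut off at this point
            if (result.content.length < 100) {
                throw new Error("Extracted content too short (< 100 chars)");
            }
        }
    };
    registerValidator(validator);
    ```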

What is this skill?

Plugin system for post-processors, validators, and OCR backends inside the extraction pipeline

Non-destructive post-processors that log failures without breaking successful extractions

Configurable processing_stage values: early, middle, and late

Python register_post_processor flow with initialize and shutdown lifecycle hooks

Metadata enrichment patterns such as char_count and processed_by on ExtractionResult

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 764 installs on skills.sh; 8.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
