
Regex vs LLM Structured Text
Design hybrid text parsers that use regex for repeating structure and call an LLM only on low-confidence rows, cutting cost and keeping output deterministic.
Overview
regex-vs-llm-structured-text is an agent skill most often used in Build (also Validate, Operate) that defines when to parse with regex versus LLM and how to wire a confidence-gated hybrid pipeline.
Install
npx skills add https://github.com/affaan-m/everything-claude-code --skill regex-vs-llm-structured-text
What is this skill?
- Decision tree: consistent repeating format → regex first; free-form → LLM direct
- Hybrid architecture: regex parser → text cleaner → confidence scorer → LLM validator on scores below 0.95
- Targets 95–98% regex coverage before any model call
- Explicit activation for quizzes, forms, invoices, and tables
- Cost/accuracy tradeoff guidance for solo builders running extraction at scale
- 95–98% accuracy target for regex-first extraction
- 0.95 confidence threshold example for direct output vs LLM validator
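The decision tree in the list above can be sketched as a coverage measurement followed by a three-way choice. This is an illustrative sketch, not the skill's own code: `BLOCK_PATTERN`, `regex_coverage`, and `choose_strategy` are hypothetical names, the pattern assumes a quiz-style block layout, and a single coverage number stands in for both "how consistent is the format" and "how much does regex handle".

```python
import re

# Hypothetical pattern for one quiz-style block:
# "N. question", one or more "A.".."D." choices, then "Answer: X".
BLOCK_PATTERN = re.compile(
    r"\d+\.\s+.+\n(?:[A-D]\.\s+.+\n)+Answer:\s*[A-D]"
)


def regex_coverage(blocks: list[str], pattern: re.Pattern) -> float:
    """Return the fraction of blocks that the pattern matches in full."""
    if not blocks:
        return 0.0
    matched = sum(1 for block in blocks if pattern.fullmatch(block.strip()))
    return matched / len(blocks)


def choose_strategy(coverage: float) -> str:
    """Map measured coverage onto the three branches of the decision tree."""
    if coverage >= 0.95:
        return "regex_only"               # regex handles 95%+, no LLM needed
    if coverage > 0.90:
        return "regex_plus_llm_fallback"  # consistent format, LLM for edge cases
    return "llm_direct"                   # free-form or highly variable input
```

Measuring coverage on a representative sample before writing the full parser gives an early signal about which branch applies.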
Adoption & trust: 4.6k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are paying for LLM parsing on every document even though most lines follow the same repeating pattern.
Who is it for?
Builders parsing semi-structured text with repeatable markers who need deterministic output and controlled model spend.
Skip if: the input is fully unstructured prose, a scanned PDF that needs an OCR-first pipeline, or text where no single pattern covers more than 90% of the content.
When should I use this skill?
Parsing structured text with repeating patterns, choosing regex versus LLM, or building hybrid pipelines for cost and accuracy.
What do I get? / Deliverables
You ship a regex-first extractor with a confidence scorer and LLM fallback only on edge cases, improving cost and repeatability.
- Parser architecture decision
- Regex-first pipeline with confidence gating design
- LLM fallback spec for low-confidence segments
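A minimal sketch of the confidence gate these deliverables describe, assuming each extraction already carries a score from some upstream scorer. `Extraction`, `route_extractions`, and the `validate` callable are hypothetical names introduced for illustration; the 0.95 default mirrors the threshold quoted on this page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Extraction:
    """Hypothetical record: one parsed row plus the scorer's confidence."""
    row_id: str
    fields: dict[str, str]
    confidence: float


def route_extractions(
    extractions: list[Extraction],
    validate: Callable[[Extraction], Extraction],
    threshold: float = 0.95,
) -> tuple[list[Extraction], int]:
    """Pass high-confidence rows through; send the rest to `validate`.

    Returns the final rows in input order and the number of validator calls.
    """
    results: list[Extraction] = []
    validator_calls = 0
    for extraction in extractions:
        if extraction.confidence >= threshold:
            results.append(extraction)  # direct output, no model call
        else:
            results.append(validate(extraction))  # edge case: ask the validator
            validator_calls += 1
    return results, validator_calls
```

Passing the validator in as a callable keeps the gate testable with a stub, so the model client is needed only in production wiring.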
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Structured parsing pipelines are usually implemented while building ingestion, forms, or document features. Backend and data-path code owns regex parsers, confidence scoring, and optional LLM validation hooks.
Where it fits
Estimate import accuracy on a sample CSV before committing to a full LLM-only parser.
Implement quiz or invoice field extraction with regex and gate model calls to low-confidence blocks.
Adjust confidence thresholds when vendor document layouts shift slightly in production.
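One way to reason about threshold adjustments is to check how the share of rows routed to the model changes at each candidate threshold, then compare that share against a baseline. The sketch below assumes confidence scores are logged per row; `llm_share_by_threshold`, `has_drifted`, and the 0.05 tolerance are hypothetical choices, not values taken from the skill.

```python
def llm_share_by_threshold(
    scores: list[float],
    thresholds: list[float],
) -> dict[float, float]:
    """For each threshold, return the fraction of rows scoring below it."""
    if not scores:
        return {t: 0.0 for t in thresholds}
    return {
        t: sum(1 for s in scores if s < t) / len(scores)
        for t in thresholds
    }


def has_drifted(
    baseline_share: float,
    current_share: float,
    tolerance: float = 0.05,  # assumed tolerance, tune per pipeline
) -> bool:
    """Flag a layout shift when the low-confidence share grows past tolerance."""
    return current_share - baseline_share > tolerance
```

A rising low-confidence share usually suggests the regex needs updating for the new layout; lowering the threshold only hides the shift.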
How it compares
Architecture guidance for parsers, not a drop-in OCR library or a single regex generator skill.
Common Questions / FAQ
Who is regex-vs-llm-structured-text for?
Solo developers and agent authors building importers, form processors, or document bots who must balance accuracy, latency, and token cost.
When should I use regex-vs-llm-structured-text?
In Build when designing extractors; in Validate when proving import accuracy on sample files; in Operate when tuning pipelines after production drift.
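For the Validate phase, import accuracy on a sample file can be estimated by comparing extracted rows with a small hand-labelled set. This is a sketch under the assumption that both sides are lists of field dictionaries in the same order; `field_accuracy` is a hypothetical helper, not part of the skill.

```python
def field_accuracy(
    predicted: list[dict[str, str]],
    expected: list[dict[str, str]],
) -> float:
    """Fraction of expected fields that the extractor reproduced exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    total = 0
    correct = 0
    for pred_row, gold_row in zip(predicted, expected):
        for field, gold_value in gold_row.items():
            total += 1
            # A missing field counts as wrong, the same as a mismatched value.
            if pred_row.get(field) == gold_value:
                correct += 1
    return correct / total if total else 0.0
```

Scoring per field, not per row, shows which fields the regex misses most often, which helps decide where a fallback is worth its cost.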
Is regex-vs-llm-structured-text safe to install?
It is procedural guidance; review the Security Audits panel on this Prism page and audit any code that calls external LLM APIs.
SKILL.md
# Regex vs LLM for Structured Text Parsing

A practical decision framework for parsing structured text (quizzes, forms, invoices, documents). The key insight: regex handles 95-98% of cases cheaply and deterministically. Reserve expensive LLM calls for the remaining edge cases.

## When to Activate

- Parsing structured text with repeating patterns (questions, forms, tables)
- Deciding between regex and LLM for text extraction
- Building hybrid pipelines that combine both approaches
- Optimizing cost/accuracy tradeoffs in text processing

## Decision Framework

```
Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│   ├── Regex handles 95%+ → Done, no LLM needed
│   └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directly
```

## Architecture Pattern

```
Source Text
    │
    ▼
[Regex Parser] ─── Extracts structure (95-98% accuracy)
    │
    ▼
[Text Cleaner] ─── Removes noise (markers, page numbers, artifacts)
    │
    ▼
[Confidence Scorer] ─── Flags low-confidence extractions
    │
    ├── High confidence (≥0.95) → Direct output
    │
    └── Low confidence (<0.95) → [LLM Validator] → Output
```

## Implementation

### 1. Regex Parser (Handles the Majority)

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0


def parse_structured_text(content: str) -> list[ParsedItem]:
    """Parse structured text using regex patterns."""
    pattern = re.compile(
        r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
        r"(?P<choices>(?:[A-D]\..+?\n)+)"
        r"Answer:\s*(?P<answer>[A-D])",
        re.MULTILINE | re.DOTALL,
    )
    items = []
    for match in pattern.finditer(content):
        choices = tuple(
            c.strip()
            for c in re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
        )
        items.append(ParsedItem(
            id=match.group("id"),
            text=match.group("text").strip(),
            choices=choices,
            answer=match.group("answer"),
        ))
    return items
```

### 2. Confidence Scoring

Flag items that may need LLM review:

```python
@dataclass(frozen=True)
class ConfidenceFlag:
    item_id: str
    score: float
    reasons: tuple[str, ...]


def score_confidence(item: ParsedItem) -> ConfidenceFlag:
    """Score extraction confidence and flag issues."""
    reasons = []
    score = 1.0

    if len(item.choices) < 3:
        reasons.append("few_choices")
        score -= 0.3

    if not item.answer:
        reasons.append("missing_answer")
        score -= 0.5

    if len(item.text) < 10:
        reasons.append("short_text")
        score -= 0.2

    return ConfidenceFlag(
        item_id=item.id,
        score=max(0.0, score),
        reasons=tuple(reasons),
    )


def identify_low_confidence(
    items: list[ParsedItem],
    threshold: float = 0.95,
) -> list[ConfidenceFlag]:
    """Return items below confidence threshold."""
    flags = [score_confidence(item) for item in items]
    return [f for f in flags if f.score < threshold]
```

### 3. LLM Validator (Edge Cases Only)

```python
def validate_with_llm(
    item: ParsedItem,
    original_text: str,
    client,
) -> ParsedItem:
    """Use LLM to fix low-confidence extractions."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheapest model for validation
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                f"Extract the question, choices, and answer from this text.\n\n"
                f"Text: {original_text}\n\n"
                f"Current extraction: {item}\n\n"
                f"Retu