
Liteparse
Parse PDFs, Office files, spreadsheets, and images locally into text or JSON for RAG pipelines and agents without cloud parse APIs or an LLM.
Overview
LiteParse is an agent skill for the Build phase that parses PDFs and other unstructured files locally with the `lit` CLI—no cloud or LLM required.
Install
npx skills add https://github.com/run-llama/llamaparse-agent-skills --skill liteparse
What is this skill?
- Local parsing for PDF, DOCX, PPTX, XLSX, images, and more via `@llamaindex/liteparse`
- Uses the `lit` CLI after `npm i -g @llamaindex/liteparse`—no cloud dependency or LLM required
- The agent confirms setup, then emits approved CLI commands or TypeScript scripts with format, page range, OCR, and DPI options
- Outputs json or text suitable for downstream chunking, search, or agent context
- Step 0 install and `lit --version` verification baked into the workflow
- Node 18+ required
- metadata version 0.1.0
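The format, page range, OCR, and DPI options mentioned above correspond to `lit parse` flags listed in the SKILL.md further down this page. As a rough sketch of how an agent-written TypeScript script might assemble them, the hypothetical helper below (`buildLitParseArgs` and `LitParseOptions` are illustrative names, not part of LiteParse) builds an argument array using only those documented flags:

```typescript
// Hypothetical helper, not part of LiteParse. It only assembles flags that
// the SKILL.md on this page documents for `lit parse`.
interface LitParseOptions {
  format?: "json" | "text"; // --format json | text
  output?: string;          // -o <file>
  targetPages?: string;     // --target-pages "1-5,10"
  ocr?: boolean;            // false adds --no-ocr
  dpi?: number;             // --dpi <n>
}

function buildLitParseArgs(file: string, opts: LitParseOptions = {}): string[] {
  const args: string[] = ["parse", file];
  if (opts.format) args.push("--format", opts.format);
  if (opts.output) args.push("-o", opts.output);
  if (opts.targetPages) args.push("--target-pages", opts.targetPages);
  if (opts.ocr === false) args.push("--no-ocr");
  if (opts.dpi !== undefined) {
    // Reject values that cannot be a sensible rendering DPI.
    if (!Number.isInteger(opts.dpi) || opts.dpi <= 0) {
      throw new RangeError("dpi must be a positive integer");
    }
    args.push("--dpi", String(opts.dpi));
  }
  return args;
}

const args = buildLitParseArgs("document.pdf", {
  format: "json",
  output: "output.json",
  dpi: 300,
});
console.log(["lit", ...args].join(" "));
// prints: lit parse document.pdf --format json -o output.json --dpi 300
```

Returning an array rather than a single string means the result can be handed to Node's `execFile("lit", args)` from `node:child_process` without shell quoting problems for file names that contain spaces.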
Adoption & trust: 2k installs on skills.sh; 62 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need text or JSON from PDFs and Office files for an agent or RAG pipeline but want to avoid cloud parsers and extra API cost.
Who is it for?
Builders on Node 18+ who can install a global npm CLI and want fast local document extraction for agent workflows.
Skip if: you run a fully managed, cloud-only parse pipeline with no local Node shell, or your team forbids global npm installs.
When should I use this skill?
Use it when the user asks to parse, convert, or spatially extract text from unstructured files (PDF, DOCX, PPTX, XLSX, images) locally, without cloud dependencies.
What do I get? / Deliverables
You get verified `lit` CLI commands or TypeScript scripts and parsed json/text output ready for chunking, indexing, or agent context.
- Approved lit CLI command or TypeScript script
- Parsed json or text output
- Setup verification steps
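The page describes the text output as ready for chunking but does not prescribe a chunking strategy. One common approach, shown here purely as an illustration, is fixed-size chunks with a small overlap. The helper name `chunkText` and the default sizes are assumptions for this sketch, not part of LiteParse:

```typescript
// Hypothetical helper, not part of LiteParse: split parsed plain text into
// fixed-size chunks where each chunk repeats the last `overlap` characters
// of the previous one.
function chunkText(text: string, size = 1000, overlap = 100): string[] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= size) {
    throw new RangeError("overlap must satisfy 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once the current window already reaches the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}

const sample = "abcdefghij";
console.log(chunkText(sample, 4, 1));
// prints: [ 'abcd', 'defg', 'ghij' ]
```

In practice the input string would be the plain-text result of `lit parse`, read from a file saved with `-o`, and the chunk size would be tuned to whatever embedding or context limit the downstream index imposes.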
Journey fit
Document ingestion is a Build-time integration step: you wire up agent tooling and local data prep before shipping. LiteParse is installed as a global npm CLI and invoked against arbitrary files, which makes it a classic third-party/local tool integration rather than frontend or auth backend work.
How it compares
This is a local CLI document extraction skill, not a replacement for a hosted LlamaParse MCP in regulated, cloud-only environments.
Common Questions / FAQ
Who is liteparse for?
Indie developers and agent users ingesting PDFs, slides, spreadsheets, and images into local AI workflows using Claude Code, Cursor, or Codex.
When should I use liteparse?
During Build/integrations when wiring document intake for RAG, support bots, or internal tools that must stay offline or on-device.
Is liteparse safe to install?
It requires shell and filesystem access to run npm global install and `lit`; review the Security Audits panel on this page and only parse files you trust.
SKILL.md
# LiteParse Skill

Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally with LiteParse: fast, lightweight, no cloud dependencies or LLM required.

## Initial Setup

When this skill is invoked, respond with:

```
I'm ready to use LiteParse to parse files locally. Before we begin, please confirm that:

- `@llamaindex/liteparse` is installed globally (`npm i -g @llamaindex/liteparse`)
- The `lit` CLI command is available in your terminal

If both are set, please provide:

1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.

I will produce the appropriate `lit` CLI command or TypeScript script, and once approved, report the results.
```

Then wait for the user's input.

---

## Step 0: Install LiteParse (if needed)

If `liteparse` is not yet installed, install it globally:

```bash
npm i -g @llamaindex/liteparse
```

Verify installation:

```bash
lit --version
```

For Office document support (DOCX, PPTX, XLSX), LibreOffice is required:

```bash
# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice
```

For image parsing, ImageMagick is required:

```bash
# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick
```

---

## Step 1: Produce the CLI Command or Script

### Parse a Single File

```bash
# Basic text extraction
lit parse document.pdf

# JSON output saved to a file
lit parse document.pdf --format json -o output.json

# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"

# Disable OCR (faster, text-only PDFs)
lit parse document.pdf --no-ocr

# Use an external HTTP OCR server for higher accuracy
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

# Higher DPI for better quality
lit parse document.pdf --dpi 300
```

### Batch Parse a Directory

```bash
lit batch-parse ./input-directory ./output-directory

# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
```

### Generate Page Screenshots

Screenshots are useful for LLM agents that need to see visual layout.

```bash
# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots

# High-DPI PNG
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

# Page range
lit screenshot document.pdf --pages "1-10" -o ./screenshots
```

---

## Step 3: Key Options Reference

### OCR Options

| Option | Description |
|--------|-------------|
| (default) | Tesseract.js (zero setup, built-in) |
| `--ocr-language fra` | Set OCR language (ISO code) |
| `--ocr-server-url <url>` | Use external HTTP OCR server (EasyOCR, PaddleOCR, custom) |
| `--no-ocr` | Disable OCR entirely |

### Output Options

| Option | Description |
|--------|-------------|
| `--format json` | Structured JSON with bounding boxes |
| `--format text` | Plain text (default) |
| `-o <file>` | Save output to file |

### Performance / Quality Options

| Option | Description |
|--------|-------------|
| `--dpi <n>` | Rendering DPI (default: 150; use 300 for high quality) |
| `--max-pages <n>` | Limit pages parsed |
| `--target-pages <pages>` | Parse specific pages (e.g. `"1-5,10"`) |
| `--no-precise-bbox` | Disable precise bounding boxes (faster) |
| `--skip-diagonal-text` | Ignore rotated/diagonal text |
| `--preserve-small-text` | Keep very small text that would otherwise be dropped |

---

## Step