
Kordoc Korean Document Parser
Turn Korean HWP, HWPX, and PDF files into Markdown or structured blocks for agents, scripts, or an MCP server.
Install
npx skills add https://github.com/aradotso/trending-skills --skill kordoc-korean-document-parserWhat is this skill?
- Parses HWP 5.x, HWPX, and PDF into Markdown plus IRBlock[] structured output
- Auto-detect format from buffer; exposes metadata (title, author) on success
- Optional pdfjs-dist peer for PDF; CLI via npx kordoc without global install
- Form field extraction and two-document compare for Korean office files
- MCP server setup path for agent-driven parse workflows
Adoption & trust: 673 installs on skills.sh; 31 GitHub stars; 3/3 security scanners passed (skills.sh audits).
Recommended Skills
Lark Maillarksuite/cli
Lark Slideslarksuite/cli
Pptxanthropics/skills
Pdfanthropics/skills
Lark Markdownlarksuite/cli
Docxanthropics/skills
Journey fit
Common Questions / FAQ
Is Kordoc Korean Document Parser safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
READMESKILL.md - Kordoc Korean Document Parser
# kordoc Korean Document Parser > Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection. kordoc is a TypeScript library and CLI for parsing Korean government documents (HWP 5.x, HWPX, PDF) into Markdown and structured `IRBlock[]` data. It handles proprietary HWP binary formats, table extraction, form field recognition, document diffing, and reverse Markdown→HWPX generation. --- ## Installation ```bash # Core library npm install kordoc # PDF support (optional peer dependency) npm install pdfjs-dist # CLI (no install needed) npx kordoc document.hwpx ``` --- ## Core API ### Auto-detect and Parse Any Document ```typescript import { parse } from "kordoc" import { readFileSync } from "fs" const buffer = readFileSync("document.hwpx") const result = await parse(buffer.buffer) // ArrayBuffer required if (result.success) { console.log(result.markdown) // string: full Markdown console.log(result.blocks) // IRBlock[]: structured data console.log(result.metadata) // { title, author, createdAt, pageCount, ... } console.log(result.outline) // OutlineItem[]: document structure console.log(result.warnings) // ParseWarning[]: skipped elements } else { console.error(result.error) // string message console.error(result.code) // ErrorCode: "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ... } ``` ### Format-Specific Parsers ```typescript import { parseHwpx, parseHwp, parsePdf, detectFormat } from "kordoc" // Detect format first const fmt = detectFormat(buffer.buffer) // "hwpx" | "hwp" | "pdf" | "unknown" // Parse by format const hwpxResult = await parseHwpx(buffer.buffer) const hwpResult = await parseHwp(buffer.buffer) const pdfResult = await parsePdf(buffer.buffer) ``` ### Parse Options ```typescript import { parse, ParseOptions } from "kordoc" const result = await parse(buffer.buffer, { pages: "1-3", // page range string // pages: [1, 5, 10], // or specific page numbers ocr: async (pageImage, pageNumber, mimeType) => { // Pluggable OCR for image-based PDFs // pageImage: ArrayBuffer of the page image return await myOcrService.recognize(pageImage) } }) ``` --- ## Working with IRBlocks ```typescript import type { IRBlock, IRBlockType, IRTable, IRCell } from "kordoc" // IRBlock types: "heading" | "paragraph" | "table" | "list" | "image" | "separator" for (const block of result.blocks) { if (block.type === "heading") { console.log(`H${block.level}: ${block.text}`) console.log(block.bbox) // { x, y, width, height, page } } if (block.type === "table") { const table = block as IRTable for (const row of table.rows) { for (const cell of row) { console.log(cell.text, cell.colspan, cell.rowspan) } } } if (block.type === "paragraph") { console.log(block.text) console.log(block.style) // InlineStyle: { bold, italic, fontSize, ... } console.log(block.pageNumber) } } ``` ### Convert Blocks Back to Markdown ```typescript import { blocksToMarkdown } from "kordoc" const markdown = blocksToMarkdown(result.blocks) ``` --- ## Document Comparison ```typescript import { compare } from "kordoc" const bufA = readFileSync("v1.hwp").buffer const bufB = readFileSync("v2.hwpx").buffer // cross-format supported const diff = await compare(bufA, bufB) console.log(diff.stats) // { added: 3, removed: 1, modified: 5, unchanged: 42 } for (const d of diff.diffs) { // d.type: "added" | "removed" | "modified" | "unchanged" // d.b