
Neo4j Document Import Skill
Turn PDFs and text corpora into a queryable Neo4j knowledge graph with chunking, extraction, and loader choices your agent can execute end to end.
Overview
neo4j-document-import-skill is an agent skill for the Build phase that guides importing unstructured documents into Neo4j as a knowledge graph via chunking, LLM extraction, and standard loaders.
Install
npx skills add https://github.com/neo4j-contrib/neo4j-skills --skill neo4j-document-import-skill
What is this skill?
- Covers PDF/text chunking through entity extraction via SimpleKGPipeline and related Neo4j graphrag patterns
- Compares five chunking strategies (fixed-size, sentence/paragraph, semantic, n-gram, structural) with neo4j-graphrag and
- Documents no-code LLM Graph Builder path plus apoc.load.json and LangChain/LlamaIndex document loaders
- Includes an extended reference (overflow from SKILL.md) with chunking-strategy and entity-resolution detail for when imports get noisy
- Install via npx skills add from neo4j-contrib/neo4j-skills repo
- Five chunking strategies documented in the strategy comparison table
- Supports SimpleKGPipeline, apoc.load.json, and LangChain/LlamaIndex loaders
Adoption & trust: 1 install on skills.sh; 80 GitHub stars; 2/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
You have piles of PDFs and text but no repeatable agent playbook to chunk, extract entities, and load them into Neo4j without re-researching every tool path.
Who is it for?
Solo builders shipping RAG or graph-augmented features on Neo4j who need structured import steps instead of one-off notebook experiments.
Skip if: You only need simple CRUD on Postgres, have no Neo4j instance, or want a finished production ETL pipeline without reviewing draft skill guidance.
When should I use this skill?
You need to import unstructured documents into Neo4j as a knowledge graph and must choose chunking, extraction, and load mechanisms.
What do I get? / Deliverables
Your agent picks a chunking strategy, extraction pipeline, and Neo4j load path aligned to your corpus, with extended reference for entity resolution when quality slips.
- Chosen chunking and extraction pipeline configuration for the corpus
- Step-by-step import runbook aligned to Neo4j loaders in use
- Notes on entity resolution when extended reference is loaded
Journey fit
Document-to-graph ingestion is core product integration work once you are building backends and data layers, not early ideation. The skill wires external document pipelines (LLM extraction, LangChain/LlamaIndex, APOC JSON) into Neo4j—classic integrations subphase work.
How it compares
Procedural import orchestration for Neo4j graphs—not a hosted MCP server or a generic vector-only embedding tutorial.
Common Questions / FAQ
Who is neo4j-document-import-skill for?
Indie developers and small teams using Claude Code, Cursor, or Codex who already chose Neo4j and want the agent to execute document-to-graph imports with known chunking and loader options.
When should I use neo4j-document-import-skill?
During Build integrations when ingesting PDFs or text, when comparing fixed-size versus semantic chunking for long docs, or when wiring LangChain/LlamaIndex loaders into your graph schema.
Is neo4j-document-import-skill safe to install?
Review the Security Audits panel on this Prism page and your agent’s network/filesystem permissions before pointing it at proprietary documents or production Neo4j credentials.
SKILL.md
SKILL.md - Neo4j Document Import Skill
> **Status: Draft / WIP**

# neo4j-document-import-skill

Guides agents through importing unstructured documents into Neo4j as a knowledge graph: PDF/text chunking, LLM entity extraction (SimpleKGPipeline), LLM Graph Builder (no-code), apoc.load.json, and LangChain/LlamaIndex document loaders.

**Install:**

```bash
npx skills add https://github.com/neo4j-contrib/neo4j-skills --skill neo4j-document-import-skill
```

Or paste this link into your coding assistant:
https://github.com/neo4j-contrib/neo4j-skills/tree/main/neo4j-document-import-skill

# KG Construction: Extended Reference

Overflow from `SKILL.md`; load when detailed chunking strategy or entity resolution config is needed.

---

## Chunking Strategy Comparison

| Strategy | How it splits | Best for | neo4j-graphrag class |
|---|---|---|---|
| Fixed-size | Token count with optional boundary respect | Dense technical docs; most use-cases | `FixedSizeSplitter(chunk_size, chunk_overlap)` |
| Sentence/paragraph | Natural language boundaries (`\n\n`, `.`) | Narrative text, news articles, course content | LangChain `CharacterTextSplitter(separator="\n\n")` |
| Semantic | Embedding similarity between adjacent sentences | Long-form documents with topic shifts | LangChain `SemanticChunker` (requires embedder) |
| N-gram | Overlapping windows of n words | Short snippets, keyword-dense text | Custom (not built into neo4j-graphrag) |
| Structural | By section/heading/method (doc-specific) | API docs, legal contracts, structured PDFs | Custom (parse structure, then chunk) |

**Rule**: Start with `FixedSizeSplitter(chunk_size=512, chunk_overlap=50)`. Switch to paragraph-based when sentences must not break (courses, articles). Switch to semantic chunking only when topic coherence within chunks is critical and embedder calls during ingestion are affordable.
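To make the starting rule concrete, here is a minimal plain-Python sketch of fixed-size splitting with overlap. It is only an illustration of the windowing idea, not neo4j-graphrag's implementation: the helper name `fixed_size_chunks` is hypothetical, and it counts characters for simplicity and ignores word boundaries.

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window after the first starts chunk_overlap characters before
    the end of the previous one, so neighbouring chunks share context.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Synthetic 1200-character sample so the chunk boundaries are easy to inspect.
sample = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = fixed_size_chunks(sample, chunk_size=512, chunk_overlap=50)

for i, chunk in enumerate(chunks):
    print(i, len(chunk))
```

With these settings the window advances 462 characters at a time, so the last 50 characters of each chunk reappear at the start of the next. A larger overlap preserves more cross-boundary context at the cost of more chunks to embed and extract from.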
**Combination pattern** (course content model from GraphAcademy course):

```
Course → Module → Lesson → Paragraph
```

Split the document into structural units (Module/Lesson), then chunk each Lesson into Paragraphs (`\n\n`). Store both levels; query at Paragraph for vector search, traverse to Lesson for context. Pattern:

```python
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1500, chunk_overlap=200)
paragraphs = splitter.split_documents(lesson_docs)
```

LangChain `CharacterTextSplitter` behavior:

1. Split by `separator` (paragraph breaks)
2. Combine paragraphs up to `chunk_size` chars
3. If a single paragraph > `chunk_size`: keep as-is (no mid-paragraph cut)
4. Add the last paragraph of chunk N to the start of chunk N+1 only if it's ≤ `chunk_overlap` chars

---

## Entity Resolver: Full Config

Resolvers merge duplicate entities after bulk ingest. All use APOC `refactor.mergeNodes` internally.

### Class Hierarchy

```
EntityResolver (base)
├── SinglePropertyExactMatchResolver - exact name match
├── BasePropertySimilarityResolver (abstract)
│   ├── FuzzyMatchResolver - Levenshtein; pip install rapidfuzz
│   └── SpaCySemanticMatchResolver - cosine; pip install neo4j-graphrag[nlp]
└── (custom subclass)
```

### SinglePropertyExactMatchResolver

```python
import asyncio

from neo4j_graphrag.experimental.components.resolver import SinglePropertyExactMatchResolver

# `driver` is an already-open neo4j.Driver instance
resolver = SinglePropertyExactMatchResolver(
    driver=driver,
    filter_query="WHERE n:Organization OR n:Person",  # optional: narrow scope
    resolve_property="name",  # default: "name"
    neo4j_database="neo4j",  # optional
)
stats = asyncio.run(resolver.run())
# stats.number_of_nodes_to_resolve, stats.number_of_created_nodes
```

### FuzzyMatchResolver

```python
from neo4j_graphrag.experimental.components.resolver import FuzzyMatchResolver

resolver = FuzzyMatchResolver(
    driver=driver,
    resolve_properties=["name"],  # list of properties to concatenate + compare
    threshold=0.9,  # Levenshtein similarity 0–1; lower = more aggressive merging
    filter_query="WHERE n:Organization",
)
a