
Neo4j Spark Skill
Wire Apache Spark or Databricks jobs to read and write Neo4j with correct connector options, partitioning, and Delta-to-graph ingestion patterns.
Install
npx skills add https://github.com/neo4j-contrib/neo4j-skills --skill neo4j-spark-skill
What is this skill?
- SparkSession setup with org.neo4j:neo4j-connector-apache-spark Maven coordinates
- Read paths: label scan, Cypher query, relationship scan; write paths with CREATE/MERGE and node.keys
- Partition and batch tuning (partitions, batch.size, schema.flatten.limit)
- Databricks cluster install, secrets, and Unity Catalog notes
- Delta Lake → Neo4j ingestion pipeline pattern with PySpark and Scala examples
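As a rough illustration of the SparkSession setup item above, the sketch below builds the connector's Maven coordinate and passes it to Spark through `spark.jars.packages`. The artifact name and the `5.4.2_for_spark_3` version string come from this page; the helper names `connector_coordinate` and `build_session` and the app name are hypothetical, and the Scala suffix should match the cluster in use.

```python
def connector_coordinate(scala_version: str = "2.12",
                         connector_version: str = "5.4.2_for_spark_3") -> str:
    """Build the Maven coordinate string for the Neo4j Spark connector.

    Hypothetical helper: the connector is published per Scala version,
    so the Scala suffix is part of the artifact name.
    """
    if scala_version not in ("2.12", "2.13"):
        raise ValueError(f"unsupported Scala version: {scala_version}")
    return (
        "org.neo4j:neo4j-connector-apache-spark_"
        f"{scala_version}:{connector_version}"
    )


def build_session(app_name: str = "neo4j-spark-demo",
                  scala_version: str = "2.12"):
    """Create a SparkSession that pulls the connector from Maven.

    PySpark is imported lazily so connector_coordinate() can be used
    (and tested) on a machine without Spark installed.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)  # app name is an illustrative assumption
        .config("spark.jars.packages", connector_coordinate(scala_version))
        .getOrCreate()
    )
```

On Databricks the same coordinate would typically be attached to the cluster as a Maven library rather than set in code, since the session already exists when a notebook starts.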
Adoption & trust: 1 install on skills.sh; 80 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
Journey fit
Primary fit
Spark–Neo4j integration is build-time systems work connecting analytics pipelines to the graph store. Connector setup, DataFrame read/write modes, and cloud cluster install belong in integrations rather than generic backend CRUD.
Common Questions / FAQ
Is Neo4j Spark Skill safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md - Neo4j Spark Skill
# neo4j-spark-skill

Skill for reading and writing Neo4j data using the Neo4j Connector for Apache Spark, including Databricks, EMR, and standalone Spark environments.

**Covers:**
- SparkSession setup with Maven artifact `org.neo4j:neo4j-connector-apache-spark`
- DataFrame reads: label scan, Cypher query, relationship scan
- DataFrame writes: node CREATE/MERGE, relationship write with source/target mapping
- `node.keys` for Overwrite (MERGE) mode
- Partition and batch tuning (`partitions`, `batch.size`, `schema.flatten.limit`)
- Databricks cluster installation, secrets management, Unity Catalog notes
- Delta Lake → Neo4j ingestion pipeline pattern
- PySpark and Scala code examples

**Version / Compatibility:**
- Connector: `5.4.2_for_spark_3` (Scala 2.12 or 2.13)
- Spark: 3.3, 3.4, 3.5
- Databricks Runtime: 12.2, 13.3, 14.3 LTS
- Neo4j: 4.4, 5.x, 2025.x

**Not covered:**
- Cypher query authoring → `neo4j-cypher-skill`
- Neo4j Python bolt driver → `neo4j-driver-python-skill`
- GDS graph algorithms → `neo4j-gds-skill`
- Spring Boot + Neo4j → `neo4j-spring-data-skill`

**Install:**

```bash
npx skills add https://github.com/neo4j-contrib/neo4j-skills --skill neo4j-spark-skill
```

Or paste this link into your coding assistant: https://github.com/neo4j-contrib/neo4j-skills/tree/main/neo4j-spark-skill

# Neo4j Spark Connector: Read Options Reference

Full option reference for `.read.format("org.neo4j.spark.DataSource")`.

## Core Read Options (mutually exclusive, pick one)

| Option | Value | Description |
|--------|-------|-------------|
| `labels` | `:Label` or `:Label1:Label2` | Read nodes with given label(s). Multiple = AND. |
| `query` | Cypher string | Custom MATCH ... RETURN query. Aliases become column names. |
| `relationship` | `REL_TYPE` | Read relationships of given type. Requires source/target label options. |
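
The pick-one rule for the core read options can be enforced before a job ever reaches the cluster. The sketch below is a minimal illustration and not part of the connector: `neo4j_read_options` and `read_neo4j` are hypothetical helpers, the `bolt://localhost:7687` URL and the `:Movie` label in the usage are placeholder values, and authentication options are left out. It assumes the connector's `url` option plus the `relationship.source.labels` and `relationship.target.labels` sub-options listed in this reference.

```python
from typing import Dict, Optional

NEO4J_FORMAT = "org.neo4j.spark.DataSource"


def neo4j_read_options(url: str,
                       labels: Optional[str] = None,
                       query: Optional[str] = None,
                       relationship: Optional[str] = None,
                       source_labels: Optional[str] = None,
                       target_labels: Optional[str] = None) -> Dict[str, str]:
    """Return reader options, allowing exactly one core read mode."""
    candidates = (("labels", labels), ("query", query),
                  ("relationship", relationship))
    chosen = {name: value for name, value in candidates if value}
    if len(chosen) != 1:
        raise ValueError(
            "set exactly one of labels, query, relationship; got: "
            + (", ".join(sorted(chosen)) or "none")
        )

    options = {"url": url, **chosen}
    if relationship:
        # A relationship scan also needs the labels of both end nodes.
        if not (source_labels and target_labels):
            raise ValueError(
                "relationship reads require source_labels and target_labels")
        options["relationship.source.labels"] = source_labels
        options["relationship.target.labels"] = target_labels
    return options


def read_neo4j(spark, options: Dict[str, str]):
    """Load a DataFrame through the connector using the given options."""
    return spark.read.format(NEO4J_FORMAT).options(**options).load()


# Usage with placeholder values; `spark` would be an existing SparkSession.
rel_options = neo4j_read_options(
    "bolt://localhost:7687",
    relationship="ACTED_IN",
    source_labels=":Person",
    target_labels=":Movie",
)
# df = read_neo4j(spark, rel_options)
```

Keeping option assembly in a plain function means the validation can be unit-tested without a Spark or Neo4j instance.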

## Label Read Sub-Options

| Option | Default | Description |
|--------|---------|-------------|
| `node.keys` | (none) | Comma-separated property names to include as match keys |

## Relationship Read Sub-Options

| Option | Required | Description |
|--------|----------|-------------|
| `relationship.source.labels` | Yes | Colon-prefixed labels of the source node, e.g. `:Label` |
| `relationship.target.labels` | Yes | Colon-prefixed labels of the target node, e.g. `:Label` |

## Query Read Sub-Options

| Option | Description |
|--------|-------------|
| `query.count` | Cypher count query for partition planning (e.g. `MATCH (n:Person) RETURN count(n)`). Avoids full count scan. |

## Partition and Performance Options

| Option | Default | Description |
|--------|---------|-------------|
| `partitions` | `1` | Number of Spark partitions. Connector uses SKIP/LIMIT internally. |
| `batch.size` | `5000` | Rows per partition batch. |
| `schema.flatten.limit` | `10` | Rows sampled for schema inference (no APOC). Increase for heterogeneous nodes. |

## Output Columns

**Label scan result columns:**
- `<id>`: internal Neo4j element ID
- `<labels>`: array of node labels
- One column per node property

**Relationship scan result columns:**
- `<rel.id>`: internal relationship ID
- `<rel.type>`: relationship type string
- `<source.id>`, `<source.labels>`, `source.<prop>`: source node fields
- `<target.id>`, `<target.labels>`, `target.<prop>`: target node fields
- Relationship property columns at top level

## Schema Inference Notes

- Without APOC: samples `schema.flatten.limit` rows to infer types
- With APOC: uses `apoc.meta.nodeTypeProperties`, which is more accurate
- Map/list properties: flattened into dot-notation columns (e.g.
  `address.city`)
- Use `query` mode with explicit RETURN types when inference is unreliable

## Examples

### Multi-label AND filter

```python
df = (spark.read.format("org.neo4j.spark.DataSource")
      .option("labels", ":Person:Employee")
      .load())
```

### Cypher with explicit column types

```python
df = (spark.read.format("org.neo4j.spark.DataSource")
      .option("query", """
          MATCH (p:Person)-[r:ACTED_IN]