# Neo4j Import Skill

## When to Use
- Importing CSV, JSON, or Parquet files into Neo4j
- Batch-upserting nodes and relationships (UNWIND + CALL IN TRANSACTIONS)
- Migrating relational data (SQL → graph)
- Bulk-loading large datasets offline (neo4j-admin import)
- Choosing between online (Cypher) and offline (admin) import methods
- Verifying import completeness (counts, constraints, index states)
## When NOT to Use

- Unstructured docs, PDFs, vector chunks → neo4j-document-import-skill
- Live application writes (MERGE/CREATE in app code) →
- neo4j-admin backup/restore/config →
- GDS algorithm projection from existing graph →
## Method Decision Table
| Dataset size | DB state | Source | Method |
|---|---|---|---|
| Any size | Online | CSV (Aura or local) | LOAD CSV + CALL IN TRANSACTIONS |
| < 1M rows | Online | List/API response | UNWIND + CALL IN TRANSACTIONS |
| > 10M rows | Offline (local/self-managed) | CSV / Parquet | neo4j-admin database import full |
| Any size | Online | APOC available | `apoc.load.*` + CALL IN TRANSACTIONS |
| Any size | Online | JSON/API | `apoc.load.json` or driver batching |
| Incremental delta | Offline (Enterprise) | CSV | neo4j-admin database import incremental |
Aura: only remote (`https://`) URLs; `file:///` paths are not available. Use `neo4j-admin` import only on self-managed instances.
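The UNWIND route from the table, as a minimal sketch (the `$rows` parameter and its `id`/`name` keys are illustrative):

```cypher
CYPHER 25
// $rows is a list-of-maps query parameter, e.g. [{id: 'p001', name: 'Alice'}, ...]
UNWIND $rows AS row
CALL (row) {
  MERGE (p:Person {id: row.id})
  ON CREATE SET p.name = row.name
} IN TRANSACTIONS OF 10000 ROWS
ON ERROR CONTINUE
REPORT STATUS AS s
RETURN s.committed, s.errorMessage
```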
## Pre-Import Checklist
Run in this exact order — skipping causes hard-to-debug duplicates or missed index usage:
**Constraints BEFORE import. Additional indexes AFTER import.**
- Constraints create implicit RANGE indexes used by MERGE during load + enforce uniqueness
- Additional non-unique indexes (TEXT, RANGE on non-key props, FULLTEXT) created after load — Neo4j populates them async from the committed data; poll until 100%
- Creating extra indexes before import slows every write during load with no benefit
1. Create uniqueness constraints (enables index used by MERGE):

   ```cypher
   CREATE CONSTRAINT IF NOT EXISTS FOR (n:Person) REQUIRE n.id IS UNIQUE;
   CREATE CONSTRAINT IF NOT EXISTS FOR (n:Movie) REQUIRE n.movieId IS UNIQUE;
   ```
2. Verify APOC if using `apoc.*` procedures (e.g. `RETURN apoc.version();`).
   If it fails → APOC not installed. Use plain LOAD CSV instead.
3. Confirm target is PRIMARY (not replica):

   ```cypher
   SHOW DATABASES YIELD name, role RETURN name, role;
   ```

   If role ≠ `primary` → stop. Redirect writes to the PRIMARY endpoint.
4. Count source file rows before import (catch encoding issues early):

   ```bash
   wc -l data/persons.csv   # Linux/macOS
   ```
5. Verify UTF-8 encoding — LOAD CSV requires UTF-8. Re-encode if needed:

   ```bash
   file -i persons.csv                                        # check encoding
   iconv -f latin1 -t utf-8 persons.csv > persons_utf8.csv
   ```
## LOAD CSV Patterns

### Basic node import with type coercion and null handling
```cypher
CYPHER 25
LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row
CALL (row) {
MERGE (p:Person {id: row.id})
ON CREATE SET
p.name = row.name,
p.age = toIntegerOrNull(row.age),
p.score = toFloatOrNull(row.score),
p.active = toBoolean(row.active),
p.born = CASE WHEN row.born IS NOT NULL AND row.born <> '' THEN date(row.born) ELSE null END,
p.createdAt = datetime()
ON MATCH SET
p.updatedAt = datetime()
} IN TRANSACTIONS OF 10000 ROWS
ON ERROR CONTINUE
REPORT STATUS AS s
RETURN s.transactionId, s.committed, s.errorMessage
```
Null/empty-string rules:

- CSV missing column → `null` (safe)
- CSV empty string → stored as `''`, not `null`; use `nullIf(row.col, '')` to convert
- `toInteger()` throws on non-coercible types → always use `toIntegerOrNull()`
- `date('')` throws → always guard with `CASE`, as in the example above
- Neo4j never stores `null` properties; they are silently dropped on SET
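A minimal sketch of the empty-string conversion (the `nickname` column is illustrative):

```cypher
CYPHER 25
LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row
CALL (row) {
  MERGE (p:Person {id: row.id})
  // nullIf(x, '') turns '' into null, so the property is dropped instead of stored empty
  SET p.nickname = nullIf(row.nickname, ''),
      p.age      = toIntegerOrNull(row.age)
} IN TRANSACTIONS OF 10000 ROWS
```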
### Relationship import (nodes must exist first)
```cypher
CYPHER 25
LOAD CSV WITH HEADERS FROM 'file:///knows.csv' AS row
CALL (row) {
MATCH (a:Person {id: row.fromId})
MATCH (b:Person {id: row.toId})
MERGE (a)-[:KNOWS {since: toIntegerOrNull(row.year)}]->(b)
} IN TRANSACTIONS OF 5000 ROWS
ON ERROR CONTINUE
REPORT STATUS AS s
```
Always import ALL nodes before ANY relationships: a MATCH on a missing node yields no rows, so the relationship MERGE is silently skipped.
### Tab-separated or custom delimiter
```cypher
CYPHER 25
LOAD CSV WITH HEADERS FROM 'file:///data.tsv' AS row FIELDTERMINATOR '\t'
CALL (row) { MERGE (p:Person {id: row.id}) }
IN TRANSACTIONS OF 10000 ROWS ON ERROR CONTINUE
```
### Compressed files (ZIP / gzip — local files only)
```cypher
LOAD CSV WITH HEADERS FROM 'file:///archive.csv.gz' AS row ...
```
### Cloud storage (Enterprise Edition)
| Scheme | Example |
|---|---|
| AWS S3 | `s3://my-bucket/data/persons.csv` |
| Google Cloud Storage | `gs://my-bucket/persons.csv` |
| Azure Blob | `azb://account/container/persons.csv` |
### Useful built-in functions inside LOAD CSV
```cypher
linenumber() // current line number — use as fallback ID
file()       // absolute path of file being loaded
```
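For example, `linenumber()` can back-fill rows whose ID column is empty (a sketch; the `'line-'` key scheme is illustrative):

```cypher
CYPHER 25
LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row
// capture the line number outside the subquery, then use it as a fallback key
WITH row, linenumber() AS ln
CALL (row, ln) {
  MERGE (p:Person {id: coalesce(nullIf(row.id, ''), 'line-' + toString(ln))})
} IN TRANSACTIONS OF 10000 ROWS
```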
## CALL IN TRANSACTIONS — Full Reference

### Syntax
```cypher
CALL (row) {
// write logic
} IN [n CONCURRENT] TRANSACTIONS
[OF batchSize ROW[S]]
[ON ERROR {CONTINUE | BREAK | FAIL | RETRY [FOR duration SECONDS] [THEN {CONTINUE|BREAK|FAIL}]}]
[REPORT STATUS AS statusVar]
```
### ON ERROR modes
| Mode | Behavior | Use when |
|---|---|---|
| `ON ERROR FAIL` | Default. Rolls back entire outer tx on first error | All-or-nothing strict import |
| `ON ERROR CONTINUE` | Skips failed batch, continues remaining batches | Resilient bulk load — track errors via REPORT STATUS |
| `ON ERROR BREAK` | Stops after first failed batch; keeps completed work | Semi-strict: stop early, keep successful batches |
| `ON ERROR RETRY` | Exponential backoff retry (default 30s) + fallback | Concurrent writes with deadlock risk |

- `ON ERROR CONTINUE` → outer transaction succeeds even if inner batches fail.
- `ON ERROR FAIL` → cannot be combined with `REPORT STATUS`.
### CONCURRENT TRANSACTIONS (parallel batches)
```cypher
CYPHER 25
LOAD CSV WITH HEADERS FROM 'file:///large.csv' AS row
CALL (row) {
MERGE (p:Person {id: row.id}) SET p.name = row.name
} IN 4 CONCURRENT TRANSACTIONS OF 5000 ROWS
ON ERROR RETRY FOR 30 SECONDS THEN CONTINUE
REPORT STATUS AS s
```
Use CONCURRENT for read-heavy MERGE on non-overlapping key spaces. Risk: deadlocks on overlapping writes → combine with `ON ERROR RETRY`.
### REPORT STATUS columns
| Column | Type | Meaning |
|---|---|---|
| `started` | BOOLEAN | Batch transaction started |
| `committed` | BOOLEAN | Batch committed successfully |
| `transactionId` | STRING | Transaction ID |
| `errorMessage` | STRING or null | Error detail if batch failed |
### Batch size guidance
| Row count | Recommended batch size | Notes |
|---|---|---|
| < 100k | 10 000 | Default is fine |
| 100k – 1M | 10 000 – 50 000 | Monitor heap; increase if fast |
| 1M – 10M | 50 000 – 100 000 | Enable CONCURRENT if CPUs available |
| > 10M online | 50 000 | Consider neo4j-admin import instead |
| Relationship import | 5 000 | Lower — each batch does 2x MATCH |
## neo4j-admin import (Offline Bulk Load)
Fastest method: ~3 min for 31M nodes / 78M rels on SSD. DB must be stopped or non-existent.
### Command structure
```bash
neo4j-admin database import full \
--nodes=Person="persons_header.csv,persons.csv" \
--nodes=Movie="movies_header.csv,movies.csv" \
--relationships=ACTED_IN="acted_in_header.csv,acted_in.csv" \
--relationships=DIRECTED="directed_header.csv,directed.csv" \
--delimiter=, \
--id-type=STRING \
--bad-tolerance=0 \
--threads=$(nproc) \
--high-parallel-io=on \
  neo4j
```
For SSDs: always set `--high-parallel-io=on`. For large graphs (>34B nodes/rels): set `--format=high_limit`.
**Dry run (2026.02+)** — validate without writing:

```bash
neo4j-admin database import full --dry-run ...
```
### Node header file format
```csv
# persons_header.csv
personId:ID(Person),name,born:int,score:float,active:boolean,:LABEL

# persons.csv (data file — no header row)
p001,Alice,1985,9.2,true,Person
p002,Bob,1990,7.1,false,Person
```
| Field | Meaning |
|---|---|
| `:ID` | Unique ID for relationship wiring (not stored as property by default) |
| `:ID(Group)` | Scoped ID space — use when node types share IDs |
| `:LABEL` | One or more labels; semicolon-separated: `Person;Actor` |
| `prop:type` | Typed property; types: `int`, `long`, `float`, `double`, `boolean`, `byte`, `short`, `string` |
| `prop:type` | Temporal: `date`, `localtime`, `time`, `localdatetime`, `datetime`, `duration` |
| `prop:type[]` | Array — semicolon-separated values in cell |
|  | Float vector (2025.10+) — semicolon-separated coordinates |
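For instance, a header mixing grouped-ID, temporal, and array columns might look like this (column names illustrative):

```csv
# people_header.csv
personId:ID(Person),born:date,tags:string[],:LABEL

# people.csv (the tags cell holds a semicolon-separated array)
p003,1978-04-02,admin;editor,Person
```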
### Relationship header file format
```csv
# acted_in_header.csv
:START_ID(Person),:END_ID(Movie),role,:TYPE

# acted_in.csv
p001,tt0133093,Neo,ACTED_IN
p002,tt0133093,Morpheus,ACTED_IN
```
`:START_ID` / `:END_ID` must reference the same ID group as the node files.
### Key flags
| Flag | Default | Notes |
|---|---|---|
| `--delimiter` | `,` | Single char or `TAB` |
| `--id-type` | `STRING` | One of `STRING`, `INTEGER`, `ACTUAL` |
| `--bad-tolerance` | unlimited (changed 2025.12) | Set `0` for strict prod imports |
| `--threads` | CPU count | Set explicitly on shared hosts |
| `--max-off-heap-memory` | 90% RAM | Reduce if other services share host |
| `--high-parallel-io` | `auto` | Set `on` for SSD/NVMe |
| `--format` | store default | `high_limit` for >34B nodes/rels |
| `--force` | false | Required if DB already exists |
| `--dry-run` | false | 2026.02+ — validate without writing |
### Schema file (`--schema`) [Enterprise, block format]
Pass a Cypher file with `CREATE CONSTRAINT` / `CREATE INDEX` statements; executed automatically after import completes. Constraints are created first (correct order enforced). File paths can be local or remote (`s3://`, `gs://`, `azb://`).
```bash
neo4j-admin database import full \
--format=block \
--schema=schema.cypher \
--nodes=Person="persons_header.csv,persons.csv" \
  neo4j
```
```cypher
// schema.cypher
CREATE CONSTRAINT person_id IF NOT EXISTS FOR (n:Person) REQUIRE n.id IS UNIQUE;
CREATE CONSTRAINT movie_id IF NOT EXISTS FOR (n:Movie) REQUIRE n.id IS UNIQUE;
CREATE RANGE INDEX person_email IF NOT EXISTS FOR (n:Person) ON (n.email);
CREATE TEXT INDEX movie_title IF NOT EXISTS FOR (n:Movie) ON (n.title);
```
For incremental import, `DROP INDEX` / `DROP CONSTRAINT` are also supported [2025.02+]: they remove indexes before the merge phase, and the matching `CREATE` statements recreate them after, for faster writes.
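A sketch of such a schema file, assuming the DROPs run before the merge phase and the matching CREATEs after it (index name illustrative):

```cypher
// incremental_schema.cypher
DROP INDEX person_email IF EXISTS;
CREATE RANGE INDEX person_email IF NOT EXISTS FOR (n:Person) ON (n.email);
```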
### Incremental import (Enterprise only)
Three-phase process — use when DB must stay online during import preparation:
```bash
# Phase 1: Prepare staging area
neo4j-admin database import incremental --stage=prepare \
--nodes=Person=persons_header.csv,delta.csv --force neo4j
# Phase 2: Build indexes (DB can be read-only during this phase)
neo4j-admin database import incremental --stage=build neo4j
# Phase 3: Merge into live database (brief write-lock)
neo4j-admin database import incremental --stage=merge neo4j
```
Requires Enterprise Edition + `block` store format.
## APOC Patterns (when APOC is available)
Verify first with `RETURN apoc.version();`; if it fails, use LOAD CSV or the driver instead.
### apoc.periodic.iterate — batch-process existing graph data
```cypher
CALL apoc.periodic.iterate(
"MATCH (p:Person) WHERE NOT (p)-[:HAS_ACCOUNT]->() RETURN p",
"CREATE (p)-[:HAS_ACCOUNT]->(a:Account {id: randomUUID()})",
{batchSize: 10000, parallel: false, retries: 2}
) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
```
| Config key | Default | Notes |
|---|---|---|
| `batchSize` | 10000 | Rows per inner transaction |
| `parallel` | false | Enable for non-overlapping writes; risk: deadlocks |
| `retries` | 0 | Retry failed batches N times with 100ms delay |
Prefer `CALL ... IN TRANSACTIONS` (native Cypher) over `apoc.periodic.iterate` for new code: it has `ON ERROR` modes, `REPORT STATUS`, and `CONCURRENT` execution built in without the APOC dependency.
### apoc.load.csv — load with config options
```cypher
CALL apoc.load.csv('file:///persons.csv', {
header: true,
sep: ',',
skip: 1,
limit: 1000000
}) YIELD lineNo, map, list
CALL (map) {
MERGE (p:Person {id: map.id}) SET p.name = map.name
} IN TRANSACTIONS OF 10000 ROWS ON ERROR CONTINUE
```
### apoc.load.json — load JSON from file or URL
```cypher
CALL apoc.load.json('https://api.example.com/persons') YIELD value
CALL (value) {
MERGE (p:Person {id: value.id}) SET p.name = value.name
} IN TRANSACTIONS OF 1000 ROWS ON ERROR CONTINUE
```
## Driver Batch Write Pattern
Use when the source is not a file (API responses, DB migrations). Collect rows into batch-sized lists (10 000), run `UNWIND $rows AS row MERGE ...` once per batch. ~10x faster than row-at-a-time writes. → Python + JS examples
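The per-batch query the driver sends, as a minimal sketch (`$rows` keys are illustrative):

```cypher
// executed once per 10 000-row batch, with $rows bound to a list of maps
UNWIND $rows AS row
MERGE (p:Person {id: row.id})
ON CREATE SET p.name = row.name, p.createdAt = datetime()
ON MATCH SET p.updatedAt = datetime()
```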
## MCP Tool Usage
| Operation | MCP tool | Notes |
|---|---|---|
| `SHOW CONSTRAINTS`, `SHOW INDEXES` | `get_neo4j_schema` | Always inspect before import |
| `CREATE CONSTRAINT`, `CREATE INDEX` | `write_neo4j_cypher` | Gate: show planned constraint, confirm |
| LOAD CSV / CALL IN TRANSACTIONS | `write_neo4j_cypher` | Gate: show row count + Cypher, confirm |
| Verify counts | `read_neo4j_cypher` | Post-import: `MATCH (n:Label) RETURN count(n)` |
| Poll index state | `read_neo4j_cypher` | Poll until all `ONLINE` |
Write gate — before any bulk write via MCP, show:

- Query + affected labels
- Estimated row count from source
- Batch size + `ON ERROR` plan

Wait for user confirmation. Never auto-execute `write_neo4j_cypher` or `neo4j-admin` commands without confirmation.
Always pass the `database` param if not default: `{"code": "...", "database": "neo4j"}`.
## Common Errors
| Error | Cause | Fix |
|---|---|---|
| `Couldn't load the external resource` | Path not in Neo4j import dir | Move file to `import/`; check `dbms.security.allow_csv_import_from_file_urls=true` |
| `Cannot merge node using null property value` | MERGE key resolved to null | Validate before MERGE; add `WHERE row.id IS NOT NULL` |
| `toInteger() called on null` | Null column fed to non-null-safe fn | Replace `toInteger()` → `toIntegerOrNull()`, `toFloat()` → `toFloatOrNull()` |
| Uniqueness constraint violation mid-import | Duplicate source IDs | Dedup source CSV; use `MERGE`, not `CREATE` |
| Heap overflow / OutOfMemoryError | Batch too large or file too large | Reduce batch size; use `CALL IN TRANSACTIONS`; neo4j-admin for offline |
| `Invalid input 'IN': expected...` | `USING PERIODIC COMMIT` used | Replace with `CALL ... IN TRANSACTIONS` (PERIODIC COMMIT removed in Cypher 25) |
| neo4j-admin: header parse failure | Wrong header format or type mismatch | Check `:ID`, `:LABEL`, `:TYPE` present; check typed columns parse correctly |
| neo4j-admin: import fails silently | `--bad-tolerance` default is unlimited | Set `--bad-tolerance=0` to surface all errors |
| Index not used during MERGE | Constraint not created before import | Drop data, create constraint, re-import |
| Relationship import missing nodes | Relationships imported before nodes | Always import ALL node files before ANY relationship files |
## Post-Import Validation
After import completes — run all:
```cypher
// Row counts per label
MATCH (n:Person) RETURN count(n) AS persons;
MATCH ()-[:KNOWS]->() RETURN count(*) AS knows_rels;
// After import: create additional non-unique indexes (populated async)
CREATE TEXT INDEX movie_title IF NOT EXISTS FOR (n:Movie) ON (n.title);
CREATE RANGE INDEX person_born IF NOT EXISTS FOR (n:Person) ON (n.born);
// Poll population — wait until populationPercent = 100 before opening to queries
SHOW INDEXES YIELD name, state, populationPercent
WHERE state <> 'ONLINE' OR populationPercent < 100
RETURN name, state, populationPercent
ORDER BY populationPercent;
// Spot check: null keys = import bug
MATCH (p:Person) WHERE p.id IS NULL RETURN count(p) AS missing_id;
```
**Do NOT run production queries until all indexes are `ONLINE`.**
## References
- LOAD CSV — Cypher Manual 25
- CALL IN TRANSACTIONS — Cypher Manual
- neo4j-admin database import
- APOC periodic execution
- APOC load procedures
- GraphAcademy: Importing CSV Data
- Indexes and constraints — types, MERGE lock semantics, import pre-flight
- Data Importer GUI — when to use, Aura access, multi-pass, gotchas
- Post-import refactoring — split lists, extract nodes, add labels, FK validation
## Checklist