neo4j-document-import-skill

Neo4j Document Import Skill

When to Use

  • Ingesting PDFs, HTML, plain text, Markdown into Neo4j as a knowledge graph
  • Chunking documents and storing `:Chunk` nodes with embeddings
  • Extracting entities and relationships from text with an LLM
  • Using `SimpleKGPipeline` (neo4j-graphrag) programmatically
  • Using Neo4j LLM Graph Builder (no-code web UI)
  • Loading semi-structured JSON via `apoc.load.json`
  • Connecting LangChain or LlamaIndex document loaders to Neo4j

When NOT to Use

  • Structured CSV / relational data → `neo4j-import-skill`
  • GraphRAG retrieval after ingestion → `neo4j-graphrag-skill`
  • Vector index creation → `neo4j-vector-search-skill`
  • Cypher query writing → `neo4j-cypher-skill`

Approach Decision Table

| Situation | Approach |
|---|---|
| No code; drag-and-drop UX wanted | LLM Graph Builder web UI |
| Programmatic pipeline; PDFs/text | `SimpleKGPipeline` (neo4j-graphrag) |
| JSON / REST API responses | `apoc.load.json` or Python + UNWIND |
| LangChain already in stack | `Neo4jGraph` + document loader |
| LlamaIndex already in stack | `Neo4jQueryEngine` / `Neo4jVectorStore` |
| Chunk-only (no entity extraction) | Manual chunking + MERGE pattern |

Install

```bash
pip install neo4j-graphrag              # includes SimpleKGPipeline
pip install neo4j-graphrag[openai]      # + OpenAI LLM/embedder
pip install neo4j-graphrag[anthropic]   # + Anthropic Claude
pip install neo4j-graphrag[google]      # + Vertex AI / Gemini
```

spaCy entity resolver (Python <= 3.13 only — unsupported on 3.14+):

```bash
pip install neo4j-graphrag[nlp]
```

Requires: `neo4j>=6.0.0`, Python>=3.10, Neo4j>=5.18.1 (Aura>=5.18.0).

---

Step 1 — Define Graph Schema

Schema controls what the LLM extracts. Define it before constructing the pipeline.

Option A — Simple string lists (LLM infers descriptions)

```python
entities = ["Person", "Organization", "Location", "Product", "Event"]
relations = ["WORKS_AT", "LOCATED_IN", "KNOWS", "MENTIONS", "PART_OF"]
patterns = [
    ("Person", "WORKS_AT", "Organization"),
    ("Organization", "LOCATED_IN", "Location"),
    ("Person", "KNOWS", "Person"),
    ("Article", "MENTIONS", "Organization"),
]
```

Option B — Rich schema (better extraction quality)

```python
from neo4j_graphrag.experimental.components.schema import (
    SchemaBuilder, SchemaEntity, SchemaRelation
)

schema = SchemaBuilder().create_schema_from_dict({
    "entities": {
        "Person": {"description": "A human individual",
                   "properties": {"name": "str", "role": "str"}},
        "Organization": {"description": "A company or institution",
                         "properties": {"name": "str", "industry": "str"}},
    },
    "relations": {
        "WORKS_AT": {"description": "Employment relationship"},
    },
    "patterns": [("Person", "WORKS_AT", "Organization")],
})
```

Option C — Auto-extract schema from text (no constraints)

```python
schema = "EXTRACTED"  # LLM infers types; noisier output
schema = "FREE"       # No schema guidance; most noise
```

Use Option B for production; Option A for prototyping; `"EXTRACTED"` only for exploration.
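Before wiring a schema into the pipeline, it is worth checking that every pattern only references declared types. A small sanity-check helper (hypothetical, not part of neo4j-graphrag):

```python
def validate_patterns(entities, relations, patterns):
    """Return a list of problems; an empty list means the schema is internally consistent."""
    problems = []
    entity_set, relation_set = set(entities), set(relations)
    for source, rel, target in patterns:
        if source not in entity_set:
            problems.append(f"pattern source {source!r} not declared in entities")
        if rel not in relation_set:
            problems.append(f"pattern relation {rel!r} not declared in relations")
        if target not in entity_set:
            problems.append(f"pattern target {target!r} not declared in entities")
    return problems
```

Run against the Option A lists above, it would flag the `("Article", "MENTIONS", "Organization")` pattern, since `Article` is not in the entities list.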

---

Step 2 — SimpleKGPipeline Setup

```python
import asyncio
from neo4j import GraphDatabase
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings

driver = GraphDatabase.driver(
    "neo4j+s://xxxx.databases.neo4j.io",
    auth=("neo4j", "password")
)

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={"temperature": 0, "response_format": {"type": "json_object"}},
)
embedder = OpenAIEmbeddings()   # OPENAI_API_KEY from env

pipeline = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    embedder=embedder,
    entities=entities,          # from Step 1
    relations=relations,
    patterns=patterns,
    from_file=True,             # False → pass text= instead of file_path=
    on_error="IGNORE",          # RAISE to surface extraction failures
    perform_entity_resolution=True,
    neo4j_database="neo4j",     # omit to use default
)
```

LLM alternatives (same interface):

  • `AnthropicLLM(model_name="claude-3-5-sonnet-20241022")`
  • `VertexAILLM(model_name="gemini-1.5-pro-002")`
  • `OllamaLLM(model_name="llama3")` — local; no API key needed

Step 3 — Run the Pipeline

From PDF or Markdown file:

```python
result = asyncio.run(pipeline.run_async(
    file_path="report.pdf",
    document_metadata={"source": "Q4 report", "year": 2025},
))
```

From raw text:

```python
result = asyncio.run(pipeline.run_async(
    text=document_text,
))
```

Batch — process multiple files:

```python
async def ingest_all(paths):
    for p in paths:
        await pipeline.run_async(file_path=str(p))

asyncio.run(ingest_all(list(pdf_dir.glob("*.pdf"))))
```

`document_metadata` dict is stored as properties on the `:Document` node.
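Neo4j node properties must be primitives or arrays of primitives, so deeply nested metadata cannot be stored as-is. A hypothetical pre-flattening helper (the name and dot-separator convention are assumptions, not pipeline API):

```python
def flatten_metadata(metadata, prefix="", sep="."):
    """Flatten nested dicts into dot-separated keys; keep primitives and
    primitive arrays as-is, stringify anything else."""
    flat = {}
    for key, value in metadata.items():
        full_key = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_metadata(value, prefix=full_key, sep=sep))
        elif isinstance(value, list) and all(
            isinstance(v, (str, int, float, bool)) for v in value
        ):
            flat[full_key] = value   # arrays of primitives are valid properties
        elif isinstance(value, (str, int, float, bool)) or value is None:
            flat[full_key] = value
        else:
            flat[full_key] = str(value)
    return flat
```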

---

Step 4 — Chunking Configuration

Default splitter: `FixedSizeSplitter(chunk_size=300, chunk_overlap=50)`.
```python
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

splitter = FixedSizeSplitter(
    chunk_size=512,       # tokens; 300–512 typical for GPT-4o
    chunk_overlap=50,     # ~10% of chunk_size; preserves boundary context
    approximate=True,     # respect sentence/word boundaries when possible
)

pipeline = SimpleKGPipeline(
    ...,
    text_splitter=splitter,
)
```
Chunking guidance:

| Document type | chunk_size | chunk_overlap |
|---|---|---|
| Dense technical text | 256–512 | 50–80 |
| Narrative / news articles | 512–1024 | 80–128 |
| Legal / financial docs | 256–384 | 40–64 |

Rule: a chunk must fit within the LLM context for extraction and within embedding model limits. GPT-4o: 128k context; `text-embedding-3-small`: 8191 tokens. Never set chunk_size > 2048.
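To make the splitter mechanics concrete, here is a minimal character-based sketch of fixed-size splitting with overlap (the real `FixedSizeSplitter` works on tokens and can adjust boundaries when `approximate=True`; this sketch does neither):

```python
def fixed_size_chunks(text, chunk_size=300, chunk_overlap=50):
    """Slide a chunk_size window over text; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk repeats the tail of its predecessor, which is exactly the boundary-context preservation the overlap setting buys.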

Step 5 — Entity Resolution

Merge duplicate extracted entities after the pipeline run.

```python
from neo4j_graphrag.experimental.components.resolver import (
    SinglePropertyExactMatchResolver,   # identical name → merge
    FuzzyMatchResolver,                  # Levenshtein similarity; needs rapidfuzz
    SpaCySemanticMatchResolver,          # cosine similarity; needs neo4j-graphrag[nlp]
)
```

Exact match (fastest; good baseline)

```python
resolver = SinglePropertyExactMatchResolver(driver)
asyncio.run(resolver.run())
```

Fuzzy match (handles typos / alternate spellings)

```python
from neo4j_graphrag.experimental.components.resolver import FuzzyMatchResolver

resolver = FuzzyMatchResolver(driver, threshold=0.9)
asyncio.run(resolver.run())
```

Scope resolution to specific labels only:

```python
resolver = SinglePropertyExactMatchResolver(
    driver,
    filter_query="WHERE n:Organization OR n:Person",
)
asyncio.run(resolver.run())
```

Run resolvers after ingestion, not inline — bulk merges are faster.
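To illustrate what threshold-based fuzzy resolution does, here is a standalone sketch using stdlib `difflib` (the real `FuzzyMatchResolver` uses rapidfuzz scoring and merges nodes in the database; this only groups candidate names):

```python
from difflib import SequenceMatcher

def group_similar_names(names, threshold=0.9):
    """Greedily group names whose similarity to a group's first member meets the threshold."""
    groups = []
    for name in names:
        for group in groups:
            ratio = SequenceMatcher(None, name.lower(), group[0].lower()).ratio()
            if ratio >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

Every multi-member group corresponds to a set of entity nodes the resolver would merge into one.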

---

Resulting Graph Structure

Pipeline always produces this lexical graph layer:

```
(:Document {id, fileName, status, ...metadata})
    -[:HAS_CHUNK]->
(:Chunk {id, text, index, embedding, ...})
    -[:NEXT_CHUNK]->          ← linked list for ordered traversal
(:Chunk {...})

(:Chunk)-[:FROM_DOCUMENT]->(:Document)   ← back-pointer
```

Entity extraction adds:

```
(:Chunk)-[:MENTIONS]->(:Person {name, ...})
(:Chunk)-[:MENTIONS]->(:Organization {name, ...})
(:Person)-[:WORKS_AT]->(:Organization)
```

Verify after ingestion:

```cypher
CYPHER 25
MATCH (d:Document)-[:HAS_CHUNK]->(c:Chunk)
RETURN d.fileName, count(c) AS chunks LIMIT 10;

MATCH (c:Chunk)-[:MENTIONS]->(e)
RETURN labels(e)[0] AS type, count(*) AS cnt ORDER BY cnt DESC LIMIT 20;
```

LLM Graph Builder (No-Code UI)

Use when: non-developers need to ingest docs; rapid prototyping; no Python environment.

Local (Docker):

```bash
git clone https://github.com/neo4j-labs/llm-graph-builder
cd llm-graph-builder
# Set OPENAI_API_KEY (or other provider keys) in .env
docker-compose up
```

Supported sources: PDF, plain text, Markdown, images, web pages, YouTube transcripts, S3/GCS bucket uploads.

LLM providers: OpenAI, Gemini, Claude, Llama3, Diffbot, Qwen.

Limitations: best with long-form English text; poor on tabular data (use `neo4j-import-skill` for CSV/Excel); visual diagrams not extracted.

---


APOC JSON Ingestion (Semi-Structured)

Use when the source is JSON from REST APIs, S3, or file exports.

```cypher
CYPHER 25
CALL apoc.load.json("https://example.com/articles.json") YIELD value
UNWIND value.articles AS article
CALL (article) {
  MERGE (d:Document {id: article.id})
  SET d.title = article.title, d.url = article.url, d.publishedAt = article.publishedAt
  FOREACH (tag IN article.tags |
    MERGE (t:Tag {name: tag})
    MERGE (d)-[:HAS_TAG]->(t)
  )
} IN TRANSACTIONS OF 1000 ROWS
```

Local file: `apoc.load.json("file:///import/data.json")`. The file must be in `$NEO4J_HOME/import/` or the APOC allowlist configured.

Check APOC is available: `RETURN apoc.version()`. APOC is included on all Aura tiers.
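The decision table earlier also lists Python + UNWIND for JSON. A minimal sketch of that route, with a query mirroring the APOC example and a batching helper so each transaction stays small (the driver is injected, so the batching logic is testable without a database):

```python
import json

UNWIND_QUERY = """
UNWIND $articles AS article
MERGE (d:Document {id: article.id})
SET d.title = article.title, d.url = article.url
FOREACH (tag IN article.tags |
  MERGE (t:Tag {name: tag})
  MERGE (d)-[:HAS_TAG]->(t)
)
"""

def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def ingest_articles(driver, payload, batch_size=1000):
    """Parse a JSON payload shaped like {"articles": [...]} and ingest it batch by batch."""
    articles = json.loads(payload)["articles"]
    for batch in batched(articles, batch_size):
        driver.execute_query(UNWIND_QUERY, articles=batch)
```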

LangChain Integration Pattern

```python
from langchain_community.graphs import Neo4jGraph
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from neo4j import GraphDatabase

url = "neo4j+s://xxxx.databases.neo4j.io"

graph = Neo4jGraph(
    url=url,
    username="neo4j",
    password="password",
)

loader = PyPDFLoader("report.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)

embedder = OpenAIEmbeddings()
driver = GraphDatabase.driver(url, auth=("neo4j", "password"))

for i, chunk in enumerate(chunks):
    emb = embedder.embed_query(chunk.page_content)
    driver.execute_query(
        """
        MERGE (doc:Document {id: $doc_id})
        SET doc.source = $source
        CREATE (c:Chunk {id: $chunk_id, text: $text, embedding: $emb, index: $idx})
        CREATE (doc)-[:HAS_CHUNK]->(c)
        """,
        doc_id=chunk.metadata.get("source", "unknown"),
        source=chunk.metadata.get("source"),
        chunk_id=f"chunk-{i}",
        text=chunk.page_content,
        emb=emb,
        idx=i,
    )
```

For entity extraction with LangChain: use `LLMGraphTransformer` (from `langchain_experimental.graph_transformers`). Produces the same `:Document`/`:Chunk`/entity pattern.

Constraints and Indexes (Run Before Ingestion)

```cypher
CYPHER 25
// Prevent duplicate documents
CREATE CONSTRAINT doc_id_unique IF NOT EXISTS
  FOR (d:Document) REQUIRE d.id IS UNIQUE;

// Prevent duplicate chunks
CREATE CONSTRAINT chunk_id_unique IF NOT EXISTS
  FOR (c:Chunk) REQUIRE c.id IS UNIQUE;

// Entity deduplication
CREATE CONSTRAINT person_name_unique IF NOT EXISTS
  FOR (p:Person) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT org_name_unique IF NOT EXISTS
  FOR (o:Organization) REQUIRE o.name IS UNIQUE;

// Vector index for chunk embeddings (adjust dims for your model)
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
  FOR (c:Chunk) ON c.embedding
  OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};
```

Do not start ingestion until all indexes are ONLINE:

```cypher
SHOW INDEXES YIELD name, state WHERE state <> 'ONLINE';
```

If rows are returned: wait, then re-run. An empty result (everything ONLINE) means it is safe to ingest.
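The polling advice above can be scripted. A hypothetical helper that repeatedly evaluates `SHOW INDEXES YIELD name, state` rows via an injected `fetch_rows` callable (with a live driver, `fetch_rows` would run that query and return `(name, state)` tuples):

```python
import time

def wait_for_indexes(fetch_rows, poll_seconds=2.0, timeout=300.0):
    """Block until every index reports ONLINE, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        pending = [name for name, state in fetch_rows() if state != "ONLINE"]
        if not pending:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(f"indexes still not ONLINE: {pending}")
        time.sleep(poll_seconds)
```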

Common Errors

| Error | Cause | Fix |
|---|---|---|
| LLM extracts node types not in schema | Schema too loose or `"EXTRACTED"` mode | Define explicit `entities` + `patterns`; use Option B schema |
| `MissingEmbedderError` | `embedder=` omitted | Always pass `embedder=` even without vector search; the pipeline stores embeddings on Chunk nodes |
| Zero entities extracted | LLM context overflow | Reduce `chunk_size`; switch to a model with a larger context |
| Duplicate entity nodes after ingestion | Entity resolution not run | Run `SinglePropertyExactMatchResolver` after bulk ingest |
| `apoc.load.json` permission denied | APOC allowlist not configured | Set `apoc.import.file.enabled=true` and `dbms.security.allow_csv_import_from_file_urls=true` |
| Chunking cuts sentences mid-way | `approximate=False` (default) cuts at exact token count | Set `approximate=True` in `FixedSizeSplitter` |
| `chunk_size` too large → LLM timeouts | Extraction prompt + chunk exceeds context | Keep `chunk_size` ≤ 512 for GPT-4o extraction; ≤ 2048 absolute max |
| `SpaCySemanticMatchResolver` fails on Python 3.14 | spaCy not supported on 3.14+ | Use `FuzzyMatchResolver` or downgrade to Python 3.13 |
| `neo4j-driver` package not found | Deprecated package name since 6.0 | Use the `neo4j` package: `pip install neo4j>=6.0.0` |

Verification Checklist

  • Constraints created and ONLINE before ingestion starts
  • Vector index created before storing embeddings
  • `chunk_size` within embedding model limit (≤2048; ≤512 for extraction)
  • `chunk_overlap` set to 10–15% of chunk_size
  • `(:Document)-[:HAS_CHUNK]->(:Chunk)` pattern used (enables graph traversal in retrieval)
  • `document_metadata` populated with source identifier
  • Entity resolver run after bulk ingestion
  • `apoc.version()` confirmed if using `apoc.load.json`
  • `.env` has API keys; `.env` in `.gitignore`
  • Verify structure: `MATCH (d:Document)-[:HAS_CHUNK]->(c:Chunk) RETURN count(c)`
  • Verify entities: `MATCH (c:Chunk)-[:MENTIONS]->(e) RETURN labels(e)[0], count(*)`

GraphSchema — Current API (≥1.7.1)

`entities` / `relations` / `potential_schema` are deprecated since 1.7.1. Use `schema=GraphSchema(...)`:

```python
from neo4j_graphrag.experimental.components.schema import (
    GraphSchema, NodeType, RelationshipType, PropertyType
)

schema = GraphSchema(
    node_types=[
        NodeType(label="Person", properties=[PropertyType(name="name", type="STRING")]),
        NodeType(label="Organization", properties=[PropertyType(name="name", type="STRING")]),
    ],
    relationship_types=[RelationshipType(label="WORKS_AT")],
    patterns=[("Person", "WORKS_AT", "Organization")],
)
pipeline = SimpleKGPipeline(llm=llm, driver=driver, embedder=embedder, schema=schema)
```

`schema="FREE"` (no guidance) or `schema="EXTRACTED"` (LLM infers) — exploration only; noisier output.

LexicalGraphConfig — Customize Labels

Override default lexical layer labels (keep defaults unless integrating with an existing graph). All fields have sensible defaults; only override what differs from your graph's conventions.

```python
from neo4j_graphrag.experimental.components.types import LexicalGraphConfig

config = LexicalGraphConfig(
    document_node_label="Article",                 # default: "Document"
    chunk_node_label="Passage",                    # default: "Chunk"
    node_to_chunk_relationship_type="HAS_ENTITY",  # default: "MENTIONS"
    chunk_text_property="content",                 # default: "text"
)
pipeline = SimpleKGPipeline(..., lexical_graph_config=config)
```

---

Custom Document Loaders

Default `file_loader` auto-dispatches by extension (`.pdf` → `PdfLoader`, `.md` → `MarkdownLoader`). Supports fsspec URIs (`s3://`, `gcs://`). Subclass `DataLoader` for HTML/web/custom formats:

```python
from neo4j_graphrag.experimental.components.data_loader import DataLoader
from neo4j_graphrag.experimental.components.types import DocumentInfo, LoadedDocument

class WebPageLoader(DataLoader):
    async def run(self, filepath, metadata=None):
        import httpx
        text = httpx.get(filepath).text   # strip HTML in a real implementation
        return LoadedDocument(text=text,
            document_info=DocumentInfo(path=filepath, metadata=metadata))

pipeline = SimpleKGPipeline(..., file_loader=WebPageLoader(), from_file=True)
```
Chunking strategy by use-case and full resolver config: references/kg-construction.md.
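The loader sketch above leaves HTML stripping to the real implementation. A stdlib-only version using `html.parser` that such a loader could call (a dedicated extraction library would do better on real pages):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def strip_html(markup):
    extractor = _TextExtractor()
    extractor.feed(markup)
    return " ".join(extractor.parts)
```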

References
