neo4j-document-import-skill
# Neo4j Document Import Skill
## When to Use

- Ingesting PDFs, HTML, plain text, Markdown into Neo4j as a knowledge graph
- Chunking documents and storing `:Chunk` nodes with embeddings
- Extracting entities and relationships from text with an LLM
- Using `SimpleKGPipeline` (neo4j-graphrag) programmatically
- Using Neo4j LLM Graph Builder (no-code web UI)
- Loading semi-structured JSON via `apoc.load.json`
- Connecting LangChain or LlamaIndex document loaders to Neo4j
## When NOT to Use

- Structured CSV / relational data → `neo4j-import-skill`
- GraphRAG retrieval after ingestion → `neo4j-graphrag-skill`
- Vector index creation → `neo4j-vector-search-skill`
- Cypher query writing → `neo4j-cypher-skill`
## Approach Decision Table

| Situation | Approach |
|---|---|
| No code; drag-and-drop UX wanted | LLM Graph Builder web UI |
| Programmatic pipeline; PDFs/text | `SimpleKGPipeline` |
| JSON / REST API responses | `apoc.load.json` |
| LangChain already in stack | LangChain loaders + `Neo4jGraph` |
| LlamaIndex already in stack | LlamaIndex document loaders |
| Chunk-only (no entity extraction) | Manual chunking + MERGE pattern |
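The "Manual chunking + MERGE pattern" row needs no pipeline at all. A minimal character-based sketch (the library's `FixedSizeSplitter` works on tokens; this illustration counts characters, and the dict keys are chosen to mirror typical MERGE parameters like `$chunk_id`/`$text`/`$idx`):

```python
def fixed_size_chunks(text: str, chunk_size: int = 300, overlap: int = 50) -> list[dict]:
    """Split text into overlapping fixed-size character windows."""
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    chunks = []
    for idx, start in enumerate(range(0, len(text), step)):
        # each dict maps directly onto MERGE query parameters
        chunks.append({"chunk_id": f"chunk-{idx}", "text": text[start:start + chunk_size], "index": idx})
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each dict can then be passed as parameters to a `MERGE`/`CREATE` statement via the driver; the id format and key names here are illustrative, not an API contract.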
## Install

```bash
pip install neo4j-graphrag             # includes SimpleKGPipeline
pip install neo4j-graphrag[openai]     # + OpenAI LLM/embedder
pip install neo4j-graphrag[anthropic]  # + Anthropic Claude
pip install neo4j-graphrag[google]     # + Vertex AI / Gemini
```

spaCy entity resolver (Python <= 3.13 only — unsupported on 3.14+):

```bash
pip install neo4j-graphrag[nlp]
```

Requires: `neo4j>=6.0.0`, Python>=3.10, Neo4j>=5.18.1 (Aura>=5.18.0).
---

## Step 1 — Define Graph Schema

Schema controls what the LLM extracts. Define it before pipeline construction.

### Option A — Simple string lists (LLM infers descriptions)

```python
entities = ["Person", "Organization", "Location", "Product", "Event"]
relations = ["WORKS_AT", "LOCATED_IN", "KNOWS", "MENTIONS", "PART_OF"]
patterns = [
    ("Person", "WORKS_AT", "Organization"),
    ("Organization", "LOCATED_IN", "Location"),
    ("Person", "KNOWS", "Person"),
    ("Article", "MENTIONS", "Organization"),
]
```
### Option B — Rich schema (better extraction quality)

```python
from neo4j_graphrag.experimental.components.schema import (
    SchemaBuilder, SchemaEntity, SchemaRelation
)

schema = SchemaBuilder().create_schema_from_dict({
    "entities": {
        "Person": {"description": "A human individual", "properties": {"name": "str", "role": "str"}},
        "Organization": {"description": "A company or institution", "properties": {"name": "str", "industry": "str"}},
    },
    "relations": {
        "WORKS_AT": {"description": "Employment relationship"},
    },
    "patterns": [("Person", "WORKS_AT", "Organization")],
})
```
### Option C — Auto-extract schema from text (no constraints)

```python
schema = "EXTRACTED"  # LLM infers types; noisier output
schema = "FREE"       # no schema guidance; most noise
```

Use Option B for production; Option A for prototyping; `"EXTRACTED"` only for exploration.
---

## Step 2 — SimpleKGPipeline Setup

```python
import asyncio
from neo4j import GraphDatabase
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings

driver = GraphDatabase.driver(
    "neo4j+s://xxxx.databases.neo4j.io",
    auth=("neo4j", "password")
)

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={"temperature": 0, "response_format": {"type": "json_object"}},
)

embedder = OpenAIEmbeddings()  # OPENAI_API_KEY from env

pipeline = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    embedder=embedder,
    entities=entities,        # from Step 1
    relations=relations,
    patterns=patterns,
    from_file=True,           # False → pass text= instead of file_path=
    on_error="IGNORE",        # RAISE to surface extraction failures
    perform_entity_resolution=True,
    neo4j_database="neo4j",   # omit to use default
)
```

LLM alternatives (same interface):

- `AnthropicLLM(model_name="claude-3-5-sonnet-20241022")`
- `VertexAILLM(model_name="gemini-1.5-pro-002")`
- `OllamaLLM(model_name="llama3")` — local; no API key needed
## Step 3 — Run the Pipeline

From PDF or Markdown file:

```python
result = asyncio.run(pipeline.run_async(
    file_path="report.pdf",
    document_metadata={"source": "Q4 report", "year": 2025},
))
```

From raw text:

```python
result = asyncio.run(pipeline.run_async(
    text=document_text,
))
```

Batch — process multiple files:

```python
async def ingest_all(paths):
    for p in paths:
        await pipeline.run_async(file_path=str(p))

asyncio.run(ingest_all(list(pdf_dir.glob("*.pdf"))))
```

The `document_metadata` dict is stored as properties on the `:Document` node.
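The batch loop above runs files strictly one after another. If your LLM endpoint tolerates parallel calls, a semaphore-bounded variant can overlap requests; a sketch (the `run_one` callable and the concurrency limit of 4 are assumptions to tune against your rate limits, with `pipeline.run_async` plugged in for real use):

```python
import asyncio

async def ingest_all_bounded(paths, run_one, max_concurrency=4):
    """Run run_one(path) for every path, with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(path):
        async with sem:
            return await run_one(path)

    # gather preserves input order in its results
    return await asyncio.gather(*(guarded(p) for p in paths))
```

Usage would look like `asyncio.run(ingest_all_bounded(paths, lambda p: pipeline.run_async(file_path=str(p))))`.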
---

## Step 4 — Chunking Configuration

Default splitter: `FixedSizeSplitter(chunk_size=300, chunk_overlap=50)`.

```python
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

splitter = FixedSizeSplitter(
    chunk_size=512,     # tokens; 300–512 typical for GPT-4o
    chunk_overlap=50,   # ~10% of chunk_size; preserves boundary context
    approximate=True,   # respect sentence/word boundaries when possible
)

pipeline = SimpleKGPipeline(
    ...,
    text_splitter=splitter,
)
```

Chunking guidance:

| Document type | chunk_size | chunk_overlap |
|---|---|---|
| Dense technical text | 256–512 | 50–80 |
| Narrative / news articles | 512–1024 | 80–128 |
| Legal / financial docs | 256–384 | 40–64 |

Rule: a chunk must fit within the LLM context for extraction and within embedding model limits. GPT-4o: 128k context; `text-embedding-3-small`: 8191 tokens. Never set chunk_size > 2048.
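These limits can be encoded as a pre-flight check before constructing the splitter. A sketch; the 8191 and 2048 ceilings come from the rule above, while the ~25% overlap cap is an extra assumption derived loosely from the 10–15% guidance:

```python
def check_chunk_config(chunk_size: int, chunk_overlap: int,
                       embed_limit: int = 8191, hard_max: int = 2048) -> list[str]:
    """Return a list of violations of the chunking rules (empty list = config looks OK)."""
    problems = []
    if chunk_size > hard_max:
        problems.append(f"chunk_size {chunk_size} exceeds hard max {hard_max}")
    if chunk_size > embed_limit:
        problems.append(f"chunk_size {chunk_size} exceeds embedding limit {embed_limit}")
    if not (0 < chunk_overlap < chunk_size):
        problems.append("chunk_overlap must be between 0 and chunk_size")
    elif chunk_overlap > chunk_size // 4:
        problems.append("chunk_overlap above ~25% of chunk_size wastes tokens")
    return problems
```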
## Step 5 — Entity Resolution

Merge duplicate extracted entities after the pipeline run.

```python
from neo4j_graphrag.experimental.components.resolver import (
    SinglePropertyExactMatchResolver,  # identical name → merge
    FuzzyMatchResolver,                # Levenshtein similarity; needs rapidfuzz
    SpaCySemanticMatchResolver,        # cosine similarity; needs neo4j-graphrag[nlp]
)
```

Exact match (fastest; good baseline):

```python
resolver = SinglePropertyExactMatchResolver(driver)
asyncio.run(resolver.run())
```

Fuzzy match (handles typos / alternate spellings):

```python
resolver = FuzzyMatchResolver(driver, threshold=0.9)
asyncio.run(resolver.run())
```

Scope resolution to specific labels only:

```python
resolver = SinglePropertyExactMatchResolver(
    driver,
    filter_query="WHERE n:Organization OR n:Person",
)
asyncio.run(resolver.run())
```

Run resolvers after ingestion, not inline — bulk merges are faster.
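To get a feel for what `threshold=0.9` means, here is a stdlib illustration using difflib's similarity ratio. The real `FuzzyMatchResolver` computes Levenshtein similarity via rapidfuzz and merges nodes in the database; this sketch only pairs up candidate strings:

```python
from difflib import SequenceMatcher

def fuzzy_duplicates(names, threshold=0.9):
    """Pair up names whose case-insensitive similarity ratio meets the threshold."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs
```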
---

## Resulting Graph Structure

The pipeline always produces this lexical graph layer:

```
(:Document {id, fileName, status, ...metadata})
  -[:HAS_CHUNK]->
(:Chunk {id, text, index, embedding, ...})
  -[:NEXT_CHUNK]->        ← linked list for ordered traversal
(:Chunk {...})

(:Chunk)-[:FROM_DOCUMENT]->(:Document)   ← back-pointer
```

Entity extraction adds:

```
(:Chunk)-[:MENTIONS]->(:Person {name, ...})
(:Chunk)-[:MENTIONS]->(:Organization {name, ...})
(:Person)-[:WORKS_AT]->(:Organization)
```

Verify after ingestion:

```cypher
CYPHER 25
MATCH (d:Document)-[:HAS_CHUNK]->(c:Chunk)
RETURN d.fileName, count(c) AS chunks LIMIT 10;

MATCH (c:Chunk)-[:MENTIONS]->(e)
RETURN labels(e)[0] AS type, count(*) AS cnt ORDER BY cnt DESC LIMIT 20;
```
---

## LLM Graph Builder (No-Code UI)

Use when: non-developers need to ingest docs; rapid prototyping; no Python environment.

Local (Docker):

```bash
git clone https://github.com/neo4j-labs/llm-graph-builder
cd llm-graph-builder
# set OPENAI_API_KEY (or other provider keys) in .env
docker-compose up
```

Opens at http://localhost:8080

Supported sources: PDF, plain text, Markdown, images, web pages, YouTube transcripts, S3/GCS bucket uploads.
LLM providers: OpenAI, Gemini, Claude, Llama3, Diffbot, Qwen.
Limitations: best with long-form English text; poor on tabular data (use `neo4j-import-skill` for CSV/Excel); visual diagrams not extracted.
---

## APOC JSON Ingestion (Semi-Structured)

Use when the source is JSON from REST APIs, S3, or file exports.

```cypher
CYPHER 25
CALL apoc.load.json("https://example.com/articles.json") YIELD value
UNWIND value.articles AS article
CALL (article) {
  MERGE (d:Document {id: article.id})
  SET d.title = article.title, d.url = article.url, d.publishedAt = article.publishedAt
  FOREACH (tag IN article.tags |
    MERGE (t:Tag {name: tag})
    MERGE (d)-[:HAS_TAG]->(t)
  )
} IN TRANSACTIONS OF 1000 ROWS
```

Local file: `apoc.load.json("file:///import/data.json")`. The file must be in `$NEO4J_HOME/import/` or the APOC allowlist configured accordingly.

Check that APOC is available: `RETURN apoc.version()`. APOC is included on all Aura tiers.
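When `apoc.load.json` is not an option (for example, the URL cannot be allowlisted), the same shape can be pushed from the Python driver with `UNWIND $rows AS article ...`. A client-side sketch of batching that mirrors `IN TRANSACTIONS OF 1000 ROWS`; the field subset pulled from each article is illustrative:

```python
import json

def article_batches(raw_json: str, batch_size: int = 1000):
    """Yield lists of parameter dicts, sized like IN TRANSACTIONS OF 1000 ROWS."""
    articles = json.loads(raw_json)["articles"]
    for i in range(0, len(articles), batch_size):
        yield [
            {"id": a["id"], "title": a.get("title"), "tags": a.get("tags", [])}
            for a in articles[i:i + batch_size]
        ]
```

Each yielded batch would then be sent once, e.g. `driver.execute_query("UNWIND $rows AS article MERGE (d:Document {id: article.id}) ...", rows=batch)`.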
---

## LangChain Integration Pattern

```python
from langchain_community.graphs import Neo4jGraph
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from neo4j import GraphDatabase

graph = Neo4jGraph(
    url="neo4j+s://xxxx.databases.neo4j.io",
    username="neo4j",
    password="password",
)

loader = PyPDFLoader("report.pdf")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)

embedder = OpenAIEmbeddings()
driver = GraphDatabase.driver(url, auth=("neo4j", "password"))

for i, chunk in enumerate(chunks):
    emb = embedder.embed_query(chunk.page_content)
    driver.execute_query(
        """
        MERGE (doc:Document {id: $doc_id})
        SET doc.source = $source
        CREATE (c:Chunk {id: $chunk_id, text: $text, embedding: $emb, index: $idx})
        CREATE (doc)-[:HAS_CHUNK]->(c)
        """,
        doc_id=chunk.metadata.get("source", "unknown"),
        source=chunk.metadata.get("source"),
        chunk_id=f"chunk-{i}",
        text=chunk.page_content,
        emb=emb,
        idx=i,
    )
```

For entity extraction with LangChain: use `LLMGraphTransformer` (from `langchain_experimental.graph_transformers`). Produces the same `:Document`/`:Chunk`/entity pattern.
## Constraints and Indexes (Run Before Ingestion)
```cypher
CYPHER 25
// Prevent duplicate documents
CREATE CONSTRAINT doc_id_unique IF NOT EXISTS
FOR (d:Document) REQUIRE d.id IS UNIQUE;

// Prevent duplicate chunks
CREATE CONSTRAINT chunk_id_unique IF NOT EXISTS
FOR (c:Chunk) REQUIRE c.id IS UNIQUE;

// Entity deduplication
CREATE CONSTRAINT person_name_unique IF NOT EXISTS
FOR (p:Person) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT org_name_unique IF NOT EXISTS
FOR (o:Organization) REQUIRE o.name IS UNIQUE;

// Vector index for chunk embeddings (adjust dims for your model)
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
FOR (c:Chunk) ON c.embedding
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};
```

Do not start ingestion until all indexes are ONLINE:

```cypher
SHOW INDEXES YIELD name, state WHERE state <> 'ONLINE';
```

If rows are returned: wait, then re-run. ONLINE = safe to ingest.
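The polling step can be scripted with a generic helper; a sketch where, in real use, `check` would run the `SHOW INDEXES` query through the driver and return True once no rows come back:

```python
import time

def wait_until(check, timeout=120, interval=2):
    """Poll check() until it returns True or the timeout expires; returns the final status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```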
## Common Errors

| Error | Cause | Fix |
|---|---|---|
| LLM extracts node types not in schema | Schema too loose or `"EXTRACTED"` | Define explicit `entities`/`patterns` |
| Pipeline fails when `embedder` is omitted | `embedder` is required | Always pass `embedder`, even without vector search |
| Zero entities extracted | LLM context overflow | Reduce `chunk_size` |
| Duplicate entity nodes after ingestion | Entity resolution not run | Run a resolver after bulk import |
| `apoc.load.json` fails on a URL | APOC allowlist not configured | Add the URL to the APOC allowlist |
| Chunking loses sentence mid-way | `approximate=False` | Set `approximate=True` |
| Truncated or failed extraction output | Extraction prompt + chunk exceeds context | Keep chunk_size ≤ 512 for GPT-4o extraction; ≤ 2048 absolute max |
| `neo4j-graphrag[nlp]` fails on Python 3.14 | spaCy not supported on 3.14+ | Use `FuzzyMatchResolver` instead of `SpaCySemanticMatchResolver` |
| `neo4j-driver` package not found | Deprecated package name since 6.0 | Use `neo4j` |
## Verification Checklist

- Constraints created and ONLINE before ingestion starts
- Vector index created before storing embeddings
- `chunk_size` within embedding model limit (≤2048; ≤512 for extraction)
- `chunk_overlap` set to 10–15% of chunk_size
- `Document`→`HAS_CHUNK`→`Chunk` pattern used (enables graph traversal in retrieval)
- `document_metadata` populated with source identifier
- Entity resolver run after bulk ingestion
- `apoc.version()` confirmed if using `apoc.load.json`
- `.env` has API keys; `.env` in `.gitignore`
- Verify structure: `MATCH (d:Document)-[:HAS_CHUNK]->(c:Chunk) RETURN count(c)`
- Verify entities: `MATCH (c:Chunk)-[:MENTIONS]->(e) RETURN labels(e)[0], count(*)`
## GraphSchema — Current API (≥1.7.1)

The `entities`, `relations`, and `potential_schema` arguments are superseded by a single `schema=GraphSchema(...)` argument:

```python
from neo4j_graphrag.experimental.components.schema import (
    GraphSchema, NodeType, RelationshipType, PropertyType
)

schema = GraphSchema(
    node_types=[
        NodeType(label="Person", properties=[PropertyType(name="name", type="STRING")]),
        NodeType(label="Organization", properties=[PropertyType(name="name", type="STRING")]),
    ],
    relationship_types=[RelationshipType(label="WORKS_AT")],
    patterns=[("Person", "WORKS_AT", "Organization")],
)

pipeline = SimpleKGPipeline(llm=llm, driver=driver, embedder=embedder, schema=schema)
```

The string shortcuts `schema="FREE"` and `schema="EXTRACTED"` are still accepted.
## LexicalGraphConfig — Customize Labels

Override the default lexical layer labels (keep defaults unless integrating with an existing graph):

```python
from neo4j_graphrag.experimental.components.types import LexicalGraphConfig

# All fields have sensible defaults — only override what differs from your graph's conventions
config = LexicalGraphConfig(
    document_node_label="Article",                 # default: "Document"
    chunk_node_label="Passage",                    # default: "Chunk"
    node_to_chunk_relationship_type="HAS_ENTITY",  # default: "MENTIONS"
    chunk_text_property="content",                 # default: "text"
)

pipeline = SimpleKGPipeline(..., lexical_graph_config=config)
```
---

## Custom Document Loaders

The default `file_loader` auto-dispatches by extension (`.pdf` → `PdfLoader`, `.md` → `MarkdownLoader`) and supports fsspec URIs (`s3://`, `gcs://`). Subclass `DataLoader` for HTML/web/custom formats:

```python
from neo4j_graphrag.experimental.components.data_loader import DataLoader
from neo4j_graphrag.experimental.components.types import DocumentInfo, LoadedDocument

class WebPageLoader(DataLoader):
    async def run(self, filepath, metadata=None):
        import httpx
        text = httpx.get(filepath).text  # strip HTML in a real implementation
        return LoadedDocument(text=text,
                              document_info=DocumentInfo(path=filepath, metadata=metadata))

pipeline = SimpleKGPipeline(..., file_loader=WebPageLoader(), from_file=True)
```

Chunking strategy by use-case and full resolver config: references/kg-construction.md.
## References

Load on demand: