neo4j-vector-index-skill
When to Use
- Creating a vector index (`CREATE VECTOR INDEX`) on nodes or relationships
- Running vector similarity / nearest-neighbor search
- Storing embeddings on graph nodes during ingestion
- Choosing similarity function, dimensions, HNSW params, or quantization
- Using the `SEARCH` clause (2026.01+) or `db.index.vector.queryNodes()` (2025.x)
- Batch-updating embeddings after a model change
- Combining vector results with the immediate graph neighborhood (full retrieval_query pipelines → `neo4j-graphrag-skill`)
When NOT to Use
- GraphRAG pipelines (VectorCypherRetriever, HybridCypherRetriever, retrieval_query) → `neo4j-graphrag-skill`
- Fulltext / keyword search (FULLTEXT INDEX, `db.index.fulltext.queryNodes`) → `neo4j-cypher-skill`
- GDS graph embeddings (FastRP, Node2Vec, GraphSAGE) → `neo4j-gds-skill`
- Index admin (list all indexes, drop range/text/lookup indexes) → `neo4j-cypher-skill`
Pre-flight — Determine Version
Drives syntax choice:

```cypher
CALL dbms.components() YIELD versions RETURN versions[0] AS neo4j_version
```

| Version | Use |
|---|---|
| 2026.01+ | `SEARCH` clause (preferred) |
| 2025.x | `db.index.vector.queryNodes()` procedure (deprecated 2026.04) |
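The version gate above can be sketched as a small helper that picks the query syntax from the reported version string. A minimal sketch; `choose_search_syntax` and `parse_calver` are illustrative names, and the cut-offs follow the table above:

```python
def parse_calver(version: str) -> tuple[int, int]:
    """Parse a Neo4j CalVer string like '2026.01.0' into (year, minor)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])

def choose_search_syntax(version: str) -> str:
    """Return which vector-search syntax to use for this server version."""
    year, minor = parse_calver(version)
    if (year, minor) >= (2026, 1):
        return "SEARCH"                   # SEARCH clause, preferred
    return "db.index.vector.queryNodes"   # procedure fallback (deprecated 2026.04)
```

Older semver releases such as `5.26.0` parse as `(5, 26)` and correctly fall through to the procedure branch.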
Step 1 — Create Vector Index
Node index (single label):

```cypher
CYPHER 25
CREATE VECTOR INDEX chunk_embedding IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine',
    `vector.quantization.enabled`: true,
    `vector.hnsw.m`: 16,
    `vector.hnsw.ef_construction`: 100
  }
}
```

Node index with filterable properties [2026.01+] — `WITH` declares which properties can be used in `SEARCH ... WHERE`:

```cypher
CYPHER 25
CREATE VECTOR INDEX chunk_embedding IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
WITH [c.source, c.lang, c.published_year] // stored as metadata; filterable in SEARCH WHERE
OPTIONS { indexConfig: { `vector.dimensions`: 1536, `vector.similarity_function`: 'cosine' } }
```

Multi-label index with filterable properties [2026.01+]:

```cypher
CYPHER 25
CREATE VECTOR INDEX doc_embedding IF NOT EXISTS
FOR (n:Document|Article) ON n.embedding
WITH [n.author, n.published_year, n.lang]
OPTIONS { indexConfig: { `vector.dimensions`: 1536, `vector.similarity_function`: 'cosine' } }
```

Relationship index:

```cypher
CYPHER 25
CREATE VECTOR INDEX rel_embedding IF NOT EXISTS
FOR ()-[r:HAS_CHUNK]-() ON (r.embedding)
OPTIONS { indexConfig: { `vector.dimensions`: 768, `vector.similarity_function`: 'cosine' } }
```

Supported `WITH` property types: `INTEGER`, `FLOAT`, `STRING`, `BOOLEAN`, `DATE`, `ZONED DATETIME`, `LOCAL DATETIME`, `ZONED TIME`, `LOCAL TIME`, `DURATION`, `LIST`, `POINT`.

Index config reference:
| Parameter | Type | Default | Notes |
|---|---|---|---|
| `vector.dimensions` | INTEGER 1–4096 | none | Required; must match embedding model exactly |
| `vector.similarity_function` | STRING | `'cosine'` | `'cosine'` or `'euclidean'` |
| `vector.quantization.enabled` | BOOLEAN | `true` | Reduces storage; slight accuracy tradeoff; needs vector-2.0+ (5.18+) |
| `vector.hnsw.m` | INTEGER 1–512 | `16` | HNSW graph connections; higher = better recall, more memory |
| `vector.hnsw.ef_construction` | INTEGER 1–3200 | `100` | Build-time candidates; higher = better recall, slower build |
Similarity function choice:

| Use case | Function |
|---|---|
| Normalized embeddings (OpenAI, Cohere, Voyage, Google) | `'cosine'` |
| Unnormalized / raw distance matters | `'euclidean'` |
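The options in the tables above can be assembled programmatically. A minimal sketch under stated assumptions — `create_vector_index_cypher` is an illustrative helper of mine, not a driver API; it only builds the DDL string and enforces the documented ranges:

```python
def create_vector_index_cypher(
    name: str,
    label: str,
    prop: str,
    dimensions: int,
    similarity: str = "cosine",
    quantization: bool = True,
) -> str:
    """Assemble a CREATE VECTOR INDEX statement from the config-table values."""
    if not 1 <= dimensions <= 4096:
        raise ValueError(f"vector.dimensions must be 1-4096, got {dimensions}")
    if similarity not in ("cosine", "euclidean"):
        raise ValueError(f"unknown similarity function: {similarity}")
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS { indexConfig: { "
        f"`vector.dimensions`: {dimensions}, "
        f"`vector.similarity_function`: '{similarity}', "
        f"`vector.quantization.enabled`: {str(quantization).lower()} "
        "} }"
    )
```

Failing fast on an out-of-range dimension here is cheaper than debugging a silent mismatch after ingest.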
Step 2 — Wait for Index ONLINE
Index builds asynchronously — do NOT query until ONLINE:

```cypher
SHOW VECTOR INDEXES YIELD name, state, populationPercent
WHERE name = 'chunk_embedding'
RETURN name, state, populationPercent
```

Poll every 5s until `state = 'ONLINE'` and `populationPercent = 100.0`. If `state = 'FAILED'` → stop and check the logs.

Shell poll (cypher-shell):
```bash
until cypher-shell -u neo4j -p "$NEO4J_PASSWORD" \
  "SHOW VECTOR INDEXES YIELD name, state WHERE name='chunk_embedding' RETURN state" \
  | grep -q ONLINE; do
  sleep 5
done
```
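The same poll loop in Python, with the query runner injected as a callable so the logic can be exercised without a live server. `wait_for_index_online` and `run_query` are illustrative names of mine, not driver API; `run_query` is assumed to return the index's state string:

```python
import time

def wait_for_index_online(run_query, index_name: str,
                          poll_s: float = 5.0, timeout_s: float = 600.0) -> None:
    """Poll SHOW VECTOR INDEXES (via the injected run_query callable)
    until the index reports ONLINE; raise on FAILED or timeout."""
    query = (
        "SHOW VECTOR INDEXES YIELD name, state "
        f"WHERE name = '{index_name}' RETURN state"
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = run_query(query)  # expected to return e.g. 'POPULATING' or 'ONLINE'
        if state == "ONLINE":
            return
        if state == "FAILED":
            raise RuntimeError(f"index {index_name} FAILED — check logs")
        time.sleep(poll_s)
    raise TimeoutError(f"index {index_name} not ONLINE after {timeout_s}s")
```

Injecting the runner keeps the loop testable with a fake before wiring in a real driver session.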
Step 3 — Ingest Embeddings
Batch UNWIND pattern (use for > 100 nodes — never one-node-per-transaction):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver(uri, auth=(user, password))

def embed_batch(texts: list[str]) -> list[list[float]]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return [r.embedding for r in response.data]

def store_embeddings(records: list[dict], batch_size: int = 500):
    expected_dim = 1536  # must match vector.dimensions
    texts = [r["text"] for r in records]
    embeddings = embed_batch(texts)
    for emb in embeddings:
        assert len(emb) == expected_dim, f"Dim mismatch: {len(emb)} != {expected_dim}"
    rows = [{"id": r["id"], "embedding": emb}
            for r, emb in zip(records, embeddings)]
    for i in range(0, len(rows), batch_size):
        driver.execute_query(
            "UNWIND $rows AS row MATCH (c:Chunk {id: row.id}) SET c.embedding = row.embedding",
            rows=rows[i:i+batch_size]
        )
```

❌ Never create the index after embeddings are already stored — always create the index first.
✅ Create index → poll ONLINE → ingest embeddings.
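The batching step of the pattern above can be isolated as a dependency-free helper, useful for the batch-update-after-model-change case too; `batched` is a local sketch of mine, not driver API, so the UNWIND payload sizes can be verified before touching the database:

```python
def batched(rows: list, batch_size: int) -> list[list]:
    """Split rows into UNWIND-sized chunks; the last chunk may be short."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

# Each batch then becomes one transaction, e.g.:
#   driver.execute_query(
#       "UNWIND $rows AS row MATCH (c:Chunk {id: row.id}) "
#       "SET c.embedding = row.embedding",
#       rows=batch)
```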
Step 4 — Run Vector Search
SEARCH clause (2026.01+, preferred)
```cypher
CYPHER 25
MATCH (c:Chunk)
SEARCH c IN (
  VECTOR INDEX chunk_embedding
  FOR $queryEmbedding
  LIMIT 10
) SCORE AS score
RETURN c.text, score
ORDER BY score DESC
```

With in-index filter [2026.01+] — properties must be declared in `WITH` at index creation:

```cypher
// Index must have been created with: WITH [c.source, c.lang, c.published_year]
CYPHER 25
MATCH (c:Chunk)
SEARCH c IN (
  VECTOR INDEX chunk_embedding
  FOR $queryEmbedding
  WHERE c.source = $source AND c.lang = 'en' AND c.published_year >= 2024
  LIMIT 10
) SCORE AS score
RETURN c.text, c.source, score
ORDER BY score DESC
```

Filtering strategy — choose one:

| Strategy | When to use | Tradeoff |
|---|---|---|
| In-index (`SEARCH ... WHERE`) | Filters on pre-declared `WITH` properties | Fast, consistent latency; properties must be declared upfront |
| Post-filter (MATCH + procedure) | Arbitrary Cypher predicates, graph traversal, OR/NOT | Full flexibility; may over-fetch then discard |
| Pre-filter (MATCH first, then SEARCH) | Small known candidate set; exact nearest-neighbor within subset | Deterministic; slow on large candidate sets |

In-index `WHERE` hard limits [2026.01+]:
- Property must be listed in `WITH [...]` at index creation — undeclared properties silently fall back to post-filtering
- AND predicates only — no OR, NOT, list ops, string ops
- Scalar types only: `BOOLEAN`, `INTEGER`, `FLOAT`, `STRING`, temporal types — not `VECTOR`/`LIST`/`POINT`
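The strategy table and hard limits can be read as a decision rule. An illustrative chooser (names are mine, not an API) that picks a strategy from the predicate's shape under the 2026.01+ semantics described above:

```python
def choose_filter_strategy(
    uses_or_or_not: bool,
    needs_graph_traversal: bool,
    all_props_declared_in_with: bool,
    candidate_set_small: bool,
) -> str:
    """Pick a vector-search filtering strategy per the table above."""
    if candidate_set_small:
        return "pre-filter"   # MATCH first; exact nearest-neighbor within the subset
    if uses_or_or_not or needs_graph_traversal or not all_props_declared_in_with:
        return "post-filter"  # fetch more via procedure, then filter in Cypher
    return "in-index"         # SEARCH ... WHERE on declared WITH properties
```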
Post-filter pattern (2025.x or arbitrary predicates)
```cypher
CYPHER 25
CALL db.index.vector.queryNodes('chunk_embedding', 50, $queryEmbedding)
YIELD node AS c, score
WHERE c.source = $source // post-filter: fetch more, then filter
RETURN c.text, score
ORDER BY score DESC LIMIT 10
```

Relationship index procedure:

```cypher
CYPHER 25
CALL db.index.vector.queryRelationships('rel_embedding', 5, $queryEmbedding)
YIELD relationship AS r, score
RETURN r.text, score
```

SEARCH clause hard limits (all versions):
- Index name cannot be a parameter (`$indexName` not allowed — use a literal string)
- Binding variable must come from the enclosing MATCH pattern
- Query vector cannot reference the binding variable
Step 5 — Combine with Graph Traversal (simple cases)
Vector search as entry point, then graph hop:

```cypher
CYPHER 25
MATCH (c:Chunk)
SEARCH c IN (
  VECTOR INDEX chunk_embedding
  FOR $queryEmbedding
  LIMIT 10
) SCORE AS score
MATCH (c)<-[:HAS_CHUNK]-(a:Article)
OPTIONAL MATCH (a)-[:MENTIONS]->(org:Organization)
RETURN c.text, a.title, score, collect(DISTINCT org.name) AS organizations
ORDER BY score DESC
```

For full retrieval_query pipelines, HybridCypherRetriever, or the `neo4j-graphrag` library → delegate to `neo4j-graphrag-skill`.
neo4j-graphragneo4j-graphrag-skill以向量搜索为入口,再进行图跳转:
cypher
CYPHER 25
MATCH (c:Chunk)
SEARCH c IN (
VECTOR INDEX chunk_embedding
FOR $queryEmbedding
LIMIT 10
) SCORE AS score
MATCH (c)<-[:HAS_CHUNK]-(a:Article)
OPTIONAL MATCH (a)-[:MENTIONS]->(org:Organization)
RETURN c.text, a.title, score, collect(DISTINCT org.name) AS organizations
ORDER BY score DESC如需完整retrieval_query管道、HybridCypherRetriever或库→请使用。
Embedding Provider Quick-Reference
| Provider / Model | Dimensions | Similarity | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | cosine | Default; reducible to 256–1536 via the API `dimensions` param (match `vector.dimensions`) |
| OpenAI text-embedding-3-large | 3072 | cosine | Reducible to 256–3072 |
| OpenAI text-embedding-ada-002 | 1536 | cosine | Legacy; prefer 3-small |
| Cohere embed-v3 (English) | 1024 | cosine | Use `input_type` at ingest |
| Voyage voyage-3-large | 1024 | cosine | High quality; needs `input_type` |
| Google text-embedding-004 | 768 | cosine | Via Vertex AI |
| Ollama nomic-embed-text | 768 | cosine | Local dev/testing |
| Ollama mxbai-embed-large | 1024 | cosine | Local; production-quality |
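A small lookup sketch built from the quick-reference table, handy for asserting `vector.dimensions` at index-creation time. The dict name and helper are mine and simply mirror the table; nothing here is a library API:

```python
# model name -> expected vector.dimensions, per the quick-reference table
MODEL_DIMS: dict[str, int] = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
    "embed-v3": 1024,
    "voyage-3-large": 1024,
    "text-embedding-004": 768,
    "nomic-embed-text": 768,
    "mxbai-embed-large": 1024,
}

def expected_dimensions(model: str) -> int:
    """Look up vector.dimensions for a known model; fail loudly otherwise."""
    try:
        return MODEL_DIMS[model]
    except KeyError:
        raise ValueError(f"unknown embedding model: {model!r}") from None
```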
Vector Functions
Ad-hoc similarity (not for kNN search — use the index for that):

```cypher
MATCH (a:Chunk {id: $id1}), (b:Chunk {id: $id2})
RETURN vector.similarity.cosine(a.embedding, b.embedding) AS sim
// vector.similarity.euclidean(a, b) — same signature, 0–1 range

// vector_distance (2025.10+) — metrics: EUCLIDEAN, EUCLIDEAN_SQUARED, MANHATTAN, COSINE, DOT, HAMMING
// Returns distance (lower = more similar, inverse of similarity)
RETURN vector_distance(a.embedding, b.embedding, 'COSINE') AS dist

// vector_dimension_count (2025.10+)
RETURN vector_dimension_count(n.embedding) AS dims

// vector_norm (2025.20+) — metrics: EUCLIDEAN, MANHATTAN
RETURN vector_norm(n.embedding, 'EUCLIDEAN') AS norm
```

Convert LIST to typed VECTOR:

```cypher
// vector(value, dimension, coordinateType)
// coordinateType: FLOAT64, FLOAT32, INTEGER8/16/32/64
WITH vector([1.0, 2.0, 3.0], 3, 'FLOAT32') AS v
RETURN vector_dimension_count(v)
```
Index Management
```cypher
// Show all vector indexes with config
SHOW VECTOR INDEXES YIELD name, state, populationPercent,
  labelsOrTypes, properties, indexConfig
RETURN name, state, populationPercent, labelsOrTypes, properties, indexConfig;

// Drop (node data unchanged — only the index structure is removed)
DROP INDEX chunk_embedding IF EXISTS;

// No ALTER VECTOR INDEX — to change dimensions or similarity function:
// 1. DROP INDEX old_index IF EXISTS
// 2. CREATE VECTOR INDEX new_index ... with new OPTIONS
// 3. Re-generate all embeddings with the new model
// 4. Poll until ONLINE
```
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Dimension mismatch on query or ingest | Stored embedding dim ≠ `vector.dimensions` | Fix embed generation; drop + recreate index with correct dim |
| Search returns incomplete results | Index still `POPULATING` | Poll until `ONLINE` |
| `CREATE VECTOR INDEX` not recognized | Neo4j < 5.11 | No vector index support below 5.11; upgrade |
| `SEARCH` clause not recognized | Neo4j < 2026.01 | Use `db.index.vector.queryNodes()` |
| `WHERE` predicate rejected inside SEARCH | In-index filter restriction | Move complex predicates to an outer WHERE after SEARCH |
| Zero results from a correct query | Wrong similarity function or all-zeros embedding | Verify with `vector.similarity.cosine()` |
| Score always 1.0 | All-zeros or identical vectors | Embedding generation failed; add a dimension assertion before ingest |
| Quantization option rejected | Provider vector-1.0 (Neo4j < 5.18) | Omit the quantization option or upgrade to 5.18+ |
Checklist
- `vector.dimensions` matches embedding model output exactly
- Vector index created before ingesting embeddings
- Similarity function chosen explicitly (`cosine` for normalized, `euclidean` for distance-based)
- Index polled to `state = 'ONLINE'` before first query
- Dimension validated on every embedding before ingest
- `SEARCH` clause on Neo4j >= 2026.01 (preferred); procedure fallback only on 2025.x (deprecated 2026.04)
- SEARCH `WHERE` uses AND-only predicates with scalar types
- Batch UNWIND pattern used for > 100 nodes
- If model changes: drop index → recreate with new dimensions → re-generate all embeddings
In-Cypher Embedding Generation — ai.text.embed() [2025.12]
Generate embeddings at query time without external Python code. Use `ai.text.embed()` — the current API since [2025.12]:

```cypher
// Syntax (requires CYPHER 25)
CYPHER 25
// ai.text.embed(resource :: STRING, provider :: STRING, configuration :: MAP) :: VECTOR
```

Provider strings are lowercase (`'openai'`, `'vertexai'`, `'bedrock-titan'`, `'azure-openai'`). Full provider config → `neo4j-genai-plugin-skill`.

Full query pattern — embed at query time, search immediately (procedure fallback for 2025.x):

```cypher
CYPHER 25
WITH ai.text.embed(
  "What are good open source projects",
  "openai",
  { token: $openaiKey, model: 'text-embedding-3-small' }) AS userEmbedding
CALL db.index.vector.queryNodes('chunk_embedding', 6, userEmbedding) // deprecated 2026.04
YIELD node AS c, score
RETURN c.text, score
ORDER BY score DESC
```

With SEARCH clause (2026.01+):

```cypher
CYPHER 25
WITH ai.text.embed("my query", "openai", { token: $openaiKey, model: 'text-embedding-3-small' }) AS userEmbedding
MATCH (c:Chunk)
SEARCH c IN (VECTOR INDEX chunk_embedding FOR userEmbedding LIMIT 6) SCORE AS score
RETURN c.text, score
ORDER BY score DESC
```

❌ Never pass the API key as a literal string in production — use a `$param` or `apoc.static.get()`.
✅ Use the `$openaiKey` parameter; inject via the driver params dict.

Rule: Use the same model at ingest time and query time — embeddings from different models are not comparable.

Deprecated (still works but do not use in new code):
- `genai.vector.encode()` [deprecated 2025.12] → use `ai.text.embed()`
- `genai.vector.encodeBatch()` [deprecated 2025.12] → use `CALL ai.text.embedBatch()`
- `genai.vector.listEncodingProviders()` [deprecated 2025.12] → use `CALL ai.text.embed.providers()`

For full `ai.text.*` reference (completion, structured output, chat, tokenization) → `neo4j-genai-plugin-skill`.
Cypher-Based Embedding Ingestion — db.create.setNodeVectorProperty
Set a vector property via Cypher (e.g. during a LOAD CSV or MERGE pipeline):

```cypher
LOAD CSV WITH HEADERS FROM 'https://example.com/data.csv' AS row
MERGE (q:Question {text: row.question})
WITH q, row
CALL db.create.setNodeVectorProperty(q, 'embedding', apoc.convert.fromJsonList(row.question_embedding))
```

Use `apoc.convert.fromJsonList()` when the embedding is already in CSV/JSON form as a string like `"[0.1,0.2,...]"` — it converts to `LIST<FLOAT>`.
For Python-generated embeddings, use the Python UNWIND batch pattern (Step 3) instead.
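If you preprocess such CSV rows in Python instead, the same string-to-list conversion is a stdlib one-liner; `parse_embedding` is an illustrative helper of mine that does client-side what `apoc.convert.fromJsonList()` does server-side, plus the dimension check from Step 3:

```python
import json

def parse_embedding(raw: str, expected_dim: int) -> list[float]:
    """Parse a JSON-array string like "[0.1,0.2,...]" into a float list,
    validating the dimension before it is sent to Neo4j."""
    values = json.loads(raw)
    if not isinstance(values, list):
        raise ValueError(f"expected a JSON array, got {type(values).__name__}")
    emb = [float(v) for v in values]
    if len(emb) != expected_dim:
        raise ValueError(f"dim mismatch: {len(emb)} != {expected_dim}")
    return emb
```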
Similarity Function — Extended Guidance
The table in Step 1 gives the basic rule. Additional guidance from course patterns:

Choose based on the training loss function:
- Check the embedding model docs — models trained with cosine loss → use `'cosine'`
- Models trained with L2/Euclidean loss → use `'euclidean'`
- When the docs are silent: default to `'cosine'` (all major hosted APIs use it)

Common pitfall — wrong similarity function:

❌ Created the index with 'euclidean' but the model outputs L2-normalized vectors
→ scores are mathematically correct but rankings differ from the expected cosine order
→ no error thrown; wrong results silently returned
✅ Verify: run vector.similarity.cosine(a.embedding, b.embedding) manually on known
similar pairs — the score should be > 0.9 for near-duplicate text

Sanity check query after index creation:
```cypher
MATCH (c:Chunk) WITH c LIMIT 2
WITH collect(c) AS nodes
RETURN vector.similarity.cosine(nodes[0].embedding, nodes[1].embedding) AS cosine_check,
       vector.similarity.euclidean(nodes[0].embedding, nodes[1].embedding) AS euclidean_check
```

If both return `null` → embeddings are not set. If cosine returns `1.0` → identical vectors (the embed call failed).
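The same sanity check can be run client-side before anything touches the database. A stdlib-only sketch (no numpy); `cosine_similarity` is an illustrative name, and the all-zeros guard mirrors the failure mode described above:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; near-duplicate texts should score > 0.9."""
    if len(a) != len(b):
        raise ValueError(f"dim mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("all-zeros embedding — generation likely failed")
    return dot / (norm_a * norm_b)
```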
Gotchas — Extended
| Gotcha | Detail | Fix |
|---|---|---|
| Index not ONLINE at ingest time | Inserting nodes before the index exists is valid — the index auto-populates. But querying during population returns incomplete results | Always poll `ONLINE` before the first query |
| Wrong dimensions — silent failure | Stored vector dim ≠ `vector.dimensions` | Assert the dimension on every embedding before ingest |
| Different models at ingest vs query | No error; cosine scores ~0.3–0.5 for clearly similar text | Use the same model string/version for both; store the model name as node metadata |
| Missing model at query | Provider misconfiguration breaks the encode call | Test the encode call standalone before wiring it into the pipeline |
| Large single-transaction ingest | One transaction for 10k nodes → OOM or timeout | Use the batched UNWIND pattern (Step 3) |
| Chunk overlap not set | Adjacent chunks with no overlap → context at boundaries lost → poor recall for cross-paragraph queries | Set an overlap in the chunking/splitter config |
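For the chunk-overlap gotcha, a minimal character-based splitter sketch; the function name and the size/overlap defaults are illustrative choices of mine, not values prescribed by this skill:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous one, so boundary context survives."""
    if not 0 <= overlap < size:
        raise ValueError("require 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Token-based splitters follow the same shape; only the unit of measurement changes.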
References
Load on demand:
- Vector index docs
- SEARCH clause docs
- Vector functions docs
- ai.text.embed() / GenAI plugin docs [2025.12] — replaces deprecated `genai.vector.encode()`
- db.create.setNodeVectorProperty docs
- Chunking strategy, batch embed+store, splitter patterns — see the document import skill
- Vector search with filters — 2026.01 preview