When to Use
- Creating a vector index (`CREATE VECTOR INDEX`) on nodes or relationships
- Running vector similarity / nearest-neighbor search
- Storing embeddings on graph nodes during ingestion
- Choosing similarity function, dimensions, HNSW params, or quantization
- Using the `SEARCH` clause (2026.01+) or `db.index.vector.queryNodes()` (2025.x)
- Batch-updating embeddings after model change
- Combining vector results with immediate graph neighborhood (full retrieval_query pipelines → )
When NOT to Use
- GraphRAG pipelines (VectorCypherRetriever, HybridCypherRetriever, retrieval_query) →
- Fulltext / keyword search (`FULLTEXT INDEX`, `db.index.fulltext.queryNodes`) →
- GDS graph embeddings (FastRP, Node2Vec, GraphSAGE) →
- Index admin (list all indexes, drop range/text/lookup indexes) →
Pre-flight — Determine Version
Drives syntax choice:
```cypher
CALL dbms.components() YIELD versions RETURN versions[0] AS neo4j_version
```
| Version | Use |
|---|---|
| 2026.01 or higher | `SEARCH` clause (in-index filtering, preferred) |
| 2025.x | `db.index.vector.queryNodes()` procedure (deprecated 2026.04 — use only when not on 2026.01+) |
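The version gate above can be sketched as a small helper. This is an illustrative sketch, not part of any driver API; it assumes the version string is the one returned by `dbms.components()` (calendar-versioned like `2026.01.0`, or older semver like `5.26.1`):

```python
def supports_search_clause(version: str) -> bool:
    """True if this Neo4j version string (e.g. '2026.01.0') supports the SEARCH clause."""
    parts = version.split(".")
    try:
        major, minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return False  # unparseable version string: assume no SEARCH support
    # Semver-era releases (5.x) sort below 2026 and correctly fall through to False.
    return (major, minor) >= (2026, 1)

# Decide once at startup which query style to use:
assert supports_search_clause("2026.01.2") is True
assert supports_search_clause("2025.10.0") is False
assert supports_search_clause("5.26.1") is False
```

Checking once at startup avoids branching on version inside every query path.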
Step 1 — Create Vector Index
Node index (single label):
```cypher
CYPHER 25
CREATE VECTOR INDEX chunk_embedding IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine',
    `vector.quantization.enabled`: true,
    `vector.hnsw.m`: 16,
    `vector.hnsw.ef_construction`: 100
  }
}
```
Node index with filterable properties [2026.01+] — declares which properties can be used in `SEARCH ... WHERE`:
```cypher
CYPHER 25
CREATE VECTOR INDEX chunk_embedding IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
WITH [c.source, c.lang, c.published_year]  // stored as metadata; filterable in SEARCH WHERE
OPTIONS { indexConfig: { `vector.dimensions`: 1536, `vector.similarity_function`: 'cosine' } }
```
Multi-label index with filterable properties [2026.01+]:
```cypher
CYPHER 25
CREATE VECTOR INDEX doc_embedding IF NOT EXISTS
FOR (n:Document|Article) ON n.embedding
WITH [n.author, n.published_year, n.lang]
OPTIONS { indexConfig: { `vector.dimensions`: 1536, `vector.similarity_function`: 'cosine' } }
```
Relationship index:
```cypher
CYPHER 25
CREATE VECTOR INDEX rel_embedding IF NOT EXISTS
FOR ()-[r:HAS_CHUNK]-() ON (r.embedding)
OPTIONS { indexConfig: { `vector.dimensions`: 768, `vector.similarity_function`: 'cosine' } }
```
Filterable property types — only scalar types allowed: `BOOLEAN`, `STRING`, `INTEGER`, `FLOAT`, `DATE`, `LOCAL TIME`, `ZONED TIME`, `LOCAL DATETIME`, `ZONED DATETIME`, `DURATION`. Not allowed: `LIST`, `POINT`, or the vector property itself.
Index config reference:
| Parameter | Type | Default | Notes |
|---|---|---|---|
| `vector.dimensions` | INTEGER 1–4096 | none | Required; must match embedding model exactly |
| `vector.similarity_function` | STRING | `cosine` | `cosine` or `euclidean` |
| `vector.quantization.enabled` | BOOLEAN | `true` | Reduces storage; slight accuracy tradeoff; needs vector-2.0+ (5.18+) |
| `vector.hnsw.m` | INTEGER 1–512 | `16` | HNSW graph connections; higher = better recall, more memory |
| `vector.hnsw.ef_construction` | INTEGER 1–3200 | `100` | Build-time candidates; higher = better recall, slower build |
Similarity function choice:
| Use case | Function |
|---|---|
| Normalized embeddings (OpenAI, Cohere, Voyage, Google) | `cosine` |
| Unnormalized / raw distance matters | `euclidean` |
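To reason about what the index's `score` column means, it helps to compute the scores by hand. The sketch below mirrors the score normalization as documented for Neo4j vector indexes — cosine maps to `(1 + cos) / 2`, euclidean to `1 / (1 + d²)` — verify against the docs for your server version before relying on exact values:

```python
import math

def cosine_score(a: list[float], b: list[float]) -> float:
    """Neo4j-style cosine index score in [0, 1]: (1 + cos(a, b)) / 2."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return (1.0 + dot / (na * nb)) / 2.0

def euclidean_score(a: list[float], b: list[float]) -> float:
    """Neo4j-style euclidean index score in (0, 1]: 1 / (1 + d^2)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + d2)

# Identical vectors score 1.0 under both mappings; orthogonal unit
# vectors score 0.5 under cosine.
v = [0.6, 0.8]
print(cosine_score(v, v))                  # 1.0
print(cosine_score([1.0, 0.0], [0.0, 1.0]))  # 0.5
```

Both mappings are monotone, so higher score always means more similar regardless of which function the index uses.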
Step 2 — Wait for Index ONLINE
Index builds asynchronously — do NOT query until ONLINE:
```cypher
SHOW VECTOR INDEXES YIELD name, state, populationPercent
WHERE name = 'chunk_embedding'
RETURN name, state, populationPercent
```
Poll every 5s until `state = "ONLINE"` and `populationPercent = 100.0`. If `state = "FAILED"` → stop, check logs.
Shell poll (cypher-shell):
```bash
until cypher-shell -u neo4j -p "$NEO4J_PASSWORD" \
  "SHOW VECTOR INDEXES YIELD name, state WHERE name='chunk_embedding' RETURN state" \
  | grep -q ONLINE; do
  sleep 5
done
```
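The same poll loop in Python, written as a sketch with the state check injected as a callable so it works with any driver call (the callable would typically run the `SHOW VECTOR INDEXES` query above and return the `state` string):

```python
import time

def wait_for_online(check_state, interval: float = 5.0, timeout: float = 600.0) -> str:
    """Poll check_state() until it returns 'ONLINE'; raise on 'FAILED' or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = check_state()
        if state == "ONLINE":
            return state
        if state == "FAILED":
            raise RuntimeError("index build FAILED — check neo4j logs")
        time.sleep(interval)
    raise TimeoutError("index did not come ONLINE within timeout")

# Example with a stubbed state sequence standing in for real driver calls:
states = iter(["POPULATING", "POPULATING", "ONLINE"])
print(wait_for_online(lambda: next(states), interval=0.01))  # ONLINE
```

Raising on `FAILED` (rather than polling forever) matches the guidance above: stop and check the logs.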
Step 3 — Ingest Embeddings
Batch UNWIND pattern (use for > 100 nodes — never one-node-per-transaction):
```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver(uri, auth=(user, password))

def embed_batch(texts: list[str]) -> list[list[float]]:
    # openai_client: pre-configured OpenAI client, assumed to be in scope
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return [r.embedding for r in response.data]

def store_embeddings(records: list[dict], batch_size: int = 500):
    expected_dim = 1536  # must match vector.dimensions
    texts = [r["text"] for r in records]
    embeddings = embed_batch(texts)
    for emb in embeddings:
        assert len(emb) == expected_dim, f"Dim mismatch: {len(emb)} != {expected_dim}"
    rows = [{"id": r["id"], "embedding": emb}
            for r, emb in zip(records, embeddings)]
    for i in range(0, len(rows), batch_size):
        driver.execute_query(
            "UNWIND $rows AS row MATCH (c:Chunk {id: row.id}) SET c.embedding = row.embedding",
            rows=rows[i:i+batch_size]
        )
```
❌ Never create index after embeddings are already stored — always create index first.
✅ Create index → poll ONLINE → ingest embeddings.
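The validation and batching steps of the pattern above can be factored into two small reusable helpers. This is a sketch (helper names are illustrative); the zero-vector check anticipates the "Score always 1.0" failure mode described under Common Errors:

```python
def validate_embeddings(embeddings: list[list[float]], expected_dim: int = 1536) -> None:
    """Fail fast before any write: every vector must match vector.dimensions,
    and an all-zeros vector means the embed call silently failed."""
    for i, emb in enumerate(embeddings):
        if len(emb) != expected_dim:
            raise ValueError(f"row {i}: dim {len(emb)} != {expected_dim}")
        if all(x == 0 for x in emb):
            raise ValueError(f"row {i}: all-zeros embedding (embed call likely failed)")

def batches(rows: list, size: int = 500):
    """Yield UNWIND-sized slices so no single transaction carries more than `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

validate_embeddings([[0.1] * 1536, [0.2] * 1536])
print([len(b) for b in batches(list(range(1200)), 500)])  # [500, 500, 200]
```

Validating the whole result set before the first write means a bad batch never leaves partially-written embeddings behind.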
Step 4 — Run Vector Search
SEARCH clause (2026.01+, preferred)
```cypher
CYPHER 25
MATCH (c:Chunk)
SEARCH c IN (
  VECTOR INDEX chunk_embedding
  FOR $queryEmbedding
  LIMIT 10
) SCORE AS score
RETURN c.text, score
ORDER BY score DESC
```
With in-index filter [2026.01+] — properties must be declared in the `WITH [...]` clause at index creation:
```cypher
// Index must have been created with: WITH [c.source, c.lang, c.published_year]
CYPHER 25
MATCH (c:Chunk)
SEARCH c IN (
  VECTOR INDEX chunk_embedding
  FOR $queryEmbedding
  WHERE c.source = $source AND c.lang = 'en' AND c.published_year >= 2024
  LIMIT 10
) SCORE AS score
RETURN c.text, c.source, score
ORDER BY score DESC
```
Filtering strategy — choose one:
| Strategy | When to use | Tradeoff |
|---|---|---|
| In-index [2026.01+] | Filters on pre-declared properties; known at index design time | Fast, consistent latency; properties must be declared upfront |
| Post-filter (MATCH + procedure) | Arbitrary Cypher predicates, graph traversal, OR/NOT | Full flexibility; may over-fetch then discard |
| Pre-filter (MATCH first, then SEARCH) | Small known candidate set; exact nearest-neighbor within subset | Deterministic; slow on large candidate sets |
In-index hard limits [2026.01+]:
- Property must be listed in the `WITH [...]` clause at index creation — undeclared properties silently fall back to post-filtering
- AND predicates only — no OR, NOT, list ops, string ops
- Scalar types only: `BOOLEAN`, `STRING`, `INTEGER`, `FLOAT`, temporal types — not VECTOR/LIST/POINT
Post-filter pattern (2025.x or arbitrary predicates)
```cypher
CYPHER 25
CALL db.index.vector.queryNodes('chunk_embedding', 50, $queryEmbedding)
YIELD node AS c, score
WHERE c.source = $source  // post-filter: fetch more, then filter
RETURN c.text, score
ORDER BY score DESC LIMIT 10
```
Relationship index procedure:
```cypher
CYPHER 25
CALL db.index.vector.queryRelationships('rel_embedding', 5, $queryEmbedding)
YIELD relationship AS r, score
RETURN r.text, score
```
SEARCH clause hard limits (all versions):
- Index name cannot be a parameter (`$indexName` not allowed — use a literal string)
- Binding variable must come from the enclosing MATCH pattern
- Query vector cannot reference the binding variable
Step 5 — Combine with Graph Traversal (simple cases)
Vector search as entry point, then graph hop:
```cypher
CYPHER 25
MATCH (c:Chunk)
SEARCH c IN (
  VECTOR INDEX chunk_embedding
  FOR $queryEmbedding
  LIMIT 10
) SCORE AS score
MATCH (c)<-[:HAS_CHUNK]-(a:Article)
OPTIONAL MATCH (a)-[:MENTIONS]->(org:Organization)
RETURN c.text, a.title, score, collect(DISTINCT org.name) AS organizations
ORDER BY score DESC
```
For full retrieval_query pipelines, HybridCypherRetriever, or the `neo4j-graphrag` library → delegate to .
Embedding Provider Quick-Reference
| Provider / Model | Dimensions | Similarity | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | cosine | Default; reducible to 256–1536 via `dimensions` param |
| OpenAI text-embedding-3-large | 3072 | cosine | Reducible to 256–3072 |
| OpenAI text-embedding-ada-002 | 1536 | cosine | Legacy; prefer 3-small |
| Cohere embed-v3 (English) | 1024 | cosine | Use `input_type='search_document'` at ingest, `input_type='search_query'` at query |
| Voyage voyage-3-large | 1024 | cosine | High quality; needs `voyageai` package |
| Google text-embedding-004 | 768 | cosine | Via Vertex AI |
| Ollama nomic-embed-text | 768 | cosine | Local dev/testing |
| Ollama mxbai-embed-large | 1024 | cosine | Local; production-quality |
`vector.dimensions` must exactly match model output — no auto-truncation.
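The table above can be encoded as a lookup so index creation and ingest validation share one source of truth. A sketch; the dict keys are the model identifiers from the table, and the helper name is illustrative:

```python
# Dimensions taken from the provider table above.
MODEL_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
    "embed-v3": 1024,
    "voyage-3-large": 1024,
    "text-embedding-004": 768,
    "nomic-embed-text": 768,
    "mxbai-embed-large": 1024,
}

def index_dimensions_for(model: str) -> int:
    """Return the vector.dimensions value an index must use for this model."""
    try:
        return MODEL_DIMS[model]
    except KeyError:
        raise ValueError(f"unknown model {model!r} — check provider docs") from None

print(index_dimensions_for("text-embedding-3-small"))  # 1536
```

Raising on unknown models keeps a typo in the model name from silently creating an index with the wrong dimension count.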
Vector Functions
Ad-hoc similarity (not for kNN search — use index for that):
```cypher
MATCH (a:Chunk {id: $id1}), (b:Chunk {id: $id2})
RETURN vector.similarity.cosine(a.embedding, b.embedding) AS sim
// vector.similarity.euclidean(a, b) — same signature, 0–1 range

// vector_distance (2025.10+) — metrics: EUCLIDEAN, EUCLIDEAN_SQUARED, MANHATTAN, COSINE, DOT, HAMMING
// Returns distance (lower = more similar, inverse of similarity)
RETURN vector_distance(a.embedding, b.embedding, 'COSINE') AS dist

// vector_dimension_count (2025.10+)
RETURN vector_dimension_count(n.embedding) AS dims

// vector_norm (2025.20+) — metrics: EUCLIDEAN, MANHATTAN
RETURN vector_norm(n.embedding, 'EUCLIDEAN') AS norm
```
Convert LIST to typed VECTOR:
```cypher
// vector(value, dimension, coordinateType)
// coordinateType: FLOAT64, FLOAT32, INTEGER8/16/32/64
WITH vector([1.0, 2.0, 3.0], 3, 'FLOAT32') AS v
RETURN vector_dimension_count(v)
```
Index Management
```cypher
// Show all vector indexes with config
SHOW VECTOR INDEXES YIELD name, state, populationPercent,
  labelsOrTypes, properties, indexConfig
RETURN name, state, populationPercent, labelsOrTypes, properties, indexConfig;

// Drop (node data unchanged — only index structure removed)
DROP INDEX chunk_embedding IF EXISTS;

// No ALTER VECTOR INDEX — to change dimensions or similarity function:
// 1. DROP INDEX old_index IF EXISTS
// 2. CREATE VECTOR INDEX new_index ... with new OPTIONS
// 3. Re-generate all embeddings with new model
// 4. Poll until ONLINE
```
Common Errors
| Error | Cause | Fix |
|---|---|---|
| `IllegalArgumentException: Index dimension mismatch` | Stored embedding dim ≠ `vector.dimensions` | Fix embed generation; drop + recreate index with correct dim |
| Search returns incomplete results | Index still `POPULATING` | Poll until `ONLINE` |
| `Unknown procedure db.index.vector.queryNodes` | Neo4j < 5.11 | No vector index support below 5.11; upgrade |
| `SEARCH clause not available` | Neo4j < 2026.01 | Use `db.index.vector.queryNodes()` procedure |
| OR/NOT not allowed in SEARCH WHERE | SEARCH in-index filter restriction | Move complex predicates to outer WHERE after SEARCH |
| Zero results from correct query | Wrong similarity function or all-zeros embedding | Verify with `vector.similarity.cosine()`; check embed call succeeded |
| Score always 1.0 | All-zeros or identical vectors | Embedding generation failed; add dimension assertion before ingest |
| `vector.quantization.enabled` option rejected | Provider vector-1.0 (Neo4j < 5.18) | Omit quantization option or upgrade to 5.18+ |
Checklist
In-Cypher Embedding Generation — ai.text.embed() [2025.12]
Generate embeddings at query time without external Python code. Use `ai.text.embed()` — the current API since [2025.12]:
cypher
// Syntax (requires CYPHER 25)
CYPHER 25
// ai.text.embed(resource :: STRING, provider :: STRING, configuration :: MAP) :: VECTOR
Provider strings are lowercase (e.g. `openai`). Full provider config → .
Full query pattern — embed at query time, search immediately (procedure fallback for 2025.x):
```cypher
CYPHER 25
WITH ai.text.embed(
  "What are good open source projects",
  "openai",
  { token: $openaiKey, model: 'text-embedding-3-small' }) AS userEmbedding
CALL db.index.vector.queryNodes('chunk_embedding', 6, userEmbedding)  // deprecated 2026.04
YIELD node AS c, score
RETURN c.text, score
ORDER BY score DESC
```
With SEARCH clause (2026.01+):
```cypher
CYPHER 25
WITH ai.text.embed("my query", "openai", { token: $openaiKey, model: 'text-embedding-3-small' }) AS userEmbedding
MATCH (c:Chunk)
SEARCH c IN (VECTOR INDEX chunk_embedding FOR userEmbedding LIMIT 6) SCORE AS score
RETURN c.text, score
ORDER BY score DESC
```
❌ Never pass API key as literal string in production — use a query parameter or a secrets store.
✅ Use the `$openaiKey` parameter; inject via driver params dict.
Rule: Use same model at ingest time and query time — embeddings from different models are not comparable.
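A cheap way to enforce this rule is to store the model name as node metadata at ingest (as the Gotchas table below also suggests) and guard every query path with a comparison. A minimal sketch; the function name is illustrative:

```python
def check_model_match(stored_model: str, query_model: str) -> None:
    """Guard against mixing embedding spaces: the model tag stored on nodes at
    ingest time must equal the model used to embed the query text."""
    if stored_model != query_model:
        raise ValueError(
            f"embedding model mismatch: ingested with {stored_model!r}, "
            f"querying with {query_model!r} — similarity scores will be meaningless"
        )

check_model_match("text-embedding-3-small", "text-embedding-3-small")  # ok
```

Failing loudly here is important because a model mismatch produces no error from Neo4j, only quietly degraded scores.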
Deprecated (still works but do not use in new code):
- `genai.vector.encode()` [deprecated] → use `ai.text.embed()` [2025.12]
- `genai.vector.encodeBatch()` [deprecated] → use `CALL ai.text.embedBatch()` [2025.12]
- `genai.vector.listEncodingProviders()` [deprecated] → use `CALL ai.text.embed.providers()` [2025.12]
For the full `ai` function reference (completion, structured output, chat, tokenization) → .
Cypher-Based Embedding Ingestion — db.create.setNodeVectorProperty
Set vector property via Cypher (e.g. during LOAD CSV or MERGE pipeline):
```cypher
LOAD CSV WITH HEADERS FROM 'https://example.com/data.csv' AS row
MERGE (q:Question {text: row.question})
WITH q, row
CALL db.create.setNodeVectorProperty(q, 'embedding', apoc.convert.fromJsonList(row.question_embedding))
```
Use when the embedding is already in CSV/JSON form as a string — `apoc.convert.fromJsonList()` converts the `STRING` to a `LIST<FLOAT>`.
For Python-generated embeddings, use the Python UNWIND batch pattern (Step 3) instead.
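If you pre-process the CSV in Python instead, the same conversion is a one-liner with `json.loads`. A sketch of the Python-side equivalent of `apoc.convert.fromJsonList`, with the dimension check from Step 3 folded in (the helper name is illustrative):

```python
import json

def parse_embedding(raw: str, expected_dim: int) -> list[float]:
    """CSV embedding column (JSON string) -> list of floats, validated before ingest."""
    emb = [float(x) for x in json.loads(raw)]
    if len(emb) != expected_dim:
        raise ValueError(f"dim {len(emb)} != {expected_dim}")
    return emb

print(parse_embedding("[0.1, 0.2, 0.3]", 3))  # [0.1, 0.2, 0.3]
```

Parsing and validating in Python before the `UNWIND` write keeps malformed rows out of the database entirely.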
Similarity Function — Extended Guidance
Existing table (Step 1) gives the basic rule. Additional guidance from course patterns:
Choose based on training loss function:
- Check embedding model docs — models trained with cosine loss → use `cosine`
- Models trained with L2/Euclidean loss → use `euclidean`
- When docs are silent: default to `cosine` (all major hosted APIs use it)
Common pitfall — wrong similarity function:
❌ Created index with 'euclidean' but model outputs L2-normalized vectors
→ scores are mathematically correct but rankings differ from expected cosine order
→ no error thrown; wrong results silently returned
✅ Verify: run vector.similarity.cosine(a.embedding, b.embedding) manually on known
similar pairs — score should be > 0.9 for near-duplicate text
Sanity check query after index creation:
```cypher
MATCH (c:Chunk) WITH c LIMIT 2
WITH collect(c) AS nodes
RETURN vector.similarity.cosine(nodes[0].embedding, nodes[1].embedding) AS cosine_check,
       vector.similarity.euclidean(nodes[0].embedding, nodes[1].embedding) AS euclidean_check
```
If both return `null` → embeddings not set. If cosine returns exactly `1.0` → identical vectors (embed call failed).
Gotchas — Extended
| Gotcha | Detail | Fix |
|---|---|---|
| Index not ONLINE at ingest time | Inserting nodes before index exists is valid — index auto-populates. But querying during `POPULATING` returns partial results | Always poll `ONLINE` before first query |
| Wrong dimensions — silent failure | Stored vector dim ≠ `vector.dimensions` → fails at query time, not at ingest time | Assert dimensions before every ingest batch |
| Different models at ingest vs query | No error; cosine scores ~0.3–0.5 for clearly similar text | Use same model string/version for both; store model name as node metadata |
| Missing model at query | `ai.text.embed()` fails silently if provider config wrong | Test encode call standalone; check `CYPHER 25 RETURN ai.text.embed(...)` before embedding into pipeline |
| Large single-transaction ingest | One transaction for 10k nodes → OOM or timeout | Use `UNWIND $rows ... CALL { ... } IN TRANSACTIONS OF 500 ROWS` or Python batch loop |
| Chunk overlap not set | Adjacent chunks with no overlap → context at boundaries lost → poor recall for cross-paragraph queries | Set overlap ≥ 10% of chunk size |
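The chunk-overlap gotcha in the last row can be avoided with a sliding-window chunker. A character-based sketch (production chunkers usually split on tokens or sentences, but the overlap arithmetic is the same):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Sliding-window chunker; overlap >= 10% of size preserves boundary context."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    step = size - overlap
    # max(..., 1) ensures at least one chunk for short inputs
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 500, size=200, overlap=20)
print([len(c) for c in chunks])  # [200, 200, 140]
```

Each chunk repeats the last `overlap` characters of its predecessor, so a query about text straddling a boundary still matches at least one chunk.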
References
Load on demand: