Microsoft GraphRAG Skill
Expert assistance for using Microsoft GraphRAG, a modular graph-based Retrieval-Augmented Generation system that extracts structured knowledge from unstructured text to enhance LLM reasoning over private data.
When to Use This Skill
This skill should be used when:
- Building RAG systems that need to "connect the dots" across dispersed information
- Querying large document collections holistically
- Extracting structured knowledge graphs from unstructured text
- Implementing graph-based retrieval for LLM applications
- Processing private datasets with enhanced reasoning capabilities
- Working with narrative, unstructured documents
- Building question-answering systems over document corpora
- Extracting entities, relationships, and claims from text
- Creating hierarchical knowledge summaries
- Implementing multi-hop reasoning over documents
- Comparing GraphRAG with traditional vector-based RAG
- Tuning prompts for domain-specific datasets
- Configuring indexing pipelines for knowledge extraction
Overview
What is GraphRAG?
Microsoft GraphRAG is a data pipeline and transformation system that:
- Extracts meaningful, structured data from unstructured text using LLMs
- Builds knowledge graph memory structures
- Enhances LLM outputs through graph-based retrieval
- Supports private data processing without external exposure
Core Innovation:
"GraphRAG addresses fundamental limitations of baseline RAG: connecting the dots across disparate information pieces and holistically understanding summarized concepts over large collections."
Key Differentiators from Baseline RAG
Traditional vector-based RAG has limitations:
- ❌ Struggles to connect information across multiple documents
- ❌ Limited holistic understanding of document collections
- ❌ Misses relationships between dispersed facts
- ❌ Poor performance on "summarize the corpus" queries
GraphRAG solves these with:
- ✅ Knowledge graph extraction from text
- ✅ Hierarchical community detection
- ✅ Multi-level summarization
- ✅ Graph-based reasoning and traversal
- ✅ Better performance on complex queries
Core Concepts
1. Knowledge Graph Extraction
GraphRAG extracts three primary elements:
Entities: Objects, people, places, concepts
Examples:
- "Microsoft" (Organization)
- "Seattle" (Location)
- "Cloud Computing" (Concept)
- "Satya Nadella" (Person)

Relationships: Connections between entities
Examples:
- Microsoft → headquartered_in → Seattle
- Satya Nadella → is_CEO_of → Microsoft
- Microsoft → provides → Cloud Computing

Claims: Factual statements with supporting evidence
Examples:
- "Microsoft is the largest software company" [Source: Document X, Page 5]
- "Azure revenue grew 30% in Q4" [Source: Earnings Report]
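Taken together, the three element types above can be modeled as plain data structures. A minimal sketch, using the examples from this section — the class and field names here are illustrative, not GraphRAG's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    type: str               # e.g. "Organization", "Person"
    description: str = ""

@dataclass
class Relationship:
    source: str             # entity name
    target: str             # entity name
    type: str               # e.g. "headquartered_in"

@dataclass
class Claim:
    statement: str
    sources: list[str] = field(default_factory=list)

# Build the examples from the text above
microsoft = Entity("Microsoft", "Organization")
seattle = Entity("Seattle", "Location")
rel = Relationship("Microsoft", "Seattle", "headquartered_in")
claim = Claim("Azure revenue grew 30% in Q4", ["Earnings Report"])

print(rel.source, "->", rel.type, "->", rel.target)
```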
2. Hierarchical Community Detection
GraphRAG uses the Leiden algorithm to:
- Cluster related entities into communities
- Create hierarchical levels of organization
- Generate summaries at each level
- Enable bottom-up reasoning
Example Hierarchy:
Level 0 (Detailed):
Community 1: Azure services (Compute, Storage, Networking)
Community 2: Office products (Word, Excel, PowerPoint)
Level 1 (Mid-level):
Community A: Cloud services (includes Community 1)
Community B: Productivity tools (includes Community 2)
Level 2 (High-level):
Community X: Microsoft product ecosystem (includes A & B)
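The bottom-up idea behind this hierarchy can be sketched with nested dictionaries: a query at a high level aggregates the summaries of its descendant communities. The dict layout below is illustrative, not GraphRAG's storage format:

```python
# Parent -> children links across hierarchy levels (illustrative)
hierarchy = {
    "Community X": ["Community A", "Community B"],
    "Community A": ["Community 1"],
    "Community B": ["Community 2"],
}

# Leaf-level summaries (what level-0 summarization would produce)
summaries = {
    "Community 1": "Azure services: Compute, Storage, Networking",
    "Community 2": "Office products: Word, Excel, PowerPoint",
}

def aggregate(community: str) -> str:
    """Concatenate leaf summaries beneath a community, bottom-up."""
    if community in summaries:
        return summaries[community]
    parts = [aggregate(child) for child in hierarchy.get(community, [])]
    return "; ".join(parts)

# A high-level query draws on both leaf communities
print(aggregate("Community X"))
```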
3. TextUnits
Documents are segmented into TextUnits:
- Manageable chunks for analysis
- Sized based on token limits
- Overlapping to preserve context
- Form the basis of entity extraction
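The overlapping-window idea can be sketched in a few lines. Note the simplification: real GraphRAG chunks by model tokens, while this toy version splits a word list:

```python
def chunk_text(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split a word list into overlapping windows (toy TextUnits)."""
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(30)]
units = chunk_text(words, size=12, overlap=2)
print(len(units), [len(u) for u in units])
```

The overlap means each chunk repeats the tail of the previous one, so an entity mentioned near a boundary is still seen whole by at least one chunk.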
4. Query Modes
GraphRAG offers multiple search strategies:
Global Search: Holistic corpus reasoning
- Best for: "Summarize the main themes"
- Uses: Community summaries at all levels
- Method: Bottom-up aggregation
Local Search: Entity-specific reasoning
- Best for: "Tell me about Entity X"
- Uses: Entity neighborhoods in graph
- Method: Traversal from seed entities
DRIFT Search: Entity reasoning with community context
- Best for: "How does X relate to broader themes?"
- Uses: Entities + community summaries
- Method: Hybrid approach
Basic Search: Traditional vector similarity
- Best for: Simple semantic matching
- Uses: Embedding similarity
- Method: Baseline RAG fallback
Installation
Prerequisites
```bash
# Python 3.10 or higher required
python --version

# Install GraphRAG
pip install graphrag

# Or install from source
git clone https://github.com/microsoft/graphrag.git
cd graphrag
pip install -e .
```

Environment Setup
```bash
# Create environment file
cat > .env << EOF
# LLM Configuration (OpenAI)
GRAPHRAG_LLM_API_KEY=your-openai-api-key
GRAPHRAG_LLM_TYPE=openai_chat
GRAPHRAG_LLM_MODEL=gpt-4o

# Embedding Configuration
GRAPHRAG_EMBEDDING_API_KEY=your-openai-api-key
GRAPHRAG_EMBEDDING_TYPE=openai_embedding
GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small

# Optional: Azure OpenAI (uncomment to use)
# GRAPHRAG_LLM_API_BASE=https://your-resource.openai.azure.com
# GRAPHRAG_LLM_API_VERSION=2024-02-15-preview
# GRAPHRAG_LLM_DEPLOYMENT_NAME=gpt-4

# Optional: Local models (uncomment to use)
# GRAPHRAG_LLM_TYPE=ollama
# GRAPHRAG_LLM_API_BASE=http://localhost:11434
EOF
```

Quick Start
1. Initialize Project
```bash
# Create new GraphRAG project
mkdir my-graphrag-project
cd my-graphrag-project

# Initialize configuration
graphrag init --root .
```

This creates:
- settings.yaml (configuration)
- .env (environment variables)
- prompts/ (customizable prompts)

2. Prepare Your Data
```bash
# Create input directory
mkdir -p input

# Add your documents
cp /path/to/documents/*.txt input/
```

Supported formats: .txt, .pdf, .docx, .md. Each file will be processed independently.

3. Run Indexing Pipeline
```bash
# Index your data (this can take time and cost money!)
graphrag index --root .
```

The indexing process will:
1. Load and chunk documents
2. Extract entities, relationships, and claims
3. Build the knowledge graph
4. Detect communities (Leiden algorithm)
5. Generate community summaries
6. Create embeddings
7. Store results in output/

```bash
# Monitor progress
graphrag index --root . --verbose
```

4. Query Your Data
```bash
# Global Search (holistic queries)
graphrag query --root . \
  --method global \
  --query "What are the main themes in this dataset?"

# Local Search (entity-specific queries)
graphrag query --root . \
  --method local \
  --query "Tell me about Microsoft's cloud strategy"

# DRIFT Search (entity + community context)
graphrag query --root . \
  --method drift \
  --query "How does Azure relate to the broader Microsoft ecosystem?"
```

Configuration
settings.yaml Structure
```yaml
# Core Configuration
llm:
  api_key: ${GRAPHRAG_LLM_API_KEY}
  type: openai_chat  # or azure_openai_chat, ollama
  model: gpt-4o
  max_tokens: 4000
  temperature: 0
  top_p: 1

embeddings:
  api_key: ${GRAPHRAG_EMBEDDING_API_KEY}
  type: openai_embedding
  model: text-embedding-3-small

# Chunking Configuration
chunks:
  size: 1200    # Token size per chunk
  overlap: 100  # Overlap between chunks
  group_by_columns: [id]

# Entity Extraction
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  max_gleanings: 1  # Re-extraction passes
  entity_types: [organization, person, location, event]

# Community Detection
community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

# Claim Extraction
claim_extraction:
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  max_gleanings: 1

# Embeddings
embed_graph:
  enabled: true
  strategy: node2vec  # or deepwalk

# Storage
storage:
  type: file  # or blob, cosmosdb
  base_dir: output

# Reporting
reporting:
  type: file
  base_dir: output/reports
```

Advanced Configuration Options
```yaml
# Custom LLM Configuration
llm:
  type: azure_openai_chat
  api_base: https://your-resource.openai.azure.com
  api_version: "2024-02-15-preview"
  deployment_name: gpt-4
  api_key: ${AZURE_OPENAI_API_KEY}
  request_timeout: 180
  max_retries: 10
  max_retry_wait: 10

# Parallelization
parallelization:
  stagger: 0.3    # Delay between requests
  num_threads: 4  # Concurrent workers

# Cache Configuration
cache:
  type: file
  base_dir: cache

# Input Configuration
input:
  type: file
  file_type: text  # or csv, parquet
  base_dir: input
  encoding: utf-8
  file_pattern: '.*\.txt$'
```

Prompt Tuning
Why Tune Prompts?
"Using GraphRAG with your data out of the box may not yield the best possible results."
Domain-specific datasets require custom prompts for:
- Relevant entity types
- Appropriate relationship types
- Domain-specific language
- Expected output format
Auto-Tuning Process
```bash
# Generate domain-adapted prompts
graphrag prompt-tune --root . \
  --config settings.yaml \
  --output prompts/
```

This will:
1. Analyze your input documents
2. Identify domain-specific patterns
3. Generate custom entity extraction prompts
4. Generate custom summarization prompts
5. Save them to the prompts/ directory

Manual Prompt Customization
```bash
# Edit generated prompts
nano prompts/entity_extraction.txt
```

**Example Entity Extraction Prompt:**

```
-Target activity-
You are an AI assistant helping to identify entities in documents about {DOMAIN}.

-Goal-
Extract all entities and relationships from the text below.

Entity Types:
{ENTITY_TYPES}

Relationship Types:
{RELATIONSHIP_TYPES}

Format your response as JSON:
{{
  "entities": [
    {{"name": "Entity Name", "type": "ENTITY_TYPE", "description": "..."}}
  ],
  "relationships": [
    {{"source": "Entity 1", "target": "Entity 2", "type": "RELATIONSHIP_TYPE", "description": "..."}}
  ]
}}

Text to analyze:
{INPUT_TEXT}
```
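Because the prompt requests JSON, downstream code can parse the model's response directly into entity and relationship records. A sketch — the response string here is a hand-written stand-in for LLM output, not captured from a real model:

```python
import json

# Stand-in for an LLM response following the prompt's JSON format
response = """
{
  "entities": [
    {"name": "Microsoft", "type": "ORGANIZATION", "description": "Software company"}
  ],
  "relationships": [
    {"source": "Microsoft", "target": "Seattle", "type": "LOCATED_IN", "description": "HQ"}
  ]
}
"""

data = json.loads(response)
entities = {e["name"]: e["type"] for e in data["entities"]}
edges = [(r["source"], r["type"], r["target"]) for r in data["relationships"]]
print(entities)
print(edges)
```

In practice you would also guard against malformed JSON (models occasionally wrap output in prose or code fences), e.g. with a try/except around `json.loads`.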
Indexing Pipeline Deep Dive
Step-by-Step Process
**1. Document Loading**
- Input documents are loaded from the input/ directory
- Supported formats: .txt, .pdf, .docx, .md

**2. Text Chunking**
- Documents are split into TextUnits
- Default: 1200 tokens with 100-token overlap
- Overlap preserves context across chunk boundaries

**3. Entity Extraction**

For each TextUnit:
- Extract entities (with types and descriptions)
- Extract relationships (with types and weights)
- Extract claims (with sources and confidence)

**4. Graph Construction**

Build the knowledge graph:
- Nodes = entities
- Edges = relationships
- Properties = attributes and metadata

**5. Community Detection**

Leiden algorithm for hierarchical clustering:
- Level 0: fine-grained communities
- Level 1: mid-level aggregations
- Level 2+: high-level themes

**6. Community Summarization**

For each community at each level:
- Aggregate entity and relationship information
- Generate a natural-language summary
- Store it for query-time retrieval

**7. Embedding Generation**

Create vector embeddings for:
- TextUnits (for similarity search)
- Entities (for semantic matching)
- Community summaries (for global search)

**8. Output Storage**

Results are saved to output/:
- create_final_entities.parquet
- create_final_relationships.parquet
- create_final_communities.parquet
- create_final_community_reports.parquet
- create_final_text_units.parquet

Query Modes in Detail
Global Search
Best For:
- "What are the main themes?"
- "Summarize the entire dataset"
- "What are the key trends?"

How It Works:
- The query is matched against community summaries
- Relevant communities are selected at all hierarchy levels
- Summaries are aggregated bottom-up
- The final answer is synthesized from multiple levels

Example:

```bash
graphrag query --root . \
  --method global \
  --query "What are the major technology trends discussed in these documents?"
```

Behind the scenes:
1. Match the query to relevant communities
2. Retrieve summaries from levels 0, 1, and 2
3. Aggregate: AI/ML, Cloud, Cybersecurity communities
4. Synthesize a comprehensive answer

**Python API:**

```python
from graphrag.query import GlobalSearch

searcher = GlobalSearch(
    llm=llm,
    context_builder=context_builder,
    map_system_prompt=map_prompt,
    reduce_system_prompt=reduce_prompt
)

result = await searcher.asearch(
    query="What are the major themes?",
    conversation_history=[]
)
print(result.response)
```
Local Search
Best For:
- "Tell me about [specific entity]"
- "What is the relationship between X and Y?"
- "Find information about [topic]"

How It Works:
- Identify entities mentioned in the query
- Traverse the graph from those entities
- Collect neighborhood information (N-hop)
- Retrieve associated TextUnits
- Synthesize an answer from the local context

Example:

```bash
graphrag query --root . \
  --method local \
  --query "What is Microsoft's strategy for artificial intelligence?"
```

Behind the scenes:
1. Identify: "Microsoft", "artificial intelligence" entities
2. Traverse: find related entities (Azure AI, OpenAI partnership, etc.)
3. Collect: relationships, claims, TextUnits
4. Synthesize: answer from the local graph neighborhood

**Python API:**

```python
from graphrag.query import LocalSearch

searcher = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    system_prompt=system_prompt
)

result = await searcher.asearch(
    query="Tell me about Microsoft's AI strategy",
    conversation_history=[]
)
print(result.response)
```
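The N-hop neighborhood collection in the traversal step can be sketched as a breadth-first walk over an adjacency map. The graph below is illustrative, not data produced by GraphRAG:

```python
from collections import deque

# Illustrative entity graph as an adjacency map
graph = {
    "Microsoft": ["Azure AI", "OpenAI"],
    "Azure AI": ["Machine Learning"],
    "OpenAI": ["GPT-4"],
    "Machine Learning": [],
    "GPT-4": [],
}

def n_hop_neighborhood(seed: str, hops: int) -> set[str]:
    """Collect all entities within `hops` edges of the seed entity."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

print(sorted(n_hop_neighborhood("Microsoft", hops=1)))
print(sorted(n_hop_neighborhood("Microsoft", hops=2)))
```

Increasing the hop count widens the context window around the seed entity, at the cost of pulling in less relevant neighbors.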
DRIFT Search
Best For:
- "How does [entity] fit into [broader context]?"
- "What is the significance of [topic]?"
- Hybrid queries needing both local and global context

How It Works:
- Identify query entities (like Local Search)
- Find relevant communities (like Global Search)
- Combine entity neighborhoods with community summaries
- Synthesize an answer from both perspectives

Example:

```bash
graphrag query --root . \
  --method drift \
  --query "How does Azure AI relate to Microsoft's overall cloud strategy?"
```

Behind the scenes:
1. Local: find the "Azure AI" entity and its neighborhood
2. Global: find "cloud strategy" community summaries
3. Combine: entity details + strategic context
4. Synthesize: a comprehensive answer

Python API Usage
Basic Setup
```python
import asyncio

from graphrag.query import LocalSearch, GlobalSearch
from graphrag.llm import create_openai_chat_llm
from graphrag.config import GraphRagConfig

# Load configuration
config = GraphRagConfig.from_file("settings.yaml")

# Create LLM
llm = create_openai_chat_llm(
    api_key=config.llm.api_key,
    model=config.llm.model,
    temperature=0.0
)
```

Custom Indexing
```python
from graphrag.index import run_pipeline_with_config

# Run indexing programmatically
await run_pipeline_with_config(
    config_path="settings.yaml",
    verbose=True
)
```

Advanced Query Customization
```python
from graphrag.query.context_builder import LocalContextBuilder

# Build custom context
context_builder = LocalContextBuilder(
    entities=entities_df,
    relationships=relationships_df,
    text_units=text_units_df,
    embeddings=embeddings
)

# Custom search with parameters
result = await searcher.asearch(
    query="Your question here",
    conversation_history=[
        {"role": "user", "content": "Previous question"},
        {"role": "assistant", "content": "Previous answer"}
    ],
    top_k=10,         # Number of results
    temperature=0.5,  # LLM creativity
    max_tokens=2000   # Response length
)

# Access detailed results
print("Response:", result.response)
print("Context used:", result.context_data)
print("Sources:", result.sources)
```

Use Cases and Examples
1. Research Paper Analysis
```bash
# Index academic papers
mkdir -p input/papers
cp research_papers/*.pdf input/papers/
graphrag index --root .

# Global query
graphrag query --method global \
  --query "What are the main research themes across these papers?"

# Local query
graphrag query --method local \
  --query "What methodologies does the Smith et al. paper use?"
```

2. Legal Document Processing
```bash
# Index legal contracts
mkdir -p input/contracts
cp contracts/*.docx input/contracts/

# Tune prompts for the legal domain
graphrag prompt-tune --root . --domain "legal contracts"

# Index with legal-specific entities
graphrag index --root .

# Query
graphrag query --method local \
  --query "What are the termination clauses in the Microsoft contracts?"
```

3. Customer Feedback Analysis
```bash
# Index customer feedback
mkdir -p input/feedback
cp feedback_*.txt input/feedback/

# Global themes
graphrag query --method global \
  --query "What are the main customer pain points?"

# Specific product feedback
graphrag query --method local \
  --query "What feedback relates to product X features?"
```

4. News Article Summarization
```bash
# Index news articles
mkdir -p input/news
cp articles/*.txt input/news/
graphrag index --root .

# Get a comprehensive summary
graphrag query --method global \
  --query "Summarize the key events and trends from these news articles"

# Entity-specific news
graphrag query --method local \
  --query "What news relates to climate change initiatives?"
```

Advanced Features
1. Incremental Indexing
```bash
# Initial indexing
graphrag index --root .

# Add new documents
cp new_documents/*.txt input/

# Re-index only new content
graphrag index --root . --incremental

# Note: the full graph may need periodic rebuilding
```

2. Custom Entity Types
Edit prompts/entity_extraction.txt:

```
Entity Types:
- PRODUCT: Software products, services
- FEATURE: Product features and capabilities
- TECHNOLOGY: Technologies and frameworks
- METRIC: Performance metrics, KPIs
- INITIATIVE: Projects and strategic initiatives
- COMPETITOR: Competing products or companies
```

3. Multi-Language Support
```yaml
# settings.yaml
input:
  encoding: utf-8
  language: es  # Spanish

llm:
  model: gpt-4o  # Multilingual model
```

Customize prompts in the target language.

4. Azure OpenAI Integration
```yaml
llm:
  type: azure_openai_chat
  api_base: https://your-resource.openai.azure.com
  api_version: "2024-02-15-preview"
  deployment_name: gpt-4
  api_key: ${AZURE_OPENAI_API_KEY}

embeddings:
  type: azure_openai_embedding
  api_base: https://your-resource.openai.azure.com
  api_version: "2024-02-15-preview"
  deployment_name: text-embedding-3-small
  api_key: ${AZURE_OPENAI_API_KEY}
```

5. Local LLM Support (Ollama)
```yaml
llm:
  type: ollama
  api_base: http://localhost:11434
  model: llama3:70b
  temperature: 0

embeddings:
  type: ollama
  api_base: http://localhost:11434
  model: nomic-embed-text
```

Cost Management
Understanding Costs
GraphRAG uses LLM APIs which incur costs:
Indexing Phase (most expensive):
- Entity extraction: Multiple LLM calls per TextUnit
- Relationship extraction: Additional calls
- Community summarization: Calls per community
- Embedding generation: Per entity/TextUnit
Query Phase (less expensive):
- Context retrieval: Minimal LLM use
- Answer synthesis: Single LLM call per query
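A back-of-the-envelope estimate follows from the structure above: indexing cost scales roughly with (number of chunks) × (LLM calls per chunk) × (tokens per call). The call count and price below are hypothetical placeholders, not measured GraphRAG figures:

```python
def estimate_indexing_tokens(
    total_doc_tokens: int,
    chunk_size: int = 1200,
    overlap: int = 100,
    calls_per_chunk: int = 3,  # hypothetical: entity + relationship + claim passes
) -> int:
    """Rough token volume sent to the LLM during indexing."""
    step = chunk_size - overlap
    num_chunks = max(1, -(-total_doc_tokens // step))  # ceiling division
    return num_chunks * calls_per_chunk * chunk_size

tokens = estimate_indexing_tokens(total_doc_tokens=1_000_000)
price_per_1k = 0.005  # hypothetical $/1K input tokens
print(f"~{tokens:,} tokens, ~${tokens / 1000 * price_per_1k:,.2f}")
```

Even as a rough sketch, this makes the lever points obvious: chunk size, overlap, and the number of extraction passes all multiply directly into the bill.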
Cost Optimization Strategies
1. Reduce Chunk Size

```yaml
chunks:
  size: 600   # Smaller chunks = fewer tokens
  overlap: 50
```

2. Limit Entity Extraction Passes

```yaml
entity_extraction:
  max_gleanings: 0  # 0 = single pass, 1 = two passes
```

3. Use Smaller Models

```yaml
llm:
  model: gpt-4o-mini  # Cheaper than gpt-4o
embeddings:
  model: text-embedding-3-small  # Cheaper than the large model
```

4. Process a Subset First

```bash
# Test on a small sample
mkdir -p input/sample
ls input/full/*.txt | head -5 | xargs -I{} cp {} input/sample/
graphrag index --root . --input-dir input/sample
```

5. Cache Aggressively

```yaml
cache:
  type: file
  base_dir: cache
```

Cost Estimation
python
undefinedpython
undefinedEstimate before indexing
索引前估算成本
from graphrag.index import estimate_index_cost
cost_estimate = estimate_index_cost(
input_dir="input/",
config_path="settings.yaml"
)
print(f"Estimated cost: ${cost_estimate.total_cost}")
print(f"Total tokens: {cost_estimate.total_tokens}")
print(f"Estimated time: {cost_estimate.estimated_hours} hours")
undefinedfrom graphrag.index import estimate_index_cost
cost_estimate = estimate_index_cost(
input_dir="input/",
config_path="settings.yaml"
)
print(f"估算成本: ${cost_estimate.total_cost}")
print(f"总token数: {cost_estimate.total_tokens}")
print(f"估算时间: {cost_estimate.estimated_hours} 小时")
undefinedBest Practices
1. Start Small

```bash
# Test with 5-10 documents first
# Validate outputs before scaling
# Tune prompts on a small sample
# Then scale to the full dataset
```

2. Monitor Indexing Progress
```bash
# Use verbose mode
graphrag index --root . --verbose

# Check output files periodically
ls -lh output/*.parquet

# Monitor logs
tail -f output/reports/indexing.log
```

3. Version Control Configuration
```bash
# Track changes
git add settings.yaml prompts/
git commit -m "Update entity types for domain X"

# Tag successful configurations
git tag -a v1.0-config -m "Working config for dataset X"
```

4. Validate Outputs
```python
import pandas as pd

# Check extracted entities
entities = pd.read_parquet("output/create_final_entities.parquet")
print(f"Total entities: {len(entities)}")
print(f"Entity types: {entities['type'].value_counts()}")

# Check relationships
relationships = pd.read_parquet("output/create_final_relationships.parquet")
print(f"Total relationships: {len(relationships)}")
print(f"Relationship types: {relationships['type'].value_counts()}")

# Check communities
communities = pd.read_parquet("output/create_final_communities.parquet")
print(f"Total communities: {len(communities)}")
print(f"Hierarchy levels: {communities['level'].value_counts()}")
```

5. Iterate on Prompts
```bash
# Run the initial index
graphrag index --root .

# Evaluate quality
graphrag query --method global --query "Test query"

# If quality is poor:
# 1. Adjust entity types in prompts
# 2. Modify extraction instructions
# 3. Re-run indexing
# 4. Validate improvements
```

Troubleshooting

Common Issues
"API rate limit exceeded"
"API速率限制超出"
```yaml
# Add delays between requests
parallelization:
  stagger: 1.0    # Increase delay
  num_threads: 2  # Reduce concurrency
llm:
  max_retries: 20     # More retries
  max_retry_wait: 60  # Longer backoff
```

"Out of memory during indexing"
```yaml
# Reduce batch sizes
chunks:
  size: 600  # Smaller chunks
parallelization:
  num_threads: 2  # Less parallelism
```

"Poor quality entity extraction"
```bash
# Run prompt tuning
graphrag prompt-tune --root . --domain "your domain"

# Manually refine prompts:
# add domain-specific examples and specify expected entity types clearly
nano prompts/entity_extraction.txt
```

"Queries return irrelevant results"
```bash
# Check whether indexing completed successfully
ls -lh output/*.parquet

# Validate extracted entities
python -c "import pandas as pd; print(pd.read_parquet('output/create_final_entities.parquet').head())"

# Try different query methods
graphrag query --method local --query "Your query"
graphrag query --method global --query "Your query"
```

"Version incompatibility after update"
```bash
# Reinitialize configuration; this updates settings.yaml to the new schema
graphrag init --root . --force

# Afterward, review the regenerated file and merge your customizations back in
```

Performance Optimization

Indexing Performance
```yaml
# Optimize for speed
parallelization:
  num_threads: 8  # Max concurrent workers
  stagger: 0.1    # Minimal delay
chunks:
  size: 1500  # Larger chunks (fewer API calls)
entity_extraction:
  max_gleanings: 0  # Single pass only
```

Query Performance
```python
from functools import lru_cache

import pandas as pd

# Cache query results (assumes a configured GraphRAG searcher object)
@lru_cache(maxsize=100)
def cached_query(query_text):
    return searcher.search(query_text)

# Pre-load data structures and keep them in memory for fast access
entities_df = pd.read_parquet("output/create_final_entities.parquet")
relationships_df = pd.read_parquet("output/create_final_relationships.parquet")
```

Storage Optimization
```yaml
# Use compressed storage
storage:
  type: file
  compression: gzip  # Or snappy, lz4
```

```yaml
# Or use database storage
storage:
  type: cosmosdb
  connection_string: ${COSMOS_CONNECTION_STRING}
```

Integration Examples
LangChain Integration
```python
from langchain.chains import RetrievalQA
from langchain.retrievers import GraphRAGRetriever
from langchain_openai import ChatOpenAI

# Create the GraphRAG retriever
retriever = GraphRAGRetriever(
    index_path="output/",
    search_method="local"
)

# Build the QA chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

# Query (RetrievalQA returns the answer under the "result" key)
result = qa_chain({"query": "What are the main themes?"})
print(result["result"])
```

FastAPI Service
```python
from fastapi import FastAPI
from graphrag.query import LocalSearch, GlobalSearch

app = FastAPI()

# Initialize searchers
local_searcher = LocalSearch(...)
global_searcher = GlobalSearch(...)

@app.post("/query/local")
async def query_local(query: str):
    result = await local_searcher.asearch(query)
    return {"response": result.response, "sources": result.sources}

@app.post("/query/global")
async def query_global(query: str):
    result = await global_searcher.asearch(query)
    return {"response": result.response}

# Run with: uvicorn main:app --reload
```

Streamlit UI
```python
import streamlit as st
from graphrag.query import GlobalSearch

searcher = GlobalSearch(...)

st.title("GraphRAG Query Interface")

# Query input
query = st.text_input("Enter your question:")
method = st.selectbox("Search method:", ["global", "local", "drift"])

if st.button("Search"):
    with st.spinner("Searching..."):
        # Streamlit scripts run synchronously, so use the sync search API
        result = searcher.search(query)

    # Display results
    st.write("### Answer")
    st.write(result.response)
    st.write("### Sources")
    st.write(result.sources)
```

Comparison with Other Approaches

GraphRAG vs. Vector RAG
| Feature | Vector RAG | GraphRAG |
|---|---|---|
| Structure | Flat embeddings | Knowledge graph |
| Relationships | Implicit (similarity) | Explicit (edges) |
| Multi-hop | Poor | Excellent |
| Summarization | Difficult | Natural (communities) |
| Setup Cost | Low | High (indexing) |
| Query Cost | Low | Medium |
| Best For | Simple lookups | Complex reasoning |
When to Use GraphRAG
✅ Use GraphRAG when:
- Queries require connecting multiple pieces of information
- Need holistic understanding of document corpus
- Relationships between entities matter
- Multi-hop reasoning is important
- Domain has rich entity/relationship structure
❌ Use Vector RAG when:
- Simple semantic search is sufficient
- Low setup cost is priority
- Documents are independent
- Queries are straightforward lookups
- Budget is constrained
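The two checklists can be condensed into a toy routing heuristic. The function below is an illustrative sketch only: the signal names and thresholds are invented for this guide and are not part of any GraphRAG API; real routing should be validated against your own query workload.

```python
# Toy router: pick between vector RAG and GraphRAG from coarse signals
# about the query workload. Signals and thresholds are invented for
# illustration.
def choose_rag_method(
    needs_multi_hop: bool,        # answers require connecting facts
    needs_corpus_summary: bool,   # holistic "what are the themes" queries
    relationships_matter: bool,   # entity/relationship structure is rich
    budget_constrained: bool,     # indexing cost is a hard limit
) -> str:
    graph_signals = sum([needs_multi_hop, needs_corpus_summary, relationships_matter])
    if budget_constrained and graph_signals == 0:
        return "vector-rag"
    # Any strong graph signal justifies the higher indexing cost
    return "graphrag" if graph_signals >= 1 else "vector-rag"

print(choose_rag_method(True, False, False, False))  # → graphrag
print(choose_rag_method(False, False, False, True))  # → vector-rag
```

The design point: GraphRAG's extra cost sits almost entirely in indexing, so the decision is really "do any of my queries need the graph", not "do all of them".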
Resources
Documentation
- Official Docs: https://microsoft.github.io/graphrag/
- GitHub: https://github.com/microsoft/graphrag
- Research Paper: https://arxiv.org/abs/2404.16130
Community
- GitHub Discussions: https://github.com/microsoft/graphrag/discussions
- Issues: https://github.com/microsoft/graphrag/issues
Examples
Important Notes
⚠️ Not an Official Microsoft Product
"This codebase is a demonstration of graph-based RAG and not an officially supported Microsoft offering."
💰 Cost Considerations
- Indexing can be expensive (especially with GPT-4)
- Test on small samples first
- Monitor API costs closely
🔄 Version Management
- Configuration schemas change between versions
- Run `graphrag init --force` after updates
- Review migration guides for breaking changes
🎯 Prompt Tuning is Critical
- Out-of-box results may be suboptimal
- Domain-specific tuning significantly improves quality
- Invest time in prompt customization
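In practice, customization usually means editing the entity-type list and few-shot examples in the generated `prompts/entity_extraction.txt`. A hypothetical fragment for a biomedical corpus might look like the following — the entity types and example are invented for illustration, and the exact output format must match the delimiter tokens in the prompt GraphRAG generated for you:

```text
-Target entity types-
DRUG, DISEASE, GENE, CLINICAL_TRIAL, INSTITUTION

-Example-
Text: "Pembrolizumab showed efficacy against melanoma in the KEYNOTE-006 trial."
Output:
("entity"{tuple_delimiter}PEMBROLIZUMAB{tuple_delimiter}DRUG{tuple_delimiter}An immunotherapy drug evaluated in KEYNOTE-006)
("entity"{tuple_delimiter}MELANOMA{tuple_delimiter}DISEASE{tuple_delimiter}A skin cancer treated in the trial)
("entity"{tuple_delimiter}KEYNOTE-006{tuple_delimiter}CLINICAL_TRIAL{tuple_delimiter}A trial comparing pembrolizumab with ipilimumab)
```

Concrete, domain-true examples like this tend to matter more than wording tweaks to the instructions themselves.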
License
许可证
Microsoft GraphRAG is released under the MIT License.
Note: This skill provides comprehensive guidance for using Microsoft GraphRAG. Always test on small datasets first, monitor costs, and tune prompts for your specific domain.