Microsoft GraphRAG Skill
Expert assistance for using Microsoft GraphRAG, a modular graph-based Retrieval-Augmented Generation system that extracts structured knowledge from unstructured text to enhance LLM reasoning over private data.
When to Use This Skill
This skill should be used when:
- Building RAG systems that need to "connect the dots" across dispersed information
- Querying large document collections holistically
- Extracting structured knowledge graphs from unstructured text
- Implementing graph-based retrieval for LLM applications
- Processing private datasets with enhanced reasoning capabilities
- Working with narrative, unstructured documents
- Building question-answering systems over document corpora
- Extracting entities, relationships, and claims from text
- Creating hierarchical knowledge summaries
- Implementing multi-hop reasoning over documents
- Comparing GraphRAG with traditional vector-based RAG
- Tuning prompts for domain-specific datasets
- Configuring indexing pipelines for knowledge extraction
Overview
What is GraphRAG?
Microsoft GraphRAG is a data pipeline and transformation system that:
- Extracts meaningful, structured data from unstructured text using LLMs
- Builds knowledge graph memory structures
- Enhances LLM outputs through graph-based retrieval
- Supports private data processing without external exposure
Core Innovation:
"GraphRAG addresses fundamental limitations of baseline RAG: connecting the dots across disparate information pieces and holistically understanding summarized concepts over large collections."
Key Differentiators from Baseline RAG
Traditional vector-based RAG has limitations:
- ❌ Struggles to connect information across multiple documents
- ❌ Limited holistic understanding of document collections
- ❌ Misses relationships between dispersed facts
- ❌ Poor performance on "summarize the corpus" queries
GraphRAG solves these with:
- ✅ Knowledge graph extraction from text
- ✅ Hierarchical community detection
- ✅ Multi-level summarization
- ✅ Graph-based reasoning and traversal
- ✅ Better performance on complex queries
Core Concepts
1. Knowledge Graph Extraction
GraphRAG extracts three primary elements:
Entities: Objects, people, places, concepts
Examples:
- "Microsoft" (Organization)
- "Seattle" (Location)
- "Cloud Computing" (Concept)
- "Satya Nadella" (Person)

Relationships: Connections between entities
Examples:
- Microsoft → headquartered_in → Seattle
- Satya Nadella → is_CEO_of → Microsoft
- Microsoft → provides → Cloud Computing

Claims: Factual statements with supporting evidence
Examples:
- "Microsoft is the largest software company" [Source: Document X, Page 5]
- "Azure revenue grew 30% in Q4" [Source: Earnings Report]
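Taken together, the three element types above can be modeled as plain data structures. A minimal sketch, using the examples from this section — the class and field names here are illustrative, not GraphRAG's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    type: str               # e.g. "Organization", "Person"
    description: str = ""

@dataclass
class Relationship:
    source: str             # entity name
    target: str             # entity name
    type: str               # e.g. "headquartered_in"

@dataclass
class Claim:
    statement: str
    sources: list[str] = field(default_factory=list)

# Build the examples from the text above
microsoft = Entity("Microsoft", "Organization")
seattle = Entity("Seattle", "Location")
rel = Relationship("Microsoft", "Seattle", "headquartered_in")
claim = Claim("Azure revenue grew 30% in Q4", ["Earnings Report"])

print(rel.source, "->", rel.type, "->", rel.target)
```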
2. Hierarchical Community Detection
GraphRAG uses the Leiden algorithm to:
- Cluster related entities into communities
- Create hierarchical levels of organization
- Generate summaries at each level
- Enable bottom-up reasoning
Example Hierarchy:
Level 0 (Detailed):
Community 1: Azure services (Compute, Storage, Networking)
Community 2: Office products (Word, Excel, PowerPoint)
Level 1 (Mid-level):
Community A: Cloud services (includes Community 1)
Community B: Productivity tools (includes Community 2)
Level 2 (High-level):
Community X: Microsoft product ecosystem (includes A & B)
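The bottom-up idea behind this hierarchy can be sketched with nested dictionaries: a query at a high level aggregates the summaries of its descendant communities. The dict layout below is illustrative, not GraphRAG's storage format:

```python
# Parent -> children links across hierarchy levels (illustrative)
hierarchy = {
    "Community X": ["Community A", "Community B"],
    "Community A": ["Community 1"],
    "Community B": ["Community 2"],
}

# Leaf-level summaries (what level-0 summarization would produce)
summaries = {
    "Community 1": "Azure services: Compute, Storage, Networking",
    "Community 2": "Office products: Word, Excel, PowerPoint",
}

def aggregate(community: str) -> str:
    """Concatenate leaf summaries beneath a community, bottom-up."""
    if community in summaries:
        return summaries[community]
    parts = [aggregate(child) for child in hierarchy.get(community, [])]
    return "; ".join(parts)

# A high-level query draws on both leaf communities
print(aggregate("Community X"))
```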
3. TextUnits
Documents are segmented into TextUnits:
- Manageable chunks for analysis
- Sized based on token limits
- Overlapping to preserve context
- Form the basis of entity extraction
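The overlapping-window idea can be sketched in a few lines. Note the simplification: real GraphRAG chunks by model tokens, while this toy version splits a word list:

```python
def chunk_text(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split a word list into overlapping windows (toy TextUnits)."""
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(30)]
units = chunk_text(words, size=12, overlap=2)
print(len(units), [len(u) for u in units])
```

The overlap means each chunk repeats the tail of the previous one, so an entity mentioned near a boundary is still seen whole by at least one chunk.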
4. Query Modes
GraphRAG offers multiple search strategies:
Global Search: Holistic corpus reasoning
- Best for: "Summarize the main themes"
- Uses: Community summaries at all levels
- Method: Bottom-up aggregation
Local Search: Entity-specific reasoning
- Best for: "Tell me about Entity X"
- Uses: Entity neighborhoods in graph
- Method: Traversal from seed entities
DRIFT Search: Entity reasoning with community context
- Best for: "How does X relate to broader themes?"
- Uses: Entities + community summaries
- Method: Hybrid approach
Basic Search: Traditional vector similarity
- Best for: Simple semantic matching
- Uses: Embedding similarity
- Method: Baseline RAG fallback
Installation
Prerequisites
```bash
# Python 3.10 or higher required
python --version

# Install GraphRAG
pip install graphrag

# Or install from source
git clone https://github.com/microsoft/graphrag.git
cd graphrag
pip install -e .
```

Environment Setup
```bash
# Create environment file
cat > .env << EOF
# LLM Configuration (OpenAI)
GRAPHRAG_LLM_API_KEY=your-openai-api-key
GRAPHRAG_LLM_TYPE=openai_chat
GRAPHRAG_LLM_MODEL=gpt-4o

# Embedding Configuration
GRAPHRAG_EMBEDDING_API_KEY=your-openai-api-key
GRAPHRAG_EMBEDDING_TYPE=openai_embedding
GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small

# Optional: Azure OpenAI (uncomment to use)
# GRAPHRAG_LLM_API_BASE=https://your-resource.openai.azure.com
# GRAPHRAG_LLM_API_VERSION=2024-02-15-preview
# GRAPHRAG_LLM_DEPLOYMENT_NAME=gpt-4

# Optional: Local models (uncomment to use)
# GRAPHRAG_LLM_TYPE=ollama
# GRAPHRAG_LLM_API_BASE=http://localhost:11434
EOF
```

Quick Start
1. Initialize Project
```bash
# Create new GraphRAG project
mkdir my-graphrag-project
cd my-graphrag-project

# Initialize configuration
graphrag init --root .
```

This creates:
- settings.yaml (configuration)
- .env (environment variables)
- prompts/ (customizable prompts)

2. Prepare Your Data
```bash
# Create input directory
mkdir -p input

# Add your documents
cp /path/to/documents/*.txt input/
```

Supported formats: .txt, .pdf, .docx, .md. Each file will be processed independently.

3. Run Indexing Pipeline
```bash
# Index your data (this can take time and cost money!)
graphrag index --root .
```

The indexing process will:
1. Load and chunk documents
2. Extract entities, relationships, and claims
3. Build the knowledge graph
4. Detect communities (Leiden algorithm)
5. Generate community summaries
6. Create embeddings
7. Store results in output/

```bash
# Monitor progress
graphrag index --root . --verbose
```

4. Query Your Data
```bash
# Global Search (holistic queries)
graphrag query --root . \
  --method global \
  --query "What are the main themes in this dataset?"

# Local Search (entity-specific queries)
graphrag query --root . \
  --method local \
  --query "Tell me about Microsoft's cloud strategy"

# DRIFT Search (entity + community context)
graphrag query --root . \
  --method drift \
  --query "How does Azure relate to the broader Microsoft ecosystem?"
```

Configuration
settings.yaml Structure
```yaml
# Core Configuration
llm:
  api_key: ${GRAPHRAG_LLM_API_KEY}
  type: openai_chat  # or azure_openai_chat, ollama
  model: gpt-4o
  max_tokens: 4000
  temperature: 0
  top_p: 1

embeddings:
  api_key: ${GRAPHRAG_EMBEDDING_API_KEY}
  type: openai_embedding
  model: text-embedding-3-small

# Chunking Configuration
chunks:
  size: 1200    # Token size per chunk
  overlap: 100  # Overlap between chunks
  group_by_columns: [id]

# Entity Extraction
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  max_gleanings: 1  # Re-extraction passes
  entity_types: [organization, person, location, event]

# Community Detection
community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

# Claim Extraction
claim_extraction:
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  max_gleanings: 1

# Embeddings
embed_graph:
  enabled: true
  strategy: node2vec  # or deepwalk

# Storage
storage:
  type: file  # or blob, cosmosdb
  base_dir: output

# Reporting
reporting:
  type: file
  base_dir: output/reports
```

Advanced Configuration Options
```yaml
# Custom LLM Configuration
llm:
  type: azure_openai_chat
  api_base: https://your-resource.openai.azure.com
  api_version: "2024-02-15-preview"
  deployment_name: gpt-4
  api_key: ${AZURE_OPENAI_API_KEY}
  request_timeout: 180
  max_retries: 10
  max_retry_wait: 10

# Parallelization
parallelization:
  stagger: 0.3    # Delay between requests
  num_threads: 4  # Concurrent workers

# Cache Configuration
cache:
  type: file
  base_dir: cache

# Input Configuration
input:
  type: file
  file_type: text  # or csv, parquet
  base_dir: input
  encoding: utf-8
  file_pattern: '.*\.txt$'
```

Prompt Tuning
Why Tune Prompts?
"Using GraphRAG with your data out of the box may not yield the best possible results."
Domain-specific datasets require custom prompts for:
- Relevant entity types
- Appropriate relationship types
- Domain-specific language
- Expected output format
Auto-Tuning Process
```bash
# Generate domain-adapted prompts
graphrag prompt-tune --root . \
  --config settings.yaml \
  --output prompts/
```

This will:
1. Analyze your input documents
2. Identify domain-specific patterns
3. Generate custom entity extraction prompts
4. Generate custom summarization prompts
5. Save them to the prompts/ directory

Manual Prompt Customization
```bash
# Edit generated prompts
nano prompts/entity_extraction.txt
```

**Example Entity Extraction Prompt:**

```
-Target activity-
You are an AI assistant helping to identify entities in documents about {DOMAIN}.

-Goal-
Extract all entities and relationships from the text below.

Entity Types:
{ENTITY_TYPES}

Relationship Types:
{RELATIONSHIP_TYPES}

Format your response as JSON:
{{
  "entities": [
    {{"name": "Entity Name", "type": "ENTITY_TYPE", "description": "..."}}
  ],
  "relationships": [
    {{"source": "Entity 1", "target": "Entity 2", "type": "RELATIONSHIP_TYPE", "description": "..."}}
  ]
}}

Text to analyze:
{INPUT_TEXT}
```
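Because the prompt requests JSON, downstream code can parse the model's response directly into entity and relationship records. A sketch — the response string here is a hand-written stand-in for LLM output, not captured from a real model:

```python
import json

# Stand-in for an LLM response following the prompt's JSON format
response = """
{
  "entities": [
    {"name": "Microsoft", "type": "ORGANIZATION", "description": "Software company"}
  ],
  "relationships": [
    {"source": "Microsoft", "target": "Seattle", "type": "LOCATED_IN", "description": "HQ"}
  ]
}
"""

data = json.loads(response)
entities = {e["name"]: e["type"] for e in data["entities"]}
edges = [(r["source"], r["type"], r["target"]) for r in data["relationships"]]
print(entities)
print(edges)
```

In practice you would also guard against malformed JSON (models occasionally wrap output in prose or code fences), e.g. with a try/except around `json.loads`.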
Indexing Pipeline Deep Dive
Step-by-Step Process
**1. Document Loading**
- Input documents are loaded from the input/ directory
- Supported formats: .txt, .pdf, .docx, .md

**2. Text Chunking**
- Documents are split into TextUnits
- Default: 1200 tokens with 100-token overlap
- Overlap preserves context across chunk boundaries

**3. Entity Extraction**

For each TextUnit:
- Extract entities (with types and descriptions)
- Extract relationships (with types and weights)
- Extract claims (with sources and confidence)

**4. Graph Construction**

Build the knowledge graph:
- Nodes = entities
- Edges = relationships
- Properties = attributes and metadata

**5. Community Detection**

Leiden algorithm for hierarchical clustering:
- Level 0: fine-grained communities
- Level 1: mid-level aggregations
- Level 2+: high-level themes

**6. Community Summarization**

For each community at each level:
- Aggregate entity and relationship information
- Generate a natural-language summary
- Store it for query-time retrieval

**7. Embedding Generation**

Create vector embeddings for:
- TextUnits (for similarity search)
- Entities (for semantic matching)
- Community summaries (for global search)

**8. Output Storage**

Results are saved to output/:
- create_final_entities.parquet
- create_final_relationships.parquet
- create_final_communities.parquet
- create_final_community_reports.parquet
- create_final_text_units.parquet

Query Modes in Detail
Global Search
Best For:
- "What are the main themes?"
- "Summarize the entire dataset"
- "What are the key trends?"

How It Works:
- The query is matched against community summaries
- Relevant communities are selected at all hierarchy levels
- Summaries are aggregated bottom-up
- The final answer is synthesized from multiple levels

Example:

```bash
graphrag query --root . \
  --method global \
  --query "What are the major technology trends discussed in these documents?"
```

Behind the scenes:
1. Match the query to relevant communities
2. Retrieve summaries from levels 0, 1, and 2
3. Aggregate: AI/ML, Cloud, Cybersecurity communities
4. Synthesize a comprehensive answer

**Python API:**

```python
from graphrag.query import GlobalSearch

searcher = GlobalSearch(
    llm=llm,
    context_builder=context_builder,
    map_system_prompt=map_prompt,
    reduce_system_prompt=reduce_prompt
)

result = await searcher.asearch(
    query="What are the major themes?",
    conversation_history=[]
)
print(result.response)
```
Local Search
Best For:
- "Tell me about [specific entity]"
- "What is the relationship between X and Y?"
- "Find information about [topic]"

How It Works:
- Identify entities mentioned in the query
- Traverse the graph from those entities
- Collect neighborhood information (N-hop)
- Retrieve associated TextUnits
- Synthesize an answer from the local context

Example:

```bash
graphrag query --root . \
  --method local \
  --query "What is Microsoft's strategy for artificial intelligence?"
```

Behind the scenes:
1. Identify: "Microsoft", "artificial intelligence" entities
2. Traverse: find related entities (Azure AI, OpenAI partnership, etc.)
3. Collect: relationships, claims, TextUnits
4. Synthesize: answer from the local graph neighborhood

**Python API:**

```python
from graphrag.query import LocalSearch

searcher = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    system_prompt=system_prompt
)

result = await searcher.asearch(
    query="Tell me about Microsoft's AI strategy",
    conversation_history=[]
)
print(result.response)
```
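The N-hop neighborhood collection in the traversal step can be sketched as a breadth-first walk over an adjacency map. The graph below is illustrative, not data produced by GraphRAG:

```python
from collections import deque

# Illustrative entity graph as an adjacency map
graph = {
    "Microsoft": ["Azure AI", "OpenAI"],
    "Azure AI": ["Machine Learning"],
    "OpenAI": ["GPT-4"],
    "Machine Learning": [],
    "GPT-4": [],
}

def n_hop_neighborhood(seed: str, hops: int) -> set[str]:
    """Collect all entities within `hops` edges of the seed entity."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

print(sorted(n_hop_neighborhood("Microsoft", hops=1)))
print(sorted(n_hop_neighborhood("Microsoft", hops=2)))
```

Increasing the hop count widens the context window around the seed entity, at the cost of pulling in less relevant neighbors.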
DRIFT Search
Best For:
- "How does [entity] fit into [broader context]?"
- "What is the significance of [topic]?"
- Hybrid queries needing both local and global context

How It Works:
- Identify query entities (like Local Search)
- Find relevant communities (like Global Search)
- Combine entity neighborhoods with community summaries
- Synthesize an answer from both perspectives

Example:

```bash
graphrag query --root . \
  --method drift \
  --query "How does Azure AI relate to Microsoft's overall cloud strategy?"
```

Behind the scenes:
1. Local: find the "Azure AI" entity and its neighborhood
2. Global: find "cloud strategy" community summaries
3. Combine: entity details + strategic context
4. Synthesize: a comprehensive answer

Python API Usage
Basic Setup
```python
import asyncio

from graphrag.query import LocalSearch, GlobalSearch
from graphrag.llm import create_openai_chat_llm
from graphrag.config import GraphRagConfig

# Load configuration
config = GraphRagConfig.from_file("settings.yaml")

# Create LLM
llm = create_openai_chat_llm(
    api_key=config.llm.api_key,
    model=config.llm.model,
    temperature=0.0
)
```

Custom Indexing
```python
from graphrag.index import run_pipeline_with_config

# Run indexing programmatically
await run_pipeline_with_config(
    config_path="settings.yaml",
    verbose=True
)
```

Advanced Query Customization
```python
from graphrag.query.context_builder import LocalContextBuilder

# Build custom context
context_builder = LocalContextBuilder(
    entities=entities_df,
    relationships=relationships_df,
    text_units=text_units_df,
    embeddings=embeddings
)

# Custom search with parameters
result = await searcher.asearch(
    query="Your question here",
    conversation_history=[
        {"role": "user", "content": "Previous question"},
        {"role": "assistant", "content": "Previous answer"}
    ],
    top_k=10,         # Number of results
    temperature=0.5,  # LLM creativity
    max_tokens=2000   # Response length
)

# Access detailed results
print("Response:", result.response)
print("Context used:", result.context_data)
print("Sources:", result.sources)
```

Use Cases and Examples
1. Research Paper Analysis
```bash
# Index academic papers
mkdir -p input/papers
cp research_papers/*.pdf input/papers/
graphrag index --root .

# Global query
graphrag query --method global \
  --query "What are the main research themes across these papers?"

# Local query
graphrag query --method local \
  --query "What methodologies does the Smith et al. paper use?"
```

2. Legal Document Processing
```bash
# Index legal contracts
mkdir -p input/contracts
cp contracts/*.docx input/contracts/

# Tune prompts for the legal domain
graphrag prompt-tune --root . --domain "legal contracts"

# Index with legal-specific entities
graphrag index --root .

# Query
graphrag query --method local \
  --query "What are the termination clauses in the Microsoft contracts?"
```

3. Customer Feedback Analysis
```bash
# Index customer feedback
mkdir -p input/feedback
cp feedback_*.txt input/feedback/

# Global themes
graphrag query --method global \
  --query "What are the main customer pain points?"

# Specific product feedback
graphrag query --method local \
  --query "What feedback relates to product X features?"
```

4. News Article Summarization
```bash
# Index news articles
mkdir -p input/news
cp articles/*.txt input/news/
graphrag index --root .

# Get a comprehensive summary
graphrag query --method global \
  --query "Summarize the key events and trends from these news articles"

# Entity-specific news
graphrag query --method local \
  --query "What news relates to climate change initiatives?"
```

Advanced Features
1. Incremental Indexing
```bash
# Initial indexing
graphrag index --root .

# Add new documents
cp new_documents/*.txt input/

# Re-index only new content
graphrag index --root . --incremental

# Note: the full graph may need periodic rebuilding
```

2. Custom Entity Types
Edit prompts/entity_extraction.txt:

```
Entity Types:
- PRODUCT: Software products, services
- FEATURE: Product features and capabilities
- TECHNOLOGY: Technologies and frameworks
- METRIC: Performance metrics, KPIs
- INITIATIVE: Projects and strategic initiatives
- COMPETITOR: Competing products or companies
```

3. Multi-Language Support
```yaml
# settings.yaml
input:
  encoding: utf-8
  language: es  # Spanish

llm:
  model: gpt-4o  # Multilingual model
```

Customize prompts in the target language.

4. Azure OpenAI Integration
```yaml
llm:
  type: azure_openai_chat
  api_base: https://your-resource.openai.azure.com
  api_version: "2024-02-15-preview"
  deployment_name: gpt-4
  api_key: ${AZURE_OPENAI_API_KEY}

embeddings:
  type: azure_openai_embedding
  api_base: https://your-resource.openai.azure.com
  api_version: "2024-02-15-preview"
  deployment_name: text-embedding-3-small
  api_key: ${AZURE_OPENAI_API_KEY}
```

5. Local LLM Support (Ollama)
```yaml
llm:
  type: ollama
  api_base: http://localhost:11434
  model: llama3:70b
  temperature: 0

embeddings:
  type: ollama
  api_base: http://localhost:11434
  model: nomic-embed-text
```

Cost Management
Understanding Costs
GraphRAG uses LLM APIs which incur costs:
Indexing Phase (most expensive):
- Entity extraction: Multiple LLM calls per TextUnit
- Relationship extraction: Additional calls
- Community summarization: Calls per community
- Embedding generation: Per entity/TextUnit
Query Phase (less expensive):
- Context retrieval: Minimal LLM use
- Answer synthesis: Single LLM call per query
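A back-of-the-envelope estimate follows from the structure above: indexing cost scales roughly with (number of chunks) × (LLM calls per chunk) × (tokens per call). The call count and price below are hypothetical placeholders, not measured GraphRAG figures:

```python
def estimate_indexing_tokens(
    total_doc_tokens: int,
    chunk_size: int = 1200,
    overlap: int = 100,
    calls_per_chunk: int = 3,  # hypothetical: entity + relationship + claim passes
) -> int:
    """Rough token volume sent to the LLM during indexing."""
    step = chunk_size - overlap
    num_chunks = max(1, -(-total_doc_tokens // step))  # ceiling division
    return num_chunks * calls_per_chunk * chunk_size

tokens = estimate_indexing_tokens(total_doc_tokens=1_000_000)
price_per_1k = 0.005  # hypothetical $/1K input tokens
print(f"~{tokens:,} tokens, ~${tokens / 1000 * price_per_1k:,.2f}")
```

Even as a rough sketch, this makes the lever points obvious: chunk size, overlap, and the number of extraction passes all multiply directly into the bill.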
Cost Optimization Strategies
1. Reduce Chunk Size

```yaml
chunks:
  size: 600   # Smaller chunks = fewer tokens
  overlap: 50
```

2. Limit Entity Extraction Passes

```yaml
entity_extraction:
  max_gleanings: 0  # 0 = single pass, 1 = two passes
```

3. Use Smaller Models

```yaml
llm:
  model: gpt-4o-mini  # Cheaper than gpt-4o
embeddings:
  model: text-embedding-3-small  # Cheaper than the large model
```

4. Process a Subset First

```bash
# Test on a small sample
mkdir -p input/sample
ls input/full/*.txt | head -5 | xargs -I{} cp {} input/sample/
graphrag index --root . --input-dir input/sample
```

5. Cache Aggressively

```yaml
cache:
  type: file
  base_dir: cache
```

Cost Estimation
python
undefinedpython
undefinedEstimate before indexing
索引前估算成本
from graphrag.index import estimate_index_cost
cost_estimate = estimate_index_cost(
input_dir="input/",
config_path="settings.yaml"
)
print(f"Estimated cost: ${cost_estimate.total_cost}")
print(f"Total tokens: {cost_estimate.total_tokens}")
print(f"Estimated time: {cost_estimate.estimated_hours} hours")
undefinedfrom graphrag.index import estimate_index_cost
cost_estimate = estimate_index_cost(
input_dir="input/",
config_path="settings.yaml"
)
print(f"估算成本: ${cost_estimate.total_cost}")
print(f"总token数: {cost_estimate.total_tokens}")
print(f"估算时间: {cost_estimate.estimated_hours} 小时")
undefinedBest Practices
1. Start Small

```bash
# Test with 5-10 documents first
# Validate outputs before scaling
# Tune prompts on a small sample
# Then scale to the full dataset
```

2. Monitor Indexing Progress
```bash
# Use verbose mode
graphrag index --root . --verbose

# Check output files periodically
ls -lh output/*.parquet

# Monitor logs
tail -f output/reports/indexing.log
```

3. Version Control Configuration
```bash
# Track changes
git add settings.yaml prompts/
git commit -m "Update entity types for domain X"

# Tag successful configurations
git tag -a v1.0-config -m "Working config for dataset X"
```

4. Validate Outputs
```python
import pandas as pd

# Check extracted entities
entities = pd.read_parquet("output/create_final_entities.parquet")
print(f"Total entities: {len(entities)}")
print(f"Entity types: {entities['type'].value_counts()}")

# Check relationships
relationships = pd.read_parquet("output/create_final_relationships.parquet")
print(f"Total relationships: {len(relationships)}")
print(f"Relationship types: {relationships['type'].value_counts()}")

# Check communities
communities = pd.read_parquet("output/create_final_communities.parquet")
print(f"Total communities: {len(communities)}")
print(f"Hierarchy levels: {communities['level'].value_counts()}")
```

5. Iterate on Prompts
```bash
# Run the initial index
graphrag index --root .

# Evaluate quality
graphrag query --method global --query "Test query"

# If quality is poor:
# 1. Adjust entity types in prompts
# 2. Modify extraction instructions
# 3. Re-run indexing
# 4. Validate improvements
```

Troubleshooting

Common Issues
"API rate limit exceeded"
"API速率限制超出"
```yaml
# Add delays between requests
parallelization:
  stagger: 1.0    # Increase delay
  num_threads: 2  # Reduce concurrency
llm:
  max_retries: 20     # More retries
  max_retry_wait: 60  # Longer backoff
```

"Out of memory during indexing"
```yaml
# Reduce batch sizes
chunks:
  size: 600  # Smaller chunks
parallelization:
  num_threads: 2  # Less parallelism
```

"Poor quality entity extraction"
```bash
# Run prompt tuning
graphrag prompt-tune --root . --domain "your domain"

# Manually refine prompts:
# add domain-specific examples and specify expected entity types clearly
nano prompts/entity_extraction.txt
```

"Queries return irrelevant results"
```bash
# Check whether indexing completed successfully
ls -lh output/*.parquet

# Validate extracted entities
python -c "import pandas as pd; print(pd.read_parquet('output/create_final_entities.parquet').head())"

# Try different query methods
graphrag query --method local --query "Your query"
graphrag query --method global --query "Your query"
```

"Version incompatibility after update"
```bash
# Reinitialize configuration; this updates settings.yaml to the new schema
graphrag init --root . --force

# Afterward, review the regenerated file and merge your customizations back in
```

Performance Optimization

Indexing Performance
```yaml
# Optimize for speed
parallelization:
  num_threads: 8  # Max concurrent workers
  stagger: 0.1    # Minimal delay
chunks:
  size: 1500  # Larger chunks (fewer API calls)
entity_extraction:
  max_gleanings: 0  # Single pass only
```

Query Performance
```python
from functools import lru_cache

import pandas as pd

# Cache query results (assumes a configured GraphRAG searcher object)
@lru_cache(maxsize=100)
def cached_query(query_text):
    return searcher.search(query_text)

# Pre-load data structures and keep them in memory for fast access
entities_df = pd.read_parquet("output/create_final_entities.parquet")
relationships_df = pd.read_parquet("output/create_final_relationships.parquet")
```

Storage Optimization
```yaml
# Use compressed storage
storage:
  type: file
  compression: gzip  # Or snappy, lz4
```

```yaml
# Or use database storage
storage:
  type: cosmosdb
  connection_string: ${COSMOS_CONNECTION_STRING}
```

Integration Examples
LangChain Integration
```python
from langchain.chains import RetrievalQA
from langchain.retrievers import GraphRAGRetriever
from langchain_openai import ChatOpenAI

# Create the GraphRAG retriever
retriever = GraphRAGRetriever(
    index_path="output/",
    search_method="local"
)

# Build the QA chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

# Query (RetrievalQA returns the answer under the "result" key)
result = qa_chain({"query": "What are the main themes?"})
print(result["result"])
```

FastAPI Service
```python
from fastapi import FastAPI
from graphrag.query import LocalSearch, GlobalSearch

app = FastAPI()

# Initialize searchers
local_searcher = LocalSearch(...)
global_searcher = GlobalSearch(...)

@app.post("/query/local")
async def query_local(query: str):
    result = await local_searcher.asearch(query)
    return {"response": result.response, "sources": result.sources}

@app.post("/query/global")
async def query_global(query: str):
    result = await global_searcher.asearch(query)
    return {"response": result.response}

# Run with: uvicorn main:app --reload
```

Streamlit UI
```python
import streamlit as st
from graphrag.query import GlobalSearch

searcher = GlobalSearch(...)

st.title("GraphRAG Query Interface")

# Query input
query = st.text_input("Enter your question:")
method = st.selectbox("Search method:", ["global", "local", "drift"])

if st.button("Search"):
    with st.spinner("Searching..."):
        # Streamlit scripts run synchronously, so use the sync search API
        result = searcher.search(query)

    # Display results
    st.write("### Answer")
    st.write(result.response)
    st.write("### Sources")
    st.write(result.sources)
```

Comparison with Other Approaches

GraphRAG vs. Vector RAG
| Feature | Vector RAG | GraphRAG |
|---|---|---|
| Structure | Flat embeddings | Knowledge graph |
| Relationships | Implicit (similarity) | Explicit (edges) |
| Multi-hop | Poor | Excellent |
| Summarization | Difficult | Natural (communities) |
| Setup Cost | Low | High (indexing) |
| Query Cost | Low | Medium |
| Best For | Simple lookups | Complex reasoning |
When to Use GraphRAG
✅ Use GraphRAG when:
- Queries require connecting multiple pieces of information
- Need holistic understanding of document corpus
- Relationships between entities matter
- Multi-hop reasoning is important
- Domain has rich entity/relationship structure
❌ Use Vector RAG when:
- Simple semantic search is sufficient
- Low setup cost is priority
- Documents are independent
- Queries are straightforward lookups
- Budget is constrained
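The two checklists can be condensed into a toy routing heuristic. The function below is an illustrative sketch only: the signal names and thresholds are invented for this guide and are not part of any GraphRAG API; real routing should be validated against your own query workload.

```python
# Toy router: pick between vector RAG and GraphRAG from coarse signals
# about the query workload. Signals and thresholds are invented for
# illustration.
def choose_rag_method(
    needs_multi_hop: bool,        # answers require connecting facts
    needs_corpus_summary: bool,   # holistic "what are the themes" queries
    relationships_matter: bool,   # entity/relationship structure is rich
    budget_constrained: bool,     # indexing cost is a hard limit
) -> str:
    graph_signals = sum([needs_multi_hop, needs_corpus_summary, relationships_matter])
    if budget_constrained and graph_signals == 0:
        return "vector-rag"
    # Any strong graph signal justifies the higher indexing cost
    return "graphrag" if graph_signals >= 1 else "vector-rag"

print(choose_rag_method(True, False, False, False))  # → graphrag
print(choose_rag_method(False, False, False, True))  # → vector-rag
```

The design point: GraphRAG's extra cost sits almost entirely in indexing, so the decision is really "do any of my queries need the graph", not "do all of them".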
Resources
Documentation
- Official Docs: https://microsoft.github.io/graphrag/
- GitHub: https://github.com/microsoft/graphrag
- Research Paper: https://arxiv.org/abs/2404.16130
Community
- GitHub Discussions: https://github.com/microsoft/graphrag/discussions
- Issues: https://github.com/microsoft/graphrag/issues
Examples
Important Notes
⚠️ Not an Official Microsoft Product
"This codebase is a demonstration of graph-based RAG and not an officially supported Microsoft offering."
💰 Cost Considerations
- Indexing can be expensive (especially with GPT-4)
- Test on small samples first
- Monitor API costs closely
🔄 Version Management
- Configuration schemas change between versions
- Run `graphrag init --force` after updates
- Review migration guides for breaking changes
🎯 Prompt Tuning is Critical
- Out-of-box results may be suboptimal
- Domain-specific tuning significantly improves quality
- Invest time in prompt customization
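In practice, customization usually means editing the entity-type list and few-shot examples in the generated `prompts/entity_extraction.txt`. A hypothetical fragment for a biomedical corpus might look like the following — the entity types and example are invented for illustration, and the exact output format must match the delimiter tokens in the prompt GraphRAG generated for you:

```text
-Target entity types-
DRUG, DISEASE, GENE, CLINICAL_TRIAL, INSTITUTION

-Example-
Text: "Pembrolizumab showed efficacy against melanoma in the KEYNOTE-006 trial."
Output:
("entity"{tuple_delimiter}PEMBROLIZUMAB{tuple_delimiter}DRUG{tuple_delimiter}An immunotherapy drug evaluated in KEYNOTE-006)
("entity"{tuple_delimiter}MELANOMA{tuple_delimiter}DISEASE{tuple_delimiter}A skin cancer treated in the trial)
("entity"{tuple_delimiter}KEYNOTE-006{tuple_delimiter}CLINICAL_TRIAL{tuple_delimiter}A trial comparing pembrolizumab with ipilimumab)
```

Concrete, domain-true examples like this tend to matter more than wording tweaks to the instructions themselves.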
License
许可证
Microsoft GraphRAG is released under the MIT License.
Note: This skill provides comprehensive guidance for using Microsoft GraphRAG. Always test on small datasets first, monitor costs, and tune prompts for your specific domain.