Comprehensive skill for Microsoft GraphRAG - modular graph-based RAG system for reasoning over private datasets
npx skill4agent add aeonbridge/ab-anthropic-claude-skills graphrag

"GraphRAG addresses fundamental limitations of baseline RAG: connecting the dots across disparate information pieces and holistically understanding summarized concepts over large collections."
Entity examples:
- "Microsoft" (Organization)
- "Seattle" (Location)
- "Cloud Computing" (Concept)
- "Satya Nadella" (Person)

Relationship examples:
- Microsoft → headquartered_in → Seattle
- Satya Nadella → is_CEO_of → Microsoft
- Microsoft → provides → Cloud Computing

Claim examples:
- "Microsoft is the largest software company" [Source: Document X, Page 5]
- "Azure revenue grew 30% in Q4" [Source: Earnings Report]

Community hierarchy example:

Level 0 (Detailed):
Community 1: Azure services (Compute, Storage, Networking)
Community 2: Office products (Word, Excel, PowerPoint)

Level 1 (Mid-level):
Community A: Cloud services (includes Community 1)
Community B: Productivity tools (includes Community 2)

Level 2 (High-level):
Community X: Microsoft product ecosystem (includes A & B)

# Python 3.10 or higher required
python --version
# Install GraphRAG
pip install graphrag
# Or install from source
git clone https://github.com/microsoft/graphrag.git
cd graphrag
pip install -e .

# Create environment file
cat > .env << EOF
# LLM Configuration (OpenAI)
GRAPHRAG_LLM_API_KEY=your-openai-api-key
GRAPHRAG_LLM_TYPE=openai_chat
GRAPHRAG_LLM_MODEL=gpt-4o
# Embedding Configuration
GRAPHRAG_EMBEDDING_API_KEY=your-openai-api-key
GRAPHRAG_EMBEDDING_TYPE=openai_embedding
GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small
# Optional: Azure OpenAI
# GRAPHRAG_LLM_API_BASE=https://your-resource.openai.azure.com
# GRAPHRAG_LLM_API_VERSION=2024-02-15-preview
# GRAPHRAG_LLM_DEPLOYMENT_NAME=gpt-4
# Optional: Local models
# GRAPHRAG_LLM_TYPE=ollama
# GRAPHRAG_LLM_API_BASE=http://localhost:11434
EOF

# Create new GraphRAG project
mkdir my-graphrag-project
cd my-graphrag-project
# Initialize configuration
graphrag init --root .
# This creates:
# - settings.yaml (configuration)
# - .env (environment variables)
# - prompts/ (customizable prompts)

# Create input directory
mkdir -p input
# Add your documents
cp /path/to/documents/*.txt input/
# Supported formats: .txt, .pdf, .docx, .md
# Each file will be processed independently

# Index your data (this can take time and cost money!)
graphrag index --root .
# The indexing process will:
# 1. Load and chunk documents
# 2. Extract entities, relationships, claims
# 3. Build knowledge graph
# 4. Detect communities (Leiden algorithm)
# 5. Generate community summaries
# 6. Create embeddings
# 7. Store results in output/
# Monitor progress
graphrag index --root . --verbose

# Global Search (holistic queries)
graphrag query --root . \
--method global \
--query "What are the main themes in this dataset?"
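Conceptually, global search is a map-reduce over community summaries: each summary is mapped to a scored partial answer, and the top-scoring partials are reduced into one response. A toy sketch, with keyword overlap standing in for the LLM scoring the real system uses (all names hypothetical):

```python
# Toy map-reduce over community summaries; keyword overlap stands in
# for the LLM calls the real global search performs.
def map_step(query: str, summaries: list[str]) -> list[tuple[float, str]]:
    q_terms = set(query.lower().split())
    scored = []
    for s in summaries:
        overlap = len(q_terms & set(s.lower().split()))
        if overlap:
            scored.append((overlap / len(q_terms), s))
    return scored

def reduce_step(scored: list[tuple[float, str]], top_k: int = 2) -> str:
    # Keep the top-k partials and merge them into one answer.
    best = sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]
    return " ".join(s for _, s in best)

summaries = [
    "Community A covers cloud services and Azure growth",
    "Community B covers productivity tools like Office",
    "Community C covers gaming hardware",
]
answer = reduce_step(map_step("what drives cloud growth", summaries))
print(answer)
```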
# Local Search (entity-specific queries)
graphrag query --root . \
--method local \
--query "Tell me about Microsoft's cloud strategy"
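Local search starts from the entities matched in the query and expands outward through the graph; the expansion step amounts to a bounded breadth-first traversal (a sketch over a plain adjacency list, not the GraphRAG internals):

```python
from collections import deque

def neighborhood(graph: dict[str, list[str]], seeds: list[str], max_hops: int = 2) -> set[str]:
    # Bounded BFS: collect every entity within max_hops of a seed entity.
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

graph = {
    "Microsoft": ["Azure", "Satya Nadella"],
    "Azure": ["Azure AI"],
    "Azure AI": ["OpenAI"],
}
print(sorted(neighborhood(graph, ["Microsoft"])))
```

The collected neighborhood is what gets stuffed into the context window alongside relationships, claims, and TextUnits.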
# DRIFT Search (entity + community context)
graphrag query --root . \
--method drift \
--query "How does Azure relate to the broader Microsoft ecosystem?"

# Core Configuration
llm:
  api_key: ${GRAPHRAG_LLM_API_KEY}
  type: openai_chat  # or azure_openai_chat, ollama
  model: gpt-4o
  max_tokens: 4000
  temperature: 0
  top_p: 1

embeddings:
  api_key: ${GRAPHRAG_EMBEDDING_API_KEY}
  type: openai_embedding
  model: text-embedding-3-small

# Chunking Configuration
chunks:
  size: 1200    # Token size per chunk
  overlap: 100  # Overlap between chunks
  group_by_columns: [id]
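The size/overlap settings above describe a sliding window over tokens: each chunk shares its last `overlap` tokens with the next chunk so context survives the boundary. A minimal sketch using a plain list in place of the real tokenizer (assumes overlap < size):

```python
def chunk(tokens, size=1200, overlap=100):
    # Slide a window of `size` tokens, stepping size - overlap each time,
    # so consecutive chunks share `overlap` tokens of context.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = list(range(30))
chunks = chunk(tokens, size=10, overlap=2)
print([(c[0], c[-1]) for c in chunks])
```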
# Entity Extraction
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  max_gleanings: 1  # Re-extraction passes
  entity_types: [organization, person, location, event]
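Entity extraction asks the model for structured JSON (the same shape as the format in the entity extraction prompt shown later in this document), which is then parsed into node and edge rows. A sketch of that parsing step, using a hand-written sample response:

```python
import json

# Sample model response in the prompt's JSON format (hand-written here).
response = """{
  "entities": [
    {"name": "Microsoft", "type": "ORGANIZATION", "description": "Software company"},
    {"name": "Seattle", "type": "LOCATION", "description": "City"}
  ],
  "relationships": [
    {"source": "Microsoft", "target": "Seattle", "type": "HEADQUARTERED_IN", "description": "HQ"}
  ]
}"""

parsed = json.loads(response)
# Nodes keyed by name; edges as (source, type, target) triples.
nodes = {e["name"]: e["type"] for e in parsed["entities"]}
edges = [(r["source"], r["type"], r["target"]) for r in parsed["relationships"]]
print(nodes)
print(edges)
```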
# Community Detection
community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

# Claim Extraction
claim_extraction:
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  max_gleanings: 1

# Embeddings
embed_graph:
  enabled: true
  strategy: node2vec  # or deepwalk

# Storage
storage:
  type: file  # or blob, cosmosdb
  base_dir: output

# Reporting
reporting:
  type: file
  base_dir: output/reports

# Custom LLM Configuration
llm:
  type: azure_openai_chat
  api_base: https://your-resource.openai.azure.com
  api_version: "2024-02-15-preview"
  deployment_name: gpt-4
  api_key: ${AZURE_OPENAI_API_KEY}
  request_timeout: 180
  max_retries: 10
  max_retry_wait: 10

# Parallelization
parallelization:
  stagger: 0.3    # Delay between requests
  num_threads: 4  # Concurrent workers
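The stagger/num_threads pair bounds request rate: a cap on concurrent workers plus a delay before each call. The same pattern in plain asyncio is a semaphore plus a sleep (a sketch; the "processed:" call is a stand-in for an LLM request, not GraphRAG internals):

```python
import asyncio

async def throttled_calls(items, num_threads=4, stagger=0.01):
    # Semaphore caps concurrency; the sleep staggers request starts.
    sem = asyncio.Semaphore(num_threads)

    async def one(item):
        async with sem:
            await asyncio.sleep(stagger)   # stagger delay
            return f"processed:{item}"     # stand-in for an LLM call

    # gather preserves input order in its results.
    return await asyncio.gather(*(one(i) for i in items))

results = asyncio.run(throttled_calls(["a", "b", "c"]))
print(results)
```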
# Cache Configuration
cache:
  type: file
  base_dir: cache

# Input Configuration
input:
  type: file
  file_type: text  # or csv, parquet
  base_dir: input
  encoding: utf-8
  file_pattern: ".*\\.txt$"

"Using GraphRAG with your data out of the box may not yield the best possible results."
# Generate domain-adapted prompts
graphrag prompt-tune --root . \
--config settings.yaml \
--output prompts/
# This will:
# 1. Analyze your input documents
# 2. Identify domain-specific patterns
# 3. Generate custom entity extraction prompts
# 4. Generate custom summarization prompts
# 5. Save to prompts/ directory

# Edit generated prompts
nano prompts/entity_extraction.txt

-Target activity-
You are an AI assistant helping to identify entities in documents about {DOMAIN}.
-Goal-
Extract all entities and relationships from the text below.
Entity Types:
{ENTITY_TYPES}
Relationship Types:
{RELATIONSHIP_TYPES}
Format your response as JSON:
{{
  "entities": [
    {{"name": "Entity Name", "type": "ENTITY_TYPE", "description": "..."}}
  ],
  "relationships": [
    {{"source": "Entity 1", "target": "Entity 2", "type": "RELATIONSHIP_TYPE", "description": "..."}}
  ]
}}
Text to analyze:
{INPUT_TEXT}

# Input documents are loaded from input/ directory
# Supported formats: .txt, .pdf, .docx, .md

# Documents split into TextUnits
# Default: 1200 tokens with 100 token overlap
# Preserves context across chunk boundaries

# For each TextUnit:
# - Extract entities (with types and descriptions)
# - Extract relationships (with types and weights)
# - Extract claims (with sources and confidence)

# Build knowledge graph:
# - Nodes = Entities
# - Edges = Relationships
# - Properties = Attributes and metadata

# Leiden algorithm for hierarchical clustering:
# - Level 0: Fine-grained communities
# - Level 1: Mid-level aggregations
# - Level 2+: High-level themes

# For each community at each level:
# - Aggregate entity and relationship info
# - Generate natural language summary
# - Store for query-time retrieval

# Create vector embeddings for:
# - TextUnits (for similarity search)
# - Entities (for semantic matching)
# - Community summaries (for global search)

# Results saved to output/:
# - create_final_entities.parquet
# - create_final_relationships.parquet
# - create_final_communities.parquet
# - create_final_community_reports.parquet
# - create_final_text_units.parquet

graphrag query --root . \
--method global \
--query "What are the major technology trends discussed in these documents?"
# Behind the scenes:
# 1. Match query to relevant communities
# 2. Retrieve summaries from levels 0, 1, 2
# 3. Aggregate: AI/ML, Cloud, Cybersecurity communities
# 4. Synthesize comprehensive answer

from graphrag.query import GlobalSearch
searcher = GlobalSearch(
    llm=llm,
    context_builder=context_builder,
    map_system_prompt=map_prompt,
    reduce_system_prompt=reduce_prompt
)
result = await searcher.asearch(
    query="What are the major themes?",
    conversation_history=[]
)
print(result.response)

graphrag query --root . \
--method local \
--query "What is Microsoft's strategy for artificial intelligence?"
# Behind the scenes:
# 1. Identify: "Microsoft", "artificial intelligence" entities
# 2. Traverse: Find related entities (Azure AI, OpenAI partnership, etc.)
# 3. Collect: Relationships, claims, TextUnits
# 4. Synthesize: Answer from local graph neighborhood

from graphrag.query import LocalSearch
searcher = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    system_prompt=system_prompt
)
result = await searcher.asearch(
    query="Tell me about Microsoft's AI strategy",
    conversation_history=[]
)
print(result.response)

graphrag query --root . \
--method drift \
--query "How does Azure AI relate to Microsoft's overall cloud strategy?"
# Behind the scenes:
# 1. Local: Find "Azure AI" entity and neighborhood
# 2. Global: Find "cloud strategy" community summaries
# 3. Combine: Entity details + strategic context
# 4. Synthesize: Comprehensive answer

import asyncio
from graphrag.query import LocalSearch, GlobalSearch
from graphrag.llm import create_openai_chat_llm
from graphrag.config import GraphRagConfig
# Load configuration
config = GraphRagConfig.from_file("settings.yaml")
# Create LLM
llm = create_openai_chat_llm(
    api_key=config.llm.api_key,
    model=config.llm.model,
    temperature=0.0
)

from graphrag.index import run_pipeline_with_config
# Run indexing programmatically
await run_pipeline_with_config(
    config_path="settings.yaml",
    verbose=True
)

from graphrag.query.context_builder import LocalContextBuilder
# Build custom context
context_builder = LocalContextBuilder(
    entities=entities_df,
    relationships=relationships_df,
    text_units=text_units_df,
    embeddings=embeddings
)

# Custom search with parameters
result = await searcher.asearch(
    query="Your question here",
    conversation_history=[
        {"role": "user", "content": "Previous question"},
        {"role": "assistant", "content": "Previous answer"}
    ],
    top_k=10,         # Number of results
    temperature=0.5,  # LLM creativity
    max_tokens=2000   # Response length
)
# Access detailed results
print("Response:", result.response)
print("Context used:", result.context_data)
print("Sources:", result.sources)

# Index academic papers
mkdir -p input/papers
cp research_papers/*.pdf input/papers/
graphrag index --root .
# Global query
graphrag query --method global \
--query "What are the main research themes across these papers?"
# Local query
graphrag query --method local \
--query "What methodologies does the Smith et al. paper use?"

# Index legal contracts
mkdir -p input/contracts
cp contracts/*.docx input/contracts/
# Tune prompts for legal domain
graphrag prompt-tune --root . --domain "legal contracts"
# Index with legal-specific entities
graphrag index --root .
# Query
graphrag query --method local \
--query "What are the termination clauses in the Microsoft contracts?"

# Index customer feedback
mkdir -p input/feedback
cp feedback_*.txt input/feedback/
# Global themes
graphrag query --method global \
--query "What are the main customer pain points?"
# Specific product feedback
graphrag query --method local \
--query "What feedback relates to product X features?"

# Index news articles
mkdir -p input/news
cp articles/*.txt input/news/
graphrag index --root .
# Get comprehensive summary
graphrag query --method global \
--query "Summarize the key events and trends from these news articles"
# Entity-specific news
graphrag query --method local \
--query "What news relates to climate change initiatives?"

# Initial indexing
graphrag index --root .
# Add new documents
cp new_documents/*.txt input/
# Re-index only new content
graphrag index --root . --incremental
# Note: Full graph may need periodic rebuilding

prompts/entity_extraction.txt

Entity Types:
- PRODUCT: Software products, services
- FEATURE: Product features and capabilities
- TECHNOLOGY: Technologies and frameworks
- METRIC: Performance metrics, KPIs
- INITIATIVE: Projects and strategic initiatives
- COMPETITOR: Competing products or companies

# settings.yaml
input:
  encoding: utf-8
  language: es  # Spanish

llm:
  model: gpt-4o  # Multilingual model

# Customize prompts in target language

llm:
  type: azure_openai_chat
  api_base: https://your-resource.openai.azure.com
  api_version: "2024-02-15-preview"
  deployment_name: gpt-4
  api_key: ${AZURE_OPENAI_API_KEY}

embeddings:
  type: azure_openai_embedding
  api_base: https://your-resource.openai.azure.com
  api_version: "2024-02-15-preview"
  deployment_name: text-embedding-3-small
  api_key: ${AZURE_OPENAI_API_KEY}

llm:
  type: ollama
  api_base: http://localhost:11434
  model: llama3:70b
  temperature: 0

embeddings:
  type: ollama
  api_base: http://localhost:11434
  model: nomic-embed-text

chunks:
  size: 600    # Smaller chunks = fewer tokens
  overlap: 50

entity_extraction:
  max_gleanings: 0  # 0 = single pass, 1 = two passes

llm:
  model: gpt-4o-mini  # Cheaper than gpt-4o

embeddings:
  model: text-embedding-3-small  # Cheaper than large

# Test on small sample
mkdir input/sample
ls input/full/*.txt | head -5 | xargs -I{} cp {} input/sample/
graphrag index --root . --input-dir input/sample

cache:
  type: file
  base_dir: cache

# Estimate before indexing
# Note: estimate_index_cost is illustrative; cost-estimation helpers vary by GraphRAG version
from graphrag.index import estimate_index_cost

cost_estimate = estimate_index_cost(
    input_dir="input/",
    config_path="settings.yaml"
)
print(f"Estimated cost: ${cost_estimate.total_cost}")
print(f"Total tokens: {cost_estimate.total_tokens}")
print(f"Estimated time: {cost_estimate.estimated_hours} hours")

# Test with 5-10 documents first
# Validate outputs before scaling
# Tune prompts on small sample
# Then scale to full dataset

# Use verbose mode
graphrag index --root . --verbose
# Check output files periodically
ls -lh output/*.parquet
# Monitor logs
tail -f output/reports/indexing.log

# Track changes
git add settings.yaml prompts/
git commit -m "Update entity types for domain X"
# Tag successful configurations
git tag -a v1.0-config -m "Working config for dataset X"

import pandas as pd
# Check extracted entities
entities = pd.read_parquet("output/create_final_entities.parquet")
print(f"Total entities: {len(entities)}")
print(f"Entity types: {entities['type'].value_counts()}")
# Check relationships
relationships = pd.read_parquet("output/create_final_relationships.parquet")
print(f"Total relationships: {len(relationships)}")
print(f"Relationship types: {relationships['type'].value_counts()}")
# Check communities
communities = pd.read_parquet("output/create_final_communities.parquet")
print(f"Total communities: {len(communities)}")
print(f"Hierarchy levels: {communities['level'].value_counts()}")

# Run initial index
graphrag index --root .
# Evaluate quality
graphrag query --method global --query "Test query"
# If quality is poor:
# 1. Adjust entity types in prompts
# 2. Modify extraction instructions
# 3. Re-run indexing
# 4. Validate improvements

# Add delays between requests
parallelization:
  stagger: 1.0     # Increase delay
  num_threads: 2   # Reduce concurrency

llm:
  max_retries: 20     # More retries
  max_retry_wait: 60  # Longer backoff

# Reduce batch sizes
chunks:
  size: 600  # Smaller chunks

parallelization:
  num_threads: 2  # Less parallelism

# Run prompt tuning
graphrag prompt-tune --root . --domain "your domain"
# Manually refine prompts
nano prompts/entity_extraction.txt
# Add domain-specific examples
# Specify expected entity types clearly

# Check if indexing completed successfully
ls -lh output/*.parquet
# Validate extracted entities
python -c "import pandas as pd; print(pd.read_parquet('output/create_final_entities.parquet').head())"
# Try different query methods
graphrag query --method local --query "Your query"
graphrag query --method global --query "Your query"

# Reinitialize configuration
graphrag init --root . --force
# This updates settings.yaml to new schema
# Review and merge your customizations

# Optimize for speed
parallelization:
  num_threads: 8  # Max concurrent workers
  stagger: 0.1    # Minimal delay

chunks:
  size: 1500  # Larger chunks (fewer API calls)

entity_extraction:
  max_gleanings: 0  # Single pass only

# Cache query results
from functools import lru_cache
@lru_cache(maxsize=100)
def cached_query(query_text):
    return searcher.search(query_text)
# Pre-load data structures
entities_df = pd.read_parquet("output/create_final_entities.parquet")
relationships_df = pd.read_parquet("output/create_final_relationships.parquet")
# Keep in memory for fast access

# Use compressed storage
storage:
  type: file
  compression: gzip  # Or snappy, lz4

# Or use database storage
storage:
  type: cosmosdb
  connection_string: ${COSMOS_CONNECTION_STRING}

# Note: GraphRAGRetriever here is illustrative; check whether your
# LangChain version provides a GraphRAG integration
from langchain.retrievers import GraphRAGRetriever
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
# Create GraphRAG retriever
retriever = GraphRAGRetriever(
    index_path="output/",
    search_method="local"
)

# Build QA chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)
# Query
result = qa_chain.invoke({"query": "What are the main themes?"})
print(result["result"])

from fastapi import FastAPI
from graphrag.query import LocalSearch, GlobalSearch
app = FastAPI()
# Initialize searchers
local_searcher = LocalSearch(...)
global_searcher = GlobalSearch(...)

@app.post("/query/local")
async def query_local(query: str):
    result = await local_searcher.asearch(query)
    return {"response": result.response, "sources": result.sources}

@app.post("/query/global")
async def query_global(query: str):
    result = await global_searcher.asearch(query)
    return {"response": result.response}
# Run: uvicorn main:app --reload

import streamlit as st
from graphrag.query import GlobalSearch
st.title("GraphRAG Query Interface")
# Query input
query = st.text_input("Enter your question:")
method = st.selectbox("Search method:", ["global", "local", "drift"])
if st.button("Search"):
    with st.spinner("Searching..."):
        # Run query (asyncio.run drives the coroutine; requires "import asyncio")
        result = asyncio.run(searcher.asearch(query))

    # Display results
    st.write("### Answer")
    st.write(result.response)
    st.write("### Sources")
    st.write(result.sources)

| Feature | Vector RAG | GraphRAG |
|---|---|---|
| Structure | Flat embeddings | Knowledge graph |
| Relationships | Implicit (similarity) | Explicit (edges) |
| Multi-hop | Poor | Excellent |
| Summarization | Difficult | Natural (communities) |
| Setup Cost | Low | High (indexing) |
| Query Cost | Low | Medium |
| Best For | Simple lookups | Complex reasoning |
"This codebase is a demonstration of graph-based RAG and not an officially supported Microsoft offering."
graphrag init --root . --force