# OpenAlex Database
## Overview

OpenAlex is a comprehensive open catalog of 240M+ scholarly works, authors, institutions, topics, sources, publishers, and funders. This skill provides tools and workflows for querying the OpenAlex API to search literature, analyze research output, track citations, and conduct bibliometric studies.
## Quick Start

### Basic Setup

Always initialize the client with an email address to access the polite pool (10x rate-limit boost):

```python
from scripts.openalex_client import OpenAlexClient

client = OpenAlexClient(email="your-email@example.edu")
```

### Installation Requirements
Install the required package using uv:

```bash
uv pip install requests
```

No API key is required - OpenAlex is completely open.
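Since no key is needed, the API can also be queried directly over HTTPS; a minimal sketch of the underlying request the client wraps (the endpoint and `mailto` parameter follow OpenAlex's public API conventions):

```python
from urllib.parse import urlencode

# A raw query against api.openalex.org; the mailto parameter enrolls
# the request in the polite pool, just like the client's email argument.
base = "https://api.openalex.org/works"
params = {
    "search": "machine learning",
    "per-page": 5,
    "mailto": "your-email@example.edu",
}
url = f"{base}?{urlencode(params)}"
print(url)
# Fetch it with: requests.get(url).json()["results"]
```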
## Core Capabilities
### 1. Search for Papers

Use for: finding papers by title, abstract, or topic.

```python
# Simple search
results = client.search_works(
    search="machine learning",
    per_page=100
)

# Search with filters
results = client.search_works(
    search="CRISPR gene editing",
    filter_params={
        "publication_year": ">2020",
        "is_oa": "true"
    },
    sort="cited_by_count:desc"
)
```
### 2. Find Works by Author

Use for: getting all publications by a specific researcher.

Use the two-step pattern (entity name → ID → works):

```python
from scripts.query_helpers import find_author_works

works = find_author_works(
    author_name="Jennifer Doudna",
    client=client,
    limit=100
)
```

Manual two-step approach:

```python
# Step 1: Get author ID
author_response = client._make_request(
    '/authors',
    params={'search': 'Jennifer Doudna', 'per-page': 1}
)
author_id = author_response['results'][0]['id'].split('/')[-1]

# Step 2: Get works
works = client.search_works(
    filter_params={"authorships.author.id": author_id}
)
```
### 3. Find Works from Institution

Use for: analyzing research output from universities or organizations.

```python
from scripts.query_helpers import find_institution_works

works = find_institution_works(
    institution_name="Stanford University",
    client=client,
    limit=200
)
```

### 4. Highly Cited Papers
Use for: finding influential papers in a field.

```python
from scripts.query_helpers import find_highly_cited_recent_papers

papers = find_highly_cited_recent_papers(
    topic="quantum computing",
    years=">2020",
    client=client,
    limit=100
)
```

### 5. Open Access Papers
Use for: finding freely available research.

```python
from scripts.query_helpers import get_open_access_papers

papers = get_open_access_papers(
    search_term="climate change",
    client=client,
    oa_status="any",  # or "gold", "green", "hybrid", "bronze"
    limit=200
)
```

### 6. Publication Trends Analysis
Use for: tracking research output over time.

```python
from scripts.query_helpers import get_publication_trends

trends = get_publication_trends(
    search_term="artificial intelligence",
    filter_params={"is_oa": "true"},
    client=client
)

# Sort and display
for trend in sorted(trends, key=lambda x: x['key'])[-10:]:
    print(f"{trend['key']}: {trend['count']} publications")
```

### 7. Research Output Analysis
Use for: comprehensive analysis of an author's or institution's research.

```python
from scripts.query_helpers import analyze_research_output

analysis = analyze_research_output(
    entity_type='institution',  # or 'author'
    entity_name='MIT',
    client=client,
    years='>2020'
)
print(f"Total works: {analysis['total_works']}")
print(f"Open access: {analysis['open_access_percentage']}%")
print(f"Top topics: {analysis['top_topics'][:5]}")
```

### 8. Batch Lookups
Use for: efficiently fetching records for multiple DOIs, ORCIDs, or other IDs.

```python
dois = [
    "https://doi.org/10.1038/s41586-021-03819-2",
    "https://doi.org/10.1126/science.abc1234",
    # ... up to 50 DOIs
]
works = client.batch_lookup(
    entity_type='works',
    ids=dois,
    id_field='doi'
)
```

### 9. Random Sampling
Use for: getting representative samples for analysis.

```python
# Small sample
works = client.sample_works(
    sample_size=100,
    seed=42,  # for reproducibility
    filter_params={"publication_year": "2023"}
)

# Large sample (>10k) - automatically handles multiple requests
works = client.sample_works(
    sample_size=25000,
    seed=42,
    filter_params={"is_oa": "true"}
)
```

### 10. Citation Analysis
Use for: finding papers that cite a specific work.

```python
import requests

# Get the work
work = client.get_entity('works', 'https://doi.org/10.1038/s41586-021-03819-2')

# Get citing papers using cited_by_api_url
citing_response = requests.get(
    work['cited_by_api_url'],
    params={'mailto': client.email, 'per-page': 200}
)
citing_works = citing_response.json()['results']
```

### 11. Topic and Subject Analysis
Use for: understanding research focus areas.

```python
# Get top topics for an institution
topics = client.group_by(
    entity_type='works',
    group_field='topics.id',
    filter_params={
        "authorships.institutions.id": "I136199984",  # MIT
        "publication_year": ">2020"
    }
)
for topic in topics[:10]:
    print(f"{topic['key_display_name']}: {topic['count']} works")
```

### 12. Large-Scale Data Extraction
Use for: downloading large datasets for analysis.

```python
import csv

# Paginate through all results
all_papers = client.paginate_all(
    endpoint='/works',
    params={
        'search': 'synthetic biology',
        'filter': 'publication_year:2020-2024'
    },
    max_results=10000
)

# Export to CSV
with open('papers.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Year', 'Citations', 'DOI', 'OA Status'])
    for paper in all_papers:
        writer.writerow([
            paper.get('title', 'N/A'),
            paper.get('publication_year', 'N/A'),
            paper.get('cited_by_count', 0),
            paper.get('doi', 'N/A'),
            paper.get('open_access', {}).get('oa_status', 'closed')
        ])
```

## Critical Best Practices
### Always Use Email for Polite Pool

Adding an email raises the rate limit 10x (1 req/sec → 10 req/sec):

```python
client = OpenAlexClient(email="your-email@example.edu")
```

### Use Two-Step Pattern for Entity Lookups
Never filter by entity names directly; always resolve the ID first:

```python
# ✅ Correct:
#   1. Search for the entity -> get its ID
#   2. Filter by that ID

# ❌ Wrong:
# filter=author_name:Einstein  # This doesn't work!
```
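Step 1 of the pattern yields a full OpenAlex URL (e.g. from the `id` field), while filters want the short ID, i.e. the last path segment. A minimal, self-contained sketch of that extraction (the helper name and sample URLs are illustrative):

```python
def short_id(openalex_url: str) -> str:
    """Reduce a full OpenAlex entity URL to the short ID used in filters."""
    return openalex_url.rstrip('/').split('/')[-1]

# Works for any entity type: authors (A...), works (W...), institutions (I...)
print(short_id("https://openalex.org/A5017898742"))  # A5017898742
print(short_id("https://openalex.org/I136199984/"))  # I136199984
```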
### Use Maximum Page Size
Always use `per-page=200` for efficient data retrieval:

```python
results = client.search_works(search="topic", per_page=200)
```

### Batch Multiple IDs
Use batch_lookup() for multiple IDs instead of individual requests:

```python
# ✅ Correct - 1 request for 50 DOIs
works = client.batch_lookup('works', doi_list, 'doi')

# ❌ Wrong - 50 separate requests
for doi in doi_list:
    work = client.get_entity('works', doi)
```

### Use Sample Parameter for Random Data
Use sample_works() with a seed for reproducible random sampling:

```python
# ✅ Correct
works = client.sample_works(sample_size=100, seed=42)

# ❌ Wrong - fetching random page numbers biases the results and does
# not give a true random sample
```

### Select Only Needed Fields
Reduce response size by selecting only the fields you need:

```python
results = client.search_works(
    search="topic",
    select=['id', 'title', 'publication_year', 'cited_by_count']
)
```

## Common Filter Patterns
### Date Ranges

```python
# Single year
filter_params={"publication_year": "2023"}

# After a given year
filter_params={"publication_year": ">2020"}

# Range
filter_params={"publication_year": "2020-2024"}
```

### Multiple Filters (AND)
```python
# All conditions must match
filter_params={
    "publication_year": ">2020",
    "is_oa": "true",
    "cited_by_count": ">100"
}
```

### Multiple Values (OR)
```python
# Any listed institution matches
filter_params={
    "authorships.institutions.id": "I136199984|I27837315"  # MIT or Harvard
}
```

### Collaboration (AND within attribute)
```python
# Papers with authors from BOTH institutions
filter_params={
    "authorships.institutions.id": "I136199984+I27837315"  # MIT AND Harvard
}
```

### Negation
```python
# Exclude a type
filter_params={
    "type": "!paratext"
}
```
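On the wire, these filter_params dicts collapse into the API's single filter string: key:value pairs joined by commas (AND), with `|` for OR and `+` for AND within one attribute, as shown above. A minimal sketch of that serialization (the helper name is illustrative, not part of the client):

```python
def build_filter(filter_params: dict) -> str:
    """Serialize a filter_params dict into OpenAlex's comma-joined filter string."""
    return ",".join(f"{key}:{value}" for key, value in filter_params.items())

print(build_filter({
    "publication_year": ">2020",
    "is_oa": "true",
    "cited_by_count": ">100",
}))
# publication_year:>2020,is_oa:true,cited_by_count:>100
```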
## Entity Types

OpenAlex provides these entity types:

- works - Scholarly documents (articles, books, datasets)
- authors - Researchers with disambiguated identities
- institutions - Universities and research organizations
- sources - Journals, repositories, conferences
- topics - Subject classifications
- publishers - Publishing organizations
- funders - Funding agencies

Access any entity type using the same consistent patterns:

```python
client.search_works(...)
client.get_entity('authors', author_id)
client.group_by('works', 'topics.id', filter_params={...})
```

## External IDs
Use external identifiers directly:

```python
# DOI for works
work = client.get_entity('works', 'https://doi.org/10.7717/peerj.4375')

# ORCID for authors
author = client.get_entity('authors', 'https://orcid.org/0000-0003-1613-5981')

# ROR for institutions
institution = client.get_entity('institutions', 'https://ror.org/02y3ad647')

# ISSN for sources
source = client.get_entity('sources', 'issn:0028-0836')
```
undefinedReference Documentation
参考文档
### Detailed API Reference

See references/api_guide.md for:

- Complete filter syntax
- All available endpoints
- Response structures
- Error handling
- Performance optimization
- Rate limiting details
### Common Query Examples

See references/common_queries.md for:

- Complete working examples
- Real-world use cases
- Complex query patterns
- Data export workflows
- Multi-step analysis procedures
## Scripts

### openalex_client.py

Main API client with:

- Automatic rate limiting
- Exponential backoff retry logic
- Pagination support
- Batch operations
- Error handling

Use for direct API access with full control.
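The exponential backoff the client applies can be sketched as a doubling delay between attempts; a minimal illustration (the real client's retry constants and exception handling may differ):

```python
import time

def with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch(); on failure, retry with doubling delays (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retry budget exhausted; surface the error
            time.sleep(base_delay * (2 ** attempt))
```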
### query_helpers.py

High-level helper functions for common operations:

- find_author_works() - Get papers by author
- find_institution_works() - Get papers from an institution
- find_highly_cited_recent_papers() - Get influential papers
- get_open_access_papers() - Find OA publications
- get_publication_trends() - Analyze trends over time
- analyze_research_output() - Comprehensive analysis

Use for common research queries with simplified interfaces.
## Troubleshooting

### Rate Limiting

If you encounter 403 errors:

- Ensure an email is attached to requests
- Verify you are not exceeding 10 req/sec
- The client automatically applies exponential backoff
### Empty Results

If searches return no results:

- Check the filter syntax (see references/api_guide.md)
- Use the two-step pattern for entity lookups (don't filter by names)
- Verify entity IDs are in the correct format
### Timeout Errors

For large queries:

- Paginate with per-page=200
- Use select= to limit returned fields
- Break into smaller queries if needed
## Rate Limits

- Default pool: 1 request/second, 100k requests/day
- Polite pool (with email): 10 requests/second, 100k requests/day

Always use the polite pool for production workflows by providing an email to the client.
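Staying under the per-second budget amounts to spacing requests by a minimum interval; a minimal sketch (the real client's throttling may be more sophisticated):

```python
import time

class Throttle:
    """Enforce a minimum interval between calls (10 req/sec -> 0.1 s apart)."""

    def __init__(self, max_per_second: float):
        self.interval = 1.0 / max_per_second
        self._last = 0.0

    def wait(self):
        # Sleep just long enough so calls are at least `interval` apart.
        now = time.monotonic()
        sleep_for = self._last + self.interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()

throttle = Throttle(max_per_second=10)
# Call throttle.wait() before each API request.
```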
## Notes

- No authentication required
- All data is open and free
- Rate limits apply globally, not per IP
- If LLM-based analysis is needed, use LitLLM with OpenRouter (don't use the Perplexity API directly)
- The client handles pagination, retries, and rate limiting automatically