openalex-database

OpenAlex Database

Overview

OpenAlex is a comprehensive open catalog of more than 240 million scholarly works, together with their authors, institutions, topics, sources, publishers, and funders. This skill provides tools and workflows for querying the OpenAlex API to search literature, analyze research output, track citations, and conduct bibliometric studies.

Quick Start

Basic Setup

Always initialize the client with an email address to access the polite pool (10x rate limit boost):
python
from scripts.openalex_client import OpenAlexClient

client = OpenAlexClient(email="your-email@example.edu")
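As a rough illustration of what passing an email does, the sketch below builds a request URL that carries a `mailto` query parameter, which is how the OpenAlex API recognizes polite-pool traffic. The helper name `build_url` is hypothetical and is not part of `openalex_client.py`; the real client may assemble its requests differently.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org"


def build_url(endpoint, email=None, **params):
    """Return a full OpenAlex URL, adding mailto when an email is given."""
    query = dict(params)
    if email:
        # The mailto parameter opts the request into the polite pool.
        query["mailto"] = email
    suffix = f"?{urlencode(query)}" if query else ""
    return f"{BASE_URL}{endpoint}{suffix}"


url = build_url("/works", email="your-email@example.edu", search="crispr")
print(url)
```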

Installation Requirements

Install required package using uv:
bash
uv pip install requests
No API key is required; OpenAlex is completely open.

Core Capabilities

1. Search for Papers

Use for: Finding papers by title, abstract, or topic
python
# Simple search
results = client.search_works(
    search="machine learning",
    per_page=100
)

# Search with filters
results = client.search_works(
    search="CRISPR gene editing",
    filter_params={
        "publication_year": ">2020",
        "is_oa": "true"
    },
    sort="cited_by_count:desc"
)
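To make the filtered search less opaque, here is a minimal sketch of how a `filter_params` dict could be turned into the API's `filter` query string, where conditions are joined by commas. The function `build_filter` is a hypothetical stand-in; the client's actual conversion is not shown in this document and may differ.

```python
def build_filter(filter_params):
    """Join a dict of filters into a comma-separated filter string."""
    return ",".join(f"{key}:{value}" for key, value in filter_params.items())


# Query parameters roughly equivalent to the filtered search above.
params = {
    "search": "CRISPR gene editing",
    "filter": build_filter({"publication_year": ">2020", "is_oa": "true"}),
    "sort": "cited_by_count:desc",
}
print(params["filter"])
```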

2. Find Works by Author

Use for: Getting all publications by a specific researcher
Use the two-step pattern (entity name → ID → works):
python
from scripts.query_helpers import find_author_works

works = find_author_works(
    author_name="Jennifer Doudna",
    client=client,
    limit=100
)
Manual two-step approach:
python
# Step 1: Get author ID
author_response = client._make_request(
    '/authors',
    params={'search': 'Jennifer Doudna', 'per-page': 1}
)
author_id = author_response['results'][0]['id'].split('/')[-1]

# Step 2: Get works
works = client.search_works(
    filter_params={"authorships.author.id": author_id}
)
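The manual approach indexes `results[0]` directly, which raises an `IndexError` when the name search matches nothing. The sketch below shows one way to guard that step. The helper names and the sample response are hypothetical; the response is trimmed to the fields the manual example relies on.

```python
def short_id(openalex_url):
    """Return the trailing ID segment of an OpenAlex entity URL."""
    return openalex_url.rstrip("/").split("/")[-1]


def first_author_id(author_response):
    """Return the short ID of the top search hit, or None if nothing matched."""
    results = author_response.get("results", [])
    return short_id(results[0]["id"]) if results else None


# Hypothetical response, for illustration only.
sample = {
    "results": [
        {"id": "https://openalex.org/A1234567890", "display_name": "Example Author"}
    ]
}
print(first_author_id(sample))
```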

3. Find Works from Institution

Use for: Analyzing research output from universities or organizations
python
from scripts.query_helpers import find_institution_works

works = find_institution_works(
    institution_name="Stanford University",
    client=client,
    limit=200
)

4. Highly Cited Papers

Use for: Finding influential papers in a field
python
from scripts.query_helpers import find_highly_cited_recent_papers

papers = find_highly_cited_recent_papers(
    topic="quantum computing",
    years=">2020",
    client=client,
    limit=100
)

5. Open Access Papers

Use for: Finding freely available research
python
from scripts.query_helpers import get_open_access_papers

papers = get_open_access_papers(
    search_term="climate change",
    client=client,
    oa_status="any",  # or "gold", "green", "hybrid", "bronze"
    limit=200
)

6. Publication Trends Analysis

Use for: Tracking research output over time
python
from scripts.query_helpers import get_publication_trends

trends = get_publication_trends(
    search_term="artificial intelligence",
    filter_params={"is_oa": "true"},
    client=client
)

# Sort and display
for trend in sorted(trends, key=lambda x: x['key'])[-10:]:
    print(f"{trend['key']}: {trend['count']} publications")
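Once the yearly counts are available, a year-over-year difference is often more telling than the raw numbers. The sketch below assumes each trend entry is a dict with a year string under `key` and an integer under `count`, as in the display loop above. The function name and the sample counts are made up for illustration.

```python
def yearly_change(trends):
    """Return (year, count, change_from_previous_year) tuples sorted by year."""
    ordered = sorted(trends, key=lambda t: int(t["key"]))
    rows = []
    previous = None
    for item in ordered:
        change = None if previous is None else item["count"] - previous
        rows.append((item["key"], item["count"], change))
        previous = item["count"]
    return rows


# Made-up counts, not real OpenAlex data.
sample_trends = [
    {"key": "2022", "count": 150},
    {"key": "2020", "count": 100},
    {"key": "2021", "count": 120},
]
for year, count, change in yearly_change(sample_trends):
    print(year, count, change)
```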

7. Research Output Analysis

Use for: Comprehensive analysis of author or institution research
python
from scripts.query_helpers import analyze_research_output

analysis = analyze_research_output(
    entity_type='institution',  # or 'author'
    entity_name='MIT',
    client=client,
    years='>2020'
)

print(f"Total works: {analysis['total_works']}")
print(f"Open access: {analysis['open_access_percentage']}%")
print(f"Top topics: {analysis['top_topics'][:5]}")

8. Batch Lookups

Use for: Getting information for multiple DOIs, ORCIDs, or IDs efficiently
python
dois = [
    "https://doi.org/10.1038/s41586-021-03819-2",
    "https://doi.org/10.1126/science.abc1234",
    # ... up to 50 DOIs
]

works = client.batch_lookup(
    entity_type='works',
    ids=dois,
    id_field='doi'
)
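When there are more than 50 identifiers, the list has to be split before it is looked up. The following sketch shows one way to chunk a DOI list and build a pipe-separated `doi:` filter per batch. `chunked` and `doi_filters` are hypothetical helpers, the DOIs are placeholders, and it is an assumption that `batch_lookup()` does something similar internally.

```python
def chunked(items, size=50):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def doi_filters(dois, size=50):
    """Build one pipe-separated doi filter string per batch."""
    return ["doi:" + "|".join(batch) for batch in chunked(dois, size)]


# Placeholder DOIs, for illustration only.
dois = [f"https://doi.org/10.1234/example.{i}" for i in range(120)]
filters = doi_filters(dois)
print(len(filters), "requests needed")
```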

9. Random Sampling

Use for: Getting representative samples for analysis
python
# Small sample
works = client.sample_works(
    sample_size=100,
    seed=42,  # For reproducibility
    filter_params={"publication_year": "2023"}
)

# Large sample (>10k) - automatically handles multiple requests
works = client.sample_works(
    sample_size=25000,
    seed=42,
    filter_params={"is_oa": "true"}
)
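How the client splits a large sample is not described here. One plausible strategy, sketched below under that assumption, is to issue several requests of at most 10,000 works each with different seeds and then drop duplicates by `id`. Both function names are hypothetical and are not part of the client.

```python
def plan_sample_requests(sample_size, seed, max_per_request=10_000):
    """Split a large sample into (size, seed) pairs, one per request."""
    plan = []
    remaining = sample_size
    offset = 0
    while remaining > 0:
        size = min(remaining, max_per_request)
        plan.append((size, seed + offset))  # a new seed gives a different draw
        remaining -= size
        offset += 1
    return plan


def dedupe_by_id(works):
    """Drop repeated works, keeping the first occurrence of each id."""
    seen = set()
    unique = []
    for work in works:
        if work["id"] not in seen:
            seen.add(work["id"])
            unique.append(work)
    return unique


print(plan_sample_requests(25000, seed=42))
```

Because separate draws can overlap, the deduplicated total may come out slightly below the requested size.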

10. Citation Analysis

Use for: Finding papers that cite a specific work
python
import requests

# Get the work
work = client.get_entity('works', 'https://doi.org/10.1038/s41586-021-03819-2')

# Get citing papers using cited_by_api_url
citing_response = requests.get(
    work['cited_by_api_url'],
    params={'mailto': client.email, 'per-page': 200}
)
citing_works = citing_response.json()['results']
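The request above returns only the first page of citing works (up to 200). The `cited_by_api_url` field typically points at the `/works` endpoint with a `cites:` filter, so the same query can be expressed as ordinary parameters and fed into any paginated request. The sketch below builds those parameters. `citing_works_params` is a hypothetical helper and the work record is a placeholder.

```python
def citing_works_params(work, email=None, per_page=200):
    """Build query parameters that select works citing the given work."""
    work_id = work["id"].rstrip("/").split("/")[-1]
    params = {"filter": f"cites:{work_id}", "per-page": per_page}
    if email:
        params["mailto"] = email
    return params


# Placeholder work record, trimmed to the one field used here.
params = citing_works_params({"id": "https://openalex.org/W1234567890"})
print(params)
```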

11. Topic and Subject Analysis

Use for: Understanding research focus areas
python
# Get top topics for an institution
topics = client.group_by(
    entity_type='works',
    group_field='topics.id',
    filter_params={
        "authorships.institutions.id": "I136199984",  # MIT
        "publication_year": ">2020"
    }
)

for topic in topics[:10]:
    print(f"{topic['key_display_name']}: {topic['count']} works")

12. Large-Scale Data Extraction

Use for: Downloading large datasets for analysis
python
import csv

# Paginate through all results
all_papers = client.paginate_all(
    endpoint='/works',
    params={
        'search': 'synthetic biology',
        'filter': 'publication_year:2020-2024'
    },
    max_results=10000
)

# Export to CSV
with open('papers.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Year', 'Citations', 'DOI', 'OA Status'])

    for paper in all_papers:
        writer.writerow([
            paper.get('title', 'N/A'),
            paper.get('publication_year', 'N/A'),
            paper.get('cited_by_count', 0),
            paper.get('doi', 'N/A'),
            # open_access can be present but null, so fall back to {}
            (paper.get('open_access') or {}).get('oa_status', 'closed')
        ])
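For result sets larger than a few pages, the OpenAlex API offers cursor pagination: the first request sends `cursor=*` and each response carries `meta.next_cursor` for the next one. The sketch below shows that loop with the HTTP call injected as a function, so it runs without network access. `paginate` and the fake pages are hypothetical, and whether `paginate_all()` works this way is an assumption.

```python
def paginate(fetch_page, max_results=None):
    """Yield results page by page using cursor pagination.

    fetch_page(cursor) must return a dict with "results" and
    "meta" -> "next_cursor", mirroring the OpenAlex response shape.
    """
    if max_results is not None and max_results <= 0:
        return
    cursor = "*"  # "*" asks the API for the first page
    yielded = 0
    while cursor:
        page = fetch_page(cursor)
        for item in page["results"]:
            yield item
            yielded += 1
            if max_results is not None and yielded >= max_results:
                return
        cursor = page["meta"].get("next_cursor")


# Fake two-page source standing in for real HTTP requests.
FAKE_PAGES = {
    "*": {"results": [{"id": "W1"}, {"id": "W2"}], "meta": {"next_cursor": "abc"}},
    "abc": {"results": [{"id": "W3"}], "meta": {"next_cursor": None}},
}

all_items = list(paginate(FAKE_PAGES.__getitem__))
first_two = list(paginate(FAKE_PAGES.__getitem__, max_results=2))
print(len(all_items), len(first_two))
```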

Critical Best Practices

Always Use Email for Polite Pool

Add an email address to get a 10x higher rate limit (1 req/sec → 10 req/sec):
python
client = OpenAlexClient(email="your-email@example.edu")

Use Two-Step Pattern for Entity Lookups

Never filter by entity names directly; always get the ID first:
python
# ✅ Correct
# 1. Search for entity → get ID
# 2. Filter by ID

# ❌ Wrong
# filter=author_name:Einstein  # This doesn't work!

Use Maximum Page Size

Always use per-page=200 (the maximum page size) for efficient data retrieval:
python
results = client.search_works(search="topic", per_page=200)

Batch Multiple IDs

Use batch_lookup() for multiple IDs instead of individual requests:
python
# ✅ Correct - 1 request for 50 DOIs
works = client.batch_lookup('works', doi_list, 'doi')

# ❌ Wrong - 50 separate requests
for doi in doi_list:
    work = client.get_entity('works', doi)

Use Sample Parameter for Random Data

Use sample_works() with a seed for reproducible random sampling:
python
# ✅ Correct
works = client.sample_works(sample_size=100, seed=42)

# ❌ Wrong - random page numbers bias results
# Using random page numbers doesn't give a true random sample

Select Only Needed Fields

Reduce response size by selecting specific fields:
python
results = client.search_works(
    search="topic",
    select=['id', 'title', 'publication_year', 'cited_by_count']
)

Common Filter Patterns

Date Ranges

python
# Single year
filter_params={"publication_year": "2023"}

# After year
filter_params={"publication_year": ">2020"}

# Range
filter_params={"publication_year": "2020-2024"}

Multiple Filters (AND)

python
# All conditions must match
filter_params={
    "publication_year": ">2020",
    "is_oa": "true",
    "cited_by_count": ">100"
}

Multiple Values (OR)

python
# Any institution matches
filter_params={
    "authorships.institutions.id": "I136199984|I27837315"  # MIT or Harvard
}

Collaboration (AND within attribute)

python
# Papers with authors from BOTH institutions
filter_params={
    "authorships.institutions.id": "I136199984+I27837315"  # MIT AND Harvard
}

Negation

python
# Exclude type
filter_params={
    "type": "!paratext"
}
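The operators in the patterns above (`|` for OR, `+` for AND within one attribute, `!` for negation) are easy to mistype when written by hand. A few small helpers can make the intent explicit. The names `any_of`, `all_of`, and `none_of` are hypothetical conveniences, not part of the client.

```python
def any_of(*values):
    """OR: match any of the values (pipe-separated)."""
    return "|".join(values)


def all_of(*values):
    """AND within one attribute: match all values (plus-separated)."""
    return "+".join(values)


def none_of(value):
    """NOT: exclude a value (leading exclamation mark)."""
    return f"!{value}"


# Separate keys in the dict are still combined with AND.
filter_params = {
    "authorships.institutions.id": any_of("I136199984", "I27837315"),
    "type": none_of("paratext"),
    "publication_year": "2020-2024",
}
print(filter_params)
```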

Entity Types

OpenAlex provides these entity types:
  • works - Scholarly documents (articles, books, datasets)
  • authors - Researchers with disambiguated identities
  • institutions - Universities and research organizations
  • sources - Journals, repositories, conferences
  • topics - Subject classifications
  • publishers - Publishing organizations
  • funders - Funding agencies
Access any entity type using consistent patterns:
python
client.search_works(...)
client.get_entity('authors', author_id)
client.group_by('works', 'topics.id', filter_params={...})

External IDs

Use external identifiers directly:
python
# DOI for works
work = client.get_entity('works', 'https://doi.org/10.7717/peerj.4375')

# ORCID for authors
author = client.get_entity('authors', 'https://orcid.org/0000-0003-1613-5981')

# ROR for institutions
institution = client.get_entity('institutions', 'https://ror.org/02y3ad647')

# ISSN for sources
source = client.get_entity('sources', 'issn:0028-0836')
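Identifiers collected from spreadsheets or reference managers often arrive in bare form (for example `10.7717/peerj.4375` without the `https://doi.org/` prefix). The sketch below normalizes bare DOIs, ORCIDs, and ISSNs into the forms used in the examples above. `normalize_external_id` is a hypothetical helper and the pattern checks are deliberately simple; ROR IDs are only passed through when already given as full URLs.

```python
import re

ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
ISSN_RE = re.compile(r"^\d{4}-\d{3}[\dX]$")


def normalize_external_id(raw):
    """Map a bare DOI, ORCID, or ISSN to its full prefixed form."""
    value = raw.strip()
    if value.startswith(("https://", "http://", "issn:")):
        return value  # already in a full form
    if value.startswith("10."):
        return f"https://doi.org/{value}"
    if ORCID_RE.match(value):
        return f"https://orcid.org/{value}"
    if ISSN_RE.match(value):
        return f"issn:{value}"
    raise ValueError(f"Unrecognized identifier: {raw!r}")


print(normalize_external_id("10.7717/peerj.4375"))
```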

Reference Documentation

Detailed API Reference

See references/api_guide.md for:
  • Complete filter syntax
  • All available endpoints
  • Response structures
  • Error handling
  • Performance optimization
  • Rate limiting details

Common Query Examples

See references/common_queries.md for:
  • Complete working examples
  • Real-world use cases
  • Complex query patterns
  • Data export workflows
  • Multi-step analysis procedures

Scripts

openalex_client.py

Main API client with:
  • Automatic rate limiting
  • Exponential backoff retry logic
  • Pagination support
  • Batch operations
  • Error handling
Use for direct API access with full control.

query_helpers.py

High-level helper functions for common operations:
  • find_author_works() - Get papers by author
  • find_institution_works() - Get papers from institution
  • find_highly_cited_recent_papers() - Get influential papers
  • get_open_access_papers() - Find OA publications
  • get_publication_trends() - Analyze trends over time
  • analyze_research_output() - Comprehensive analysis
Use for common research queries with simplified interfaces.

Troubleshooting

Rate Limiting

If you encounter 403 errors:
  1. Ensure an email address is included in requests
  2. Verify that you are not exceeding 10 req/sec
  3. The client automatically implements exponential backoff
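For readers unfamiliar with exponential backoff, the sketch below shows the general idea: the wait doubles after each failed attempt, up to a cap. This is not the client's actual retry code. The function names, the default delays, and the choice to also retry on 429 (the standard HTTP "Too Many Requests" status) are assumptions, and `request` stands in for any callable that returns an HTTP status code.

```python
import time


def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Return the wait before each retry: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def call_with_retry(request, max_retries=5, sleep=time.sleep):
    """Call request() until it returns a status other than 403 or 429."""
    status = request()
    for delay in backoff_delays(max_retries):
        if status not in (403, 429):
            break
        sleep(delay)  # wait longer after each consecutive failure
        status = request()
    return status


print(backoff_delays(4))
```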

Empty Results

If searches return no results:
  1. Check filter syntax (see references/api_guide.md)
  2. Use the two-step pattern for entity lookups (don't filter by names)
  3. Verify that entity IDs are in the correct format

Timeout Errors

For large queries:
  1. Use pagination with per-page=200
  2. Use select= to limit returned fields
  3. Break into smaller queries if needed

Rate Limits

  • Default: 1 request/second, 100k requests/day
  • Polite pool (with email): 10 requests/second, 100k requests/day
Always use the polite pool for production workflows by providing an email address to the client.
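Staying under a per-second limit can be done by spacing requests a minimum interval apart. The class below is a minimal sketch of that idea, with the clock and sleep functions injectable so it can be tested without real waiting. `RateLimiter` is a hypothetical name, and the client's own rate limiting may be implemented differently.

```python
import time


class RateLimiter:
    """Space calls at least 1/rate seconds apart."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until the next request is allowed, then record the call time."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


limiter = RateLimiter(rate_per_sec=10)  # polite-pool limit from the list above
limiter.wait()  # call before each request
```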

Notes

  • No authentication required
  • All data is open and free
  • Rate limits apply globally, not per IP
  • Use LitLLM with OpenRouter if LLM-based analysis is needed (don't use Perplexity API directly)
  • Client handles pagination, retries, and rate limiting automatically