interpro-database
InterPro Database Access
Prerequisites
- `uv`: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://www.ebi.ac.uk/interpro/ and https://www.ebi.ac.uk/about/terms-of-use/, and (2) create the file, recording the notification text and a timestamp.
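The User Notification step can be sketched as a small idempotent check. This is only an illustration: the function name `ensure_license_notification`, the wording of `NOTICE`, and the file layout (timestamp on the first line, notice text after it) are assumptions, not something the skill prescribes.

```python
from datetime import datetime, timezone
from pathlib import Path

# Assumed wording; only the two URLs come from the prerequisites above.
NOTICE = (
    "This skill queries InterPro. Please check the terms at "
    "https://www.ebi.ac.uk/interpro/ and "
    "https://www.ebi.ac.uk/about/terms-of-use/"
)


def ensure_license_notification(skill_dir: Path) -> bool:
    """Notify once; return True only if the notification was shown now."""
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # already notified on an earlier run
    print(NOTICE)  # (1) notify the user prominently
    stamp = datetime.now(timezone.utc).isoformat()
    # (2) record the notification text and a timestamp
    marker.write_text(f"{stamp}\n{NOTICE}\n", encoding="utf-8")
    return True
```

Calling `ensure_license_notification(Path("."))` from the skill directory prints the notice on the first run and stays silent afterwards.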
Overview
InterPro combines signatures from multiple, diverse databases into a single
searchable resource, reducing redundancy and helping users interpret their
sequence analysis results. By uniting these member databases (e.g., Pfam, CDD,
SMART), InterPro capitalises on their individual strengths to produce a powerful
diagnostic tool and integrated resource.
Use `interpro-database` to:
- Identify what domains, families, and sites are found in a particular protein.
- Identify all proteins that belong to a protein family or contain a particular domain, even when the names and activities of the proteins are highly variable.
- Examine the species in which a particular protein family or domain is found.
- Annotate genomes with protein family information and Gene Ontology (GO) terms.
This skill provides a utility, `interpro_client.py`, for interacting with the InterPro API. It handles rate limiting (HTTP 429), sleeping and retrying while a query is still running in the background (HTTP 408), terminal errors (HTTP 404/410), and lazy pagination.
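To make the status handling concrete, here is a minimal sketch of a paginating generator with retries. It is not the code in `interpro_client.py`: the `Transport` callable, `TerminalError`, `iter_results`, the `Retry-After` fallback to exponential backoff, and the `results`/`next` page fields are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Iterator, Optional, Tuple

# Hypothetical transport: takes a URL, returns (status, headers, parsed JSON or None).
Transport = Callable[[str], Tuple[int, Dict[str, str], Optional[Dict[str, Any]]]]


class TerminalError(Exception):
    """A response that retrying cannot fix (HTTP 404/410) or exhausted retries."""


def iter_results(
    url: Optional[str],
    get: Transport,
    sleep: Callable[[float], None] = time.sleep,
    max_retries: int = 5,
) -> Iterator[Any]:
    """Lazily yield records page by page, retrying on HTTP 408 and 429."""
    attempts = 0
    while url:
        status, headers, body = get(url)
        if status in (404, 410):
            raise TerminalError(f"HTTP {status} for {url}")
        if status in (408, 429):
            attempts += 1
            if attempts > max_retries:
                raise TerminalError(f"gave up on {url} after {max_retries} retries")
            # Assumed: Retry-After, when present, is a number of seconds.
            sleep(float(headers.get("Retry-After", 2 ** attempts)))
            continue  # retry the same URL
        if status == 204:
            return  # no content, nothing to yield
        if status != 200 or body is None:
            raise RuntimeError(f"unexpected HTTP {status} for {url}")
        attempts = 0
        yield from body.get("results", [])
        url = body.get("next")  # assumed pagination field; None ends the loop
```

Because the function is a generator, a caller that stops after a few records never triggers requests for later pages.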
Core Rules
- Use the Wrapper: ALWAYS execute the helper script `scripts/interpro_client.py` to query the database rather than accessing the database directly. The script automatically enforces fair use and implements retry logic.
- For exploratory queries: ALWAYS use the CLI with a strict `--limit`. This lets you quickly understand the data schema without polluting your context window or fetching millions of results.
- Output to file: Use the CLI with `--output` to write results to a file rather than printing them all to the console. Process the output using jq or code.
- For more complex pipelines: Import the module directly into your Python scripts and consume the generator, which avoids deserializing CLI strings in large workflows.
- Notification: If this skill is used, ensure this is mentioned in the output.
Examples:

```bash
uv run ./scripts/interpro_client.py fetch protein --source_db reviewed --limit 2 --query_params tax_id=9606 --output exploratory_results.jsonl
```

```python
import itertools
import sys

sys.path.append('scripts')
from interpro_client import fetch_interpro_data

# fetch_interpro_data lazily yields results page-by-page
results = fetch_interpro_data(
    endpoint="entry",
    source_db="pfam",
    query_params={"page_size": 10}
)
# islice stops after 10 records, so later pages are never requested
for match in itertools.islice(results, 10):
    print(match["metadata"]["accession"])
```
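For the "process the output using code" rule, a file written with `--output` can be read back lazily. The sketch below assumes the file is JSON Lines (one JSON object per line) and that each record carries `metadata.accession` as in the Python example above; `iter_jsonl` and `first_accessions` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def first_accessions(path: str, limit: int = 5) -> List[str]:
    """Collect at most `limit` accessions without loading the whole file."""
    found: List[str] = []
    for record in iter_jsonl(path):
        # Assumed record shape: {"metadata": {"accession": ...}, ...}
        found.append(record["metadata"]["accession"])
        if len(found) >= limit:
            break
    return found
```

For example, `first_accessions("exploratory_results.jsonl")` would list the accessions from the exploratory CLI call above, provided the records have that shape.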
4 Ways to Construct Endpoints:
The arguments map strictly to the four common API path constructions. Do not format your own `/`-separated strings:
- `/{endpoint}` (e.g. `/entry`)
  ```bash
  uv run ./scripts/interpro_client.py fetch entry --limit 10 --output entries.jsonl
  ```
- `/{endpoint}/{sourceDB}` (e.g. `/entry/pfam`)
  ```bash
  uv run ./scripts/interpro_client.py fetch entry --source_db pfam --limit 10 --output pfam_entries.jsonl
  ```
- `/{endpoint}/{sourceDB}/{accession}` (e.g. `/entry/pfam/PF00001`)
  ```bash
  uv run ./scripts/interpro_client.py fetch entry --source_db pfam --accession PF00001 --limit 10 --output pf00001_entry.jsonl
  ```
- `/{endpoint}/{sourceDB}/{linked_endpoint}/{sourceDB}/{accession}` (e.g. `/entry/interpro/protein/uniprot/P04637`)
  ```bash
  uv run ./scripts/interpro_client.py fetch entry \
    --source_db interpro \
    --linked_endpoint protein \
    --linked_source_db uniprot \
    --linked_accession P04637 \
    --limit 10 --output p04637_entries.jsonl
  ```
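The mapping from arguments to paths can be pictured with a small function. This is purely illustrative: `build_path` is a hypothetical name, the wrapper already performs this mapping, and the sketch is not a substitute for calling it.

```python
from typing import List, Optional


def build_path(
    endpoint: str,
    source_db: Optional[str] = None,
    accession: Optional[str] = None,
    linked_endpoint: Optional[str] = None,
    linked_source_db: Optional[str] = None,
    linked_accession: Optional[str] = None,
) -> str:
    """Assemble an API path from the same arguments the CLI accepts."""
    if accession and not source_db:
        raise ValueError("accession requires source_db")
    parts: List[str] = [endpoint]
    if source_db:
        parts.append(source_db)
    if accession:
        parts.append(accession)
    linked = (linked_endpoint, linked_source_db, linked_accession)
    if any(linked):
        # The linked form needs all three linked values and a source_db.
        if not all(linked) or not source_db:
            raise ValueError("linked queries need source_db and all linked_* values")
        parts.extend(part for part in linked if part)
    return "/" + "/".join(parts)
```

With the arguments from the last CLI example, `build_path("entry", "interpro", linked_endpoint="protein", linked_source_db="uniprot", linked_accession="P04637")` gives `/entry/interpro/protein/uniprot/P04637`.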
Valid Source Databases (`--source_db`)
Each endpoint only accepts specific `source_db` values. Using an invalid value returns a 404 error.
- `/entry` (16 values): `interpro`, `pfam`, `cathgene3d`, `ssf`, `panther`, `cdd`, `profile`, `smart`, `ncbifam`, `prosite`, `prints`, `hamap`, `pirsf`, `sfld`, `antifam`.
- `/protein` (3 values): `uniprot` (all), `reviewed` (SwissProt), `unreviewed` (TrEMBL).
- `/structure` (1 value): `pdb`.
- `/taxonomy` (1 value): `uniprot`.
- `/proteome` (1 value): `uniprot`.
- `/set` (2 values): `pfam`, `cdd`.
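Since an invalid value only surfaces as a 404 after a network round trip, a pipeline may want to check it locally first. The following is a sketch under that assumption; `VALID_SOURCE_DBS` simply copies the values as they are listed in this document, and `check_source_db` is a hypothetical helper.

```python
from typing import Dict, FrozenSet

# Values copied from the list in this document, keyed by endpoint name.
VALID_SOURCE_DBS: Dict[str, FrozenSet[str]] = {
    "entry": frozenset({
        "interpro", "pfam", "cathgene3d", "ssf", "panther", "cdd", "profile",
        "smart", "ncbifam", "prosite", "prints", "hamap", "pirsf", "sfld",
        "antifam",
    }),
    "protein": frozenset({"uniprot", "reviewed", "unreviewed"}),
    "structure": frozenset({"pdb"}),
    "taxonomy": frozenset({"uniprot"}),
    "proteome": frozenset({"uniprot"}),
    "set": frozenset({"pfam", "cdd"}),
}


def check_source_db(endpoint: str, source_db: str) -> str:
    """Return source_db unchanged, or raise before any request is made."""
    allowed = VALID_SOURCE_DBS.get(endpoint)
    if allowed is None:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    if source_db not in allowed:
        raise ValueError(
            f"{source_db!r} is not valid for /{endpoint}; expected one of {sorted(allowed)}"
        )
    return source_db
```

For instance, `check_source_db("protein", "reviewed")` passes, while `check_source_db("structure", "uniprot")` raises a `ValueError` instead of producing a 404.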
Quick Reference / Core Endpoints & Parameters
For a complete, exhaustive list of all query parameters, see the
Full API Reference.
The API is fully open and supports 6 core endpoints. You can combine them using
the linked parameters described above. Below is a nested list of the specific
query parameters available for each endpoint:
- `/entry` (Domain, family, active site, repeat, or homologous superfamily entries)
  - `integrated`: Filter by integrated status (e.g., `pfam`).
  - `type`: Filter by type (e.g., `family`, `domain`, `homologous_superfamily`).
  - `go_term` / `go_category`: Filter by Gene Ontology.
  - `ida_search` / `ida_ignore` / `exact` / `ordered`: Filter by domain architecture (see IDA Search section).
  - `extra_fields`: Request additional data (e.g., `counters` for match coordinates).
  - `group_by` / `sort_by`: Aggregate or sort results (valid values depend on context, see Full API Reference).
  - Example: `uv run ./scripts/interpro_client.py count entry --source_db pfam --query_params type=domain --output count.jsonl`
- `/protein` (Protein records matching entries or domains)
  - `tax_id`: Filter by taxonomy ID (does not search lineage).
  - `match_presence`: Filter by proteins having InterPro matches (`true`/`false`).
  - `is_fragment`: Filter complete vs. fragment sequences.
  - `group_by`: Aggregate results (e.g., `taxonomy`).
  - `extra_fields`: Request sequence or match details.
  - `isoforms` / `residues` / `structureinfo`: Include specific sub-features.
  - `conservation` / `extra_features`: Append residue conservation flags or Mobidb/coil features (only valid for `/protein/{source_db}/{accession}`).
  - Example: `uv run ./scripts/interpro_client.py fetch protein --source_db uniprot --limit 20 --query_params tax_id=9606 --output human_proteins.jsonl`
- `/structure` (PDB structures linked to InterPro entries)
  - `experiment_type`: Filter by experimental method (e.g., `X-RAY DIFFRACTION`).
  - `resolution`: Filter by resolution limit.
  - `extra_fields`: Include additional structural metadata.
  - `group_by`: Aggregate results.
  - Example: `uv run ./scripts/interpro_client.py fetch structure --source_db pdb --accession 1ATP --limit 10 --output 1atp_structures.jsonl`
- `/taxonomy` (Taxonomy distribution nodes)
  - `key_species`: Limit results to key species.
  - `with_names`: Include scientific names.
  - `filter_by_entry` / `filter_by_entry_db`: Filter by intersection with specific entries.
  - `extra_fields`: Additional taxonomic metadata.
  - Example: `uv run ./scripts/interpro_client.py fetch taxonomy --source_db uniprot --accession 9606 --limit 10 --output human_taxonomy.jsonl`
- `/proteome` (Complete proteomes linked to InterPro)
  - `extra_fields`: General query expansion.
  - Example: `uv run ./scripts/interpro_client.py fetch proteome --source_db uniprot --accession UP000005640 --limit 10 --output proteome.jsonl`
- `/set` (Curated sets of related entries, e.g., Pfam clans)
  - `extra_fields`: Additional metadata (only valid for `/set/{sourceDB}`).
  - Example: `uv run ./scripts/interpro_client.py fetch set --source_db pfam --accession CL0001 --limit 10 --output pfam_clan.jsonl`
InterPro Domain Architecture (IDA) Search
InterPro provides powerful tools for searching proteins by their domain
architecture (the exact combination and order of domains). Because the API does
not allow querying proteins directly by multiple domains at once (e.g., "give me
proteins with PF00069 AND PF00017"), finding proteins with specific domain
combinations requires a two-step process.
Step 1: Find matching architectures (`ida_search`)
The `ida_search` parameter is used on the root `/entry` endpoint to find all Domain Architectures (IDAs) containing the domains you specify.
- Constraints:
  - Valid ONLY on the root `/entry` endpoint.
  - Cannot be combined with non-IDA parameters.
- Modifiers (only valid with `ida_search`):
  - `ida_ignore`: Ignores the given domains in the search (query param).
  - `ordered`: Ensures domains appear in the exact specified order (flag).
  - `exact`: Ensures the architecture matches exactly (no additional domains) (flag). Requires the `ordered` flag to be present.

Example: Find architectures containing both a kinase domain (PF00069) and an SH2 domain (PF00017), in that exact order:

```bash
uv run scripts/interpro_client.py fetch entry \
  --query_params ida_search=PF00069,PF00017 \
  --flags ordered exact \
  --output architectures.jsonl
```

Note: This returns the architectures and their unique `ida_id`s, not all individual proteins.

Step 2: Fetch proteins for those architectures (`ida`)
Once you have the `ida_id`s (e.g., `619edbb...`) from Step 1, you can fetch all the actual proteins that share that precise layout by filtering the `/protein` endpoint.
Constraints:
- Valid on `/protein` and `/entry/{sourceDB}/{accession}` endpoints.

Example: Fetch proteins matching one of the architecture IDs from Step 1:

```bash
uv run scripts/interpro_client.py fetch protein \
  --source_db uniprot \
  --query_params ida=619edbb2b445bfa3ad51bd894e3c115b025a5f25 \
  --output matching_proteins.jsonl
```

(When building pipelines or querying comprehensively, you would loop through all the `ida_id`s from Step 1 and run Step 2 for each one.)
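The two steps can be chained in a Python pipeline along the following lines. This is a sketch with stated assumptions: `fetch` stands for a callable with the keyword interface of `fetch_interpro_data` (`endpoint`, `source_db`, `query_params`, `flags`), each Step 1 record is assumed to expose its identifier under the key `ida_id`, and `proteins_for_domains` is a hypothetical name.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Sequence, Tuple

Fetch = Callable[..., Iterable[Dict[str, Any]]]


def proteins_for_domains(
    fetch: Fetch,
    domains: Sequence[str],
    ordered: bool = True,
    exact: bool = False,
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (ida_id, protein record) pairs for architectures containing `domains`."""
    if exact and not ordered:
        raise ValueError("the exact flag requires the ordered flag")
    flags: List[str] = []
    if ordered:
        flags.append("ordered")
    if exact:
        flags.append("exact")
    # Step 1: architectures that contain the requested domains.
    architectures = fetch(
        endpoint="entry",
        query_params={"ida_search": ",".join(domains)},
        flags=flags,
    )
    for architecture in architectures:
        ida_id = architecture["ida_id"]  # assumed field name
        # Step 2: proteins that share this exact architecture.
        proteins = fetch(
            endpoint="protein",
            source_db="uniprot",
            query_params={"ida": ida_id},
        )
        for protein in proteins:
            yield ida_id, protein
```

Passing `fetch_interpro_data` as `fetch` keeps the whole pipeline lazy, so wrapping the result in `itertools.islice` bounds the number of requests during exploration.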
InterPro Entry Types
Each InterPro entry is assigned a type indicating what you can infer when a
protein matches the entry:
- Domain: Distinct functional, structural or sequence units that may exist in a variety of biological contexts. Example: PH domain or classical C2H2 zinc finger.
- Family: A group of proteins sharing a common evolutionary origin reflected by related functions, sequence similarities, or primary/secondary/tertiary structures.
- Homologous Superfamily: Proteins sharing an evolutionary origin reflected by structural similarity but often displaying very low sequence similarity. Usually comprises signatures from the SUPERFAMILY and CATH-Gene3D databases.
- Repeat: A short sequence that is typically repeated within a protein, often <50 amino acids long. Example: Leucine Rich Repeats or WD40 repeats.
- Site: Includes Active site (a sequence containing conserved residues required for catalytic activity) and Binding site (a sequence containing conserved residues that form a protein interaction site).
InterPro-N Predictions
InterPro-N is a deep-learning-based extension of the standard InterPro database.
It utilizes an AI architecture inspired by computer vision to treat protein
sequence annotation as a "panoptic segmentation" task, labeling residues and
distinguishing between domains.
When to use InterPro-N
Standard InterPro signatures are the "gold standard" and should not be discarded
in favor of InterPro-N predictions. Use InterPro-N primarily to fill in gaps or
refine results.
In addition to standard InterPro:
- Analyzing "Dark Matter" (Uncharacterised Proteins): Use when a protein returns no hits in standard InterPro. InterPro-N excels at identifying remote homologs.
- Resolving Complex Repeats: Use for proteins with multiple tandem repeats (e.g., TPR or WD40) where standard HMMs might merge or miss them.
- Predicting Discontinuous Domains: Use when a domain sequence is interrupted by a completely different inserted sequence.
Instead of standard InterPro (specific scenarios):
- Precise Boundary Delineation: When you need more accurate start-and-stop coordinates for a domain than fuzzy standard hits provide.
- Large-Scale Metagenomic Screening: For initial high-recovery screening of fragmented or highly divergent sequences.
Fallback Strategy: Checking InterPro-N
When you are asked to find annotations for a protein and standard InterPro
queries return no results or no annotations, you MUST check InterPro-N as a
fallback.
Example Scenario: If a user asks to "List the SSF annotations for protein X" and the standard query returns no hits, you should retry the query with the `interpro_n` flag.
This fallback is crucial because InterPro-N can identify remote homologs and domains in "dark matter" proteins that standard methods miss.
If found, ALWAYS report to the user that these annotations are deep learning predictions from InterPro-N.
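The fallback rule can be expressed as a thin wrapper that also records where the results came from, so the provenance can be reported to the user. This is a sketch only: `fetch` stands for a callable with the keyword interface of `fetch_interpro_data`, "no results" is assumed to mean an empty iterator, and `fetch_with_fallback` and its provenance labels are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Fetch = Callable[..., Iterable[Dict[str, Any]]]


def fetch_with_fallback(fetch: Fetch, **query: Any) -> Tuple[List[Dict[str, Any]], str]:
    """Run the standard query first; retry with interpro_n only if it is empty."""
    standard = list(fetch(**query))
    if standard:
        return standard, "InterPro"
    # Add the interpro_n flag to whatever flags the caller already passed.
    flags = list(query.pop("flags", [])) + ["interpro_n"]
    predicted = list(fetch(flags=flags, **query))
    return predicted, ("InterPro-N" if predicted else "none")
```

A caller would check the second element and, when it is `"InterPro-N"`, state that the annotations are deep learning predictions. The results are materialised with `list`, so this shape suits single-protein lookups rather than large listings.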
How to Use
InterPro-N predictions are accessed by passing the `interpro_n` flag to the `protein` endpoint with `uniprot` as the source database.
Via CLI:

```bash
uv run ./scripts/interpro_client.py fetch protein \
  --source_db uniprot \
  --accession A0A096LNN2 \
  --flags interpro_n \
  --output A0A096LNN2_interpro_n.jsonl
```

Via Python Pipeline:

```python
import sys

sys.path.append('scripts')
from interpro_client import fetch_interpro_data

# The interpro_n flag requests InterPro-N predictions for this protein
results = fetch_interpro_data(
    endpoint="protein",
    source_db="uniprot",
    accession="A0A096LNN2",
    flags=["interpro_n"],
)
```

Strict Lookup Rules
- Always Use UniProt Accessions, NEVER Gene Names: When looking up proteins in InterPro, you MUST use their UniProt Accessions (e.g. `P04637`). InterPro does not natively support or reliably map gene names (e.g. `TP53`). If the user provides a gene name, you must use a database like Ensembl or UniProt first to resolve it to an accession.
- NEVER Iterate to Count: When asked for an aggregate count (e.g., "How many domains are there?"), you MUST read the `count` field from the initial API JSON response using the `get_interpro_count()` helper. NEVER iterate over the `fetch_interpro_data` generator to tally elements. Iterating over an endpoint with 50,000+ entries just to count them silently hangs the agent and abuses the API. Every time. No exceptions.

  ✅ Correct:

  Via CLI:

  ```bash
  uv run ./scripts/interpro_client.py count entry --source_db interpro --query_params type=domain --output count.json
  ```

  Via Python Pipeline:

  ```python
  from interpro_client import get_interpro_count

  cnt = get_interpro_count(
      endpoint="entry",
      source_db="interpro",
      query_params={"type": "domain"},
  )
  ```

  ❌ Wrong (Iterating over fetch):

  ```bash
  # NEVER DO THIS:
  uv run ./scripts/interpro_client.py fetch entry --source_db interpro --query_params type=domain --output output.jsonl && wc -l output.jsonl
  ```
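A cheap guard for the first rule is to test whether an identifier has the shape of a UniProt accession before querying. The sketch below uses the accession pattern published in UniProt's help pages; `looks_like_uniprot_accession` is a hypothetical helper, and a match only means the string is well formed, not that the accession exists.

```python
import re

# Accession-number pattern as published in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_accession(identifier: str) -> bool:
    """Return True if the whole string has the form of a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(identifier.strip()) is not None
```

Here `P04637` and `A0A096LNN2` pass, while a gene name such as `TP53` does not and should first be resolved to an accession through UniProt or Ensembl.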
Quick examples
For detailed examples of the invocations and JSON output schemas returned by
various endpoints, see the
Example Responses Reference. This TSV
contains command-line calls, Python equivalents, and the corresponding JSON
payload structures.
1. Determining all protein domains

```bash
# Fetches InterPro Entries within UniProt protein P04637
# URL equivalent: /entry/interpro/protein/uniprot/P04637
uv run ./scripts/interpro_client.py fetch entry \
  --source_db interpro \
  --linked_endpoint protein \
  --linked_source_db uniprot \
  --linked_accession P04637 \
  --output p04637_domains.jsonl
```

2. Fetching all PDB structures for an Entry

```bash
# URL equivalent: /structure/pdb/entry/interpro/IPR011615
# Only fetch the first 5 structures
uv run ./scripts/interpro_client.py fetch structure \
  --source_db pdb \
  --linked_endpoint entry \
  --linked_source_db interpro \
  --linked_accession IPR011615 \
  --limit 5 \
  --output ipr011615_structures.jsonl
```