interpro-database


InterPro Database Access

Prerequisites

  1. uv: Read the uv skill and follow its Setup instructions to ensure uv is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://www.ebi.ac.uk/interpro/ and https://www.ebi.ac.uk/about/terms-of-use/, then (2) create the file recording the notification text and timestamp.
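The notification step can be sketched in Python as follows. This is a minimal sketch, not part of the skill: the function name ensure_license_notification, the notice wording, and the file layout (notice text followed by a UTC timestamp) are assumptions. Only the file name and the two URLs come from the prerequisite itself.

```python
from datetime import datetime, timezone
from pathlib import Path

# Notice wording is illustrative; only the two URLs come from the prerequisite.
NOTICE = (
    "This skill queries InterPro. Please check the terms at "
    "https://www.ebi.ac.uk/interpro/ and https://www.ebi.ac.uk/about/terms-of-use/"
)


def ensure_license_notification(skill_dir: Path) -> bool:
    """Notify once per skill directory; return True if a notice was shown."""
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # the user was already notified
    print(f"*** NOTICE ***\n{NOTICE}")  # (1) prominently notify the user
    timestamp = datetime.now(timezone.utc).isoformat()
    # (2) record the notification text and timestamp
    marker.write_text(f"{NOTICE}\nNotified at: {timestamp}\n", encoding="utf-8")
    return True
```

Calling the function a second time for the same directory returns False and prints nothing, which matches the "if the file does not already exist" condition.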

Overview

InterPro combines signatures from multiple, diverse databases into a single searchable resource, reducing redundancy and helping users interpret their sequence analysis results. By uniting these member databases (e.g., Pfam, CDD, SMART), InterPro capitalises on their individual strengths to produce a powerful diagnostic tool and integrated resource.
Use interpro-database to:
  • Identify what domains, families, and sites are found in a particular protein.
  • Identify all proteins that belong to a protein family or contain a particular domain, even when the names and activities of the proteins are highly variable.
  • Examine the species in which a particular protein family or domain is found.
  • Annotate genomes with protein family information and Gene Ontology (GO) terms.
This skill provides a utility, interpro_client.py, for interacting with the InterPro API. It handles rate limiting (HTTP 429), sleeping and retrying while a query is still running in the background (HTTP 408), terminal errors (HTTP 404/410), and lazy pagination.
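The status-code handling described above can be illustrated with a short sketch. This is not the code in interpro_client.py: the function name get_with_retry, the injected send callable, the fixed delay, and the retry budget are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Tuple

RETRYABLE = {408, 429}  # query still running in the background / rate limited
TERMINAL = {404, 410}   # not found / gone: retrying cannot help


def get_with_retry(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() until it succeeds, hits a terminal error, or runs out of attempts.

    send is a hypothetical callable returning (status_code, json_body).
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status == 200:
            return body
        if status in TERMINAL:
            raise LookupError(f"terminal HTTP {status}")
        if status in RETRYABLE and attempt < max_attempts:
            sleep(delay_s)  # wait before asking again
            continue
        raise RuntimeError(f"giving up after HTTP {status} (attempt {attempt})")
    raise RuntimeError("max_attempts must be at least 1")
```

The sketch uses a fixed delay to stay short. A production client would more likely back off progressively or follow whatever wait hint the server provides.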

Core Rules

  • Use the Wrapper: ALWAYS execute the scripts/interpro_client.py helper script to query the database rather than accessing the database directly. The script automatically enforces fair use and implements retry logic.
  • For exploratory queries: ALWAYS use the CLI with a strict --limit. This lets you rapidly understand the data schema without polluting your context window or fetching millions of results.
  • Output to file: Use the CLI with --output to output to a file rather than attempting to print it all to the console. Process the output using jq or code.
  • For more complex pipelines: Import the module directly into your Python scripts and consume the generator, which avoids deserializing CLI output in large workflows.
  • Notification: If this skill is used, ensure this is mentioned in the output.
Examples:
bash
uv run ./scripts/interpro_client.py fetch protein --source_db reviewed --limit 2 --query_params tax_id=9606 --output exploratory_results.jsonl
python
import itertools
import sys

# Make scripts/interpro_client.py importable
sys.path.append('scripts')
from interpro_client import fetch_interpro_data

# fetch_interpro_data lazily yields results page-by-page
results = fetch_interpro_data(
    endpoint="entry",
    source_db="pfam",
    query_params={"page_size": 10},
)
# Stop after the first 10 results instead of exhausting the generator
for match in itertools.islice(results, 10):
    print(match["metadata"]["accession"])
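Once results are in a file, they can be processed without printing everything to the console. The sketch below assumes the --output file is JSON Lines (one JSON object per line, as the .jsonl extension suggests) and that each record carries metadata.accession, as in the Python example above. The helper name iter_accessions is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_accessions(path: Path) -> Iterator[str]:
    """Yield metadata.accession from each non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            yield record["metadata"]["accession"]
```

Because the function is a generator, a large results file is read one line at a time rather than loaded into memory at once.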

4 Ways to Construct Endpoints:

The arguments map strictly to the four common API path constructions. Do not build your own /-separated strings:
  1. /{endpoint} (e.g. /entry)
    uv run ./scripts/interpro_client.py fetch entry --limit 10 --output entries.jsonl
  2. /{endpoint}/{sourceDB} (e.g. /entry/pfam)
    uv run ./scripts/interpro_client.py fetch entry --source_db pfam --limit 10 --output pfam_entries.jsonl
  3. /{endpoint}/{sourceDB}/{accession} (e.g. /entry/pfam/PF00001)
    uv run ./scripts/interpro_client.py fetch entry --source_db pfam --accession PF00001 --limit 10 --output pf00001_entry.jsonl
  4. /{endpoint}/{sourceDB}/{linked_endpoint}/{linked_sourceDB}/{linked_accession} (e.g. /entry/interpro/protein/uniprot/P04637)
    uv run ./scripts/interpro_client.py fetch entry \
        --source_db interpro \
        --linked_endpoint protein \
        --linked_source_db uniprot \
        --linked_accession P04637 \
        --limit 10 --output p04637_entries.jsonl
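To make the mapping from arguments to paths concrete, here is a small sketch. The function build_path is hypothetical and is not the logic inside interpro_client.py. It exists only to show which argument fills which path segment, so keep passing structured arguments to the script rather than building paths yourself.

```python
from typing import Optional


def build_path(
    endpoint: str,
    source_db: Optional[str] = None,
    accession: Optional[str] = None,
    linked_endpoint: Optional[str] = None,
    linked_source_db: Optional[str] = None,
    linked_accession: Optional[str] = None,
) -> str:
    """Join the structured arguments into one of the four path shapes."""
    if accession and not source_db:
        raise ValueError("accession requires source_db")
    linked = (linked_endpoint, linked_source_db, linked_accession)
    if any(linked) and not (all(linked) and source_db):
        raise ValueError("a linked query needs source_db and all three linked_* values")
    # Segments that were not supplied are simply skipped.
    parts = [endpoint, source_db, accession, *linked]
    return "/" + "/".join(p for p in parts if p)
```

For instance, build_path("entry", "interpro", linked_endpoint="protein", linked_source_db="uniprot", linked_accession="P04637") yields the path shown in construction 4.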

Valid Source Databases (--source_db)

Each endpoint accepts only specific source_db values. Using an invalid value returns a 404 error.
  • /entry (15 values): interpro, pfam, cathgene3d, ssf, panther, cdd, profile, smart, ncbifam, prosite, prints, hamap, pirsf, sfld, antifam.
  • /protein (3 values): uniprot (all), reviewed (SwissProt), unreviewed (TrEMBL).
  • /structure (1 value): pdb.
  • /taxonomy (1 value): uniprot.
  • /proteome (1 value): uniprot.
  • /set (2 values): pfam, cdd.
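A local check against the list above lets a pipeline fail fast instead of sending a request that comes back as a 404. The sketch below transcribes the values exactly as listed in this section. The names VALID_SOURCE_DBS and check_source_db are hypothetical and not part of interpro_client.py.

```python
VALID_SOURCE_DBS = {
    "entry": {
        "interpro", "pfam", "cathgene3d", "ssf", "panther", "cdd", "profile",
        "smart", "ncbifam", "prosite", "prints", "hamap", "pirsf", "sfld", "antifam",
    },
    "protein": {"uniprot", "reviewed", "unreviewed"},
    "structure": {"pdb"},
    "taxonomy": {"uniprot"},
    "proteome": {"uniprot"},
    "set": {"pfam", "cdd"},
}


def check_source_db(endpoint: str, source_db: str) -> None:
    """Raise ValueError for a combination that the list above does not include."""
    allowed = VALID_SOURCE_DBS.get(endpoint)
    if allowed is None:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    if source_db not in allowed:
        raise ValueError(
            f"{source_db!r} is not valid for /{endpoint}; expected one of {sorted(allowed)}"
        )
```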

Quick Reference / Core Endpoints & Parameters

For a complete list of all query parameters, see the Full API Reference.
The API is fully open and supports 6 core endpoints. You can combine them using the linked parameters described above. The query parameters available for each endpoint are listed below:
  • /entry (Domain, family, active site, repeat, or homologous superfamily entries)
    • integrated: Filter by integrated status (e.g., pfam).
    • type: Filter by type (e.g., family, domain, homologous_superfamily).
    • go_term / go_category: Filter by Gene Ontology.
    • ida_search / ida_ignore / exact / ordered: Filter by domain architecture (see IDA Search section).
    • extra_fields: Request additional data (e.g., counters for match coordinates).
    • group_by / sort_by: Aggregate or sort results (valid values depend on context, see Full API Reference).
    • Example:
      uv run ./scripts/interpro_client.py count entry --source_db pfam --query_params type=domain --output count.jsonl
  • /protein (Protein records matching entries or domains)
    • tax_id: Filter by taxonomy ID (does not search lineage).
    • match_presence: Filter by proteins having InterPro matches (true / false).
    • is_fragment: Filter complete vs. fragment sequences.
    • group_by: Aggregate results (e.g., taxonomy).
    • extra_fields: Request sequence or match details.
    • isoforms / residues / structureinfo: Include specific sub-features.
    • conservation / extra_features: Append residue conservation flags or Mobidb/coil features (only valid for /protein/{source_db}/{accession}).
    • Example:
      uv run ./scripts/interpro_client.py fetch protein --source_db uniprot --limit 20 --query_params tax_id=9606 --output human_proteins.jsonl
  • /structure (PDB structures linked to InterPro entries)
    • experiment_type: Filter by experimental method (e.g., X-RAY DIFFRACTION).
    • resolution: Filter by resolution limit.
    • extra_fields: Include additional structural metadata.
    • group_by: Aggregate results.
    • Example:
      uv run ./scripts/interpro_client.py fetch structure --source_db pdb --accession 1ATP --limit 10 --output 1atp_structures.jsonl
  • /taxonomy (Taxonomy distribution nodes)
    • key_species: Limit results to key species.
    • with_names: Include scientific names.
    • filter_by_entry / filter_by_entry_db: Filter intersection with specific entries.
    • extra_fields: Additional taxonomic metadata.
    • Example:
      uv run ./scripts/interpro_client.py fetch taxonomy --source_db uniprot --accession 9606 --limit 10 --output human_taxonomy.jsonl
  • /proteome (Complete proteomes linked to InterPro)
    • extra_fields: General query expansion.
    • Example:
      uv run ./scripts/interpro_client.py fetch proteome --source_db uniprot --accession UP000005640 --limit 10 --output proteome.jsonl
  • /set (Curated sets of related entries, e.g., Pfam clans)
    • extra_fields: Additional metadata (only valid for /set/{sourceDB}).
    • Example:
      uv run ./scripts/interpro_client.py fetch set --source_db pfam --accession CL0001 --limit 10 --output pfam_clan.jsonl

InterPro Domain Architecture (IDA) Search

InterPro provides powerful tools for searching proteins by their domain architecture (the exact combination and order of domains). Because the API does not allow querying proteins directly by multiple domains at once (e.g., "give me proteins with PF00069 AND PF00017"), finding proteins with specific domain combinations requires a two-step process.

Step 1: Find matching architectures (ida_search)

The ida_search parameter is used on the root /entry endpoint to find all InterPro Domain Architectures (IDAs) containing the domains you specify.
  • Constraints:
    • Valid ONLY on the root /entry endpoint.
    • Cannot be combined with non-IDA parameters.
  • Modifiers (only valid with ida_search):
    • ida_ignore: Ignores the given domains in the search (query param).
    • ordered: Ensures domains appear in the exact specified order (flag).
    • exact: Ensures the architecture matches exactly (no additional domains) (flag). Requires the ordered flag to be present.
Example: Find architectures containing both a kinase domain (PF00069) and an SH2 domain (PF00017), in that exact order:
bash
uv run scripts/interpro_client.py fetch entry \
  --query_params ida_search=PF00069,PF00017 \
  --flags ordered exact \
  --output architectures.jsonl
Note: This returns the architectures and their unique ida_ids, not all individual proteins.

Step 2: Fetch proteins for those architectures (ida)

Once you have the ida_ids (e.g., 619edbb...) from Step 1, you can fetch the proteins that share that precise layout by filtering the /protein endpoint with the ida parameter.
Constraints:
  • Valid on /protein and /entry/{sourceDB}/{accession} endpoints.
Example: Fetch proteins matching one of the architecture IDs from Step 1:
bash
uv run scripts/interpro_client.py fetch protein \
  --source_db uniprot \
  --query_params ida=619edbb2b445bfa3ad51bd894e3c115b025a5f25 \
  --output matching_proteins.jsonl
(When building pipelines or querying comprehensively, loop through all the ida_ids from Step 1 and run Step 2 for each one.)
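The two-step loop can be sketched as a single generator. This is a sketch under stated assumptions: the fetch argument stands in for fetch_interpro_data and is assumed to accept the endpoint, source_db, query_params and flags keywords used elsewhere in this document. The field name ida_id on each Step 1 result is an assumption, and the function name proteins_for_architectures is hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

# A fetch function shaped like fetch_interpro_data (keyword arguments only).
Fetch = Callable[..., Iterable[dict]]


def proteins_for_architectures(
    fetch: Fetch,
    domains: List[str],
    id_field: str = "ida_id",  # assumed name of the field holding the architecture ID
) -> Iterator[dict]:
    """Step 1: list matching architectures. Step 2: fetch proteins for each one."""
    architectures = fetch(
        endpoint="entry",
        query_params={"ida_search": ",".join(domains)},
        flags=["ordered", "exact"],
    )
    for architecture in architectures:
        # One /protein query per architecture ID found in Step 1
        yield from fetch(
            endpoint="protein",
            source_db="uniprot",
            query_params={"ida": architecture[id_field]},
        )
```

Passing the fetch function in as an argument keeps the sketch testable with a stub. In a real pipeline you would pass fetch_interpro_data and inspect one Step 1 record first to confirm the actual field name.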

InterPro Entry Types

Each InterPro entry is assigned a type indicating what you can infer when a protein matches the entry:
  • Domain: Distinct functional, structural or sequence units that may exist in a variety of biological contexts. Example: PH domain or classical C2H2 zinc finger.
  • Family: A group of proteins sharing a common evolutionary origin reflected by related functions, sequence similarities, or primary/secondary/tertiary structures.
  • Homologous Superfamily: Proteins sharing an evolutionary origin reflected by structural similarity but often displaying very low sequence similarity. Usually comprises signatures from the SUPERFAMILY and CATH-Gene3D databases.
  • Repeat: A short sequence that is typically repeated within a protein, often <50 amino acids long. Example: Leucine Rich Repeats or WD40 repeats.
  • Site: Includes Active site (sequence containing conserved residues for catalytic activity) and Binding site (sequence containing conserved residues forming a protein interaction site).

InterPro-N Predictions

InterPro-N is a deep-learning-based extension of the standard InterPro database. It utilizes an AI architecture inspired by computer vision to treat protein sequence annotation as a "panoptic segmentation" task, labeling residues and distinguishing between domains.

When to use InterPro-N

Standard InterPro signatures are the "gold standard" and should not be discarded in favor of InterPro-N predictions. Use InterPro-N primarily to fill in gaps or refine results.
In addition to standard InterPro:
  • Analyzing "Dark Matter" (Uncharacterised Proteins): Use when a protein returns no hits in standard InterPro. InterPro-N excels at identifying remote homologs.
  • Resolving Complex Repeats: Use for proteins with multiple tandem repeats (e.g., TPR or WD40) where standard HMMs might merge or miss them.
  • Predicting Discontinuous Domains: Use when a domain sequence is interrupted by a completely different inserted sequence.
Instead of standard InterPro (specific scenarios):
  • Precise Boundary Delineation: When you need more accurate start-and-stop coordinates for a domain than fuzzy standard hits provide.
  • Large-Scale Metagenomic Screening: For initial high-recovery screening of fragmented or highly divergent sequences.

Fallback Strategy: Checking InterPro-N

When you are asked to find annotations for a protein and standard InterPro queries return no results or no annotations, you MUST check InterPro-N as a fallback.
Example Scenario: If a user asks to "List the SSF annotations for protein X" and the standard query returns no hits, you should retry the query with the interpro_n flag.
This fallback is crucial because InterPro-N can identify remote homologs and domains in "dark matter" proteins that standard methods miss.
If found, ALWAYS report to the user that these annotations are deep learning predictions from InterPro-N.
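The fallback rule above can be sketched as follows. The fetch argument stands in for fetch_interpro_data, and the function name annotations_with_fallback is hypothetical. The sketch also assumes that "no hits" shows up as an empty result list, so a real check may need to look inside the returned record for empty annotation fields instead.

```python
from typing import Callable, Iterable, List, Tuple

# A fetch function shaped like fetch_interpro_data (keyword arguments only).
Fetch = Callable[..., Iterable[dict]]


def annotations_with_fallback(fetch: Fetch, accession: str) -> Tuple[str, List[dict]]:
    """Return (source, results); source is "InterPro", "InterPro-N" or "none"."""
    query = dict(endpoint="protein", source_db="uniprot", accession=accession)
    standard = list(fetch(**query))
    if standard:
        return "InterPro", standard
    # Nothing from the standard query: retry with the interpro_n flag.
    predicted = list(fetch(**query, flags=["interpro_n"]))
    if predicted:
        return "InterPro-N", predicted
    return "none", []
```

Returning the source label alongside the results makes it hard to forget the reporting requirement: whenever the label is "InterPro-N", tell the user the annotations are deep learning predictions.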

How to Use

InterPro-N predictions are accessed by passing the interpro_n flag to the protein endpoint with uniprot as the source database.
Via CLI:
bash
uv run ./scripts/interpro_client.py fetch protein \
    --source_db uniprot \
    --accession A0A096LNN2 \
    --flags interpro_n \
    --output A0A096LNN2_interpro_n.jsonl
Via Python Pipeline:
python
import sys
sys.path.append('scripts')  # make scripts/interpro_client.py importable
from interpro_client import fetch_interpro_data

results = fetch_interpro_data(
    endpoint="protein",
    source_db="uniprot",
    accession="A0A096LNN2",
    flags=["interpro_n"],
)

Strict Lookup Rules

  1. Always Use UniProt Accessions, NEVER Gene Names: When looking up proteins in InterPro, you MUST use their UniProt accessions (e.g. P04637). InterPro does not natively support or reliably map gene names (e.g. TP53). If the user provides a gene name, first resolve it to an accession using a database such as Ensembl or UniProt.
  2. NEVER Iterate to Count: When asked for an aggregate count (e.g., "How many domains are there?"), you MUST read the count field from the initial API JSON response using the get_interpro_count() helper. NEVER iterate over the fetch_interpro_data generator to tally elements. Iterating over an endpoint with 50,000+ entries just to count them silently hangs the agent and abuses the API. Every time. No exceptions.
    Correct:
    Via CLI:
    bash
    uv run ./scripts/interpro_client.py count entry \
        --source_db interpro \
        --query_params type=domain \
        --output count.json
    Via Python Pipeline:
    python
    import sys
    sys.path.append('scripts')  # make scripts/interpro_client.py importable
    from interpro_client import get_interpro_count
    cnt = get_interpro_count(
        endpoint="entry",
        source_db="interpro",
        query_params={"type": "domain"},
    )
    Wrong (Iterating over fetch):
    bash
    # NEVER DO THIS:
    uv run ./scripts/interpro_client.py fetch entry \
        --source_db interpro \
        --query_params type=domain \
        --output output.jsonl \
        && wc -l output.jsonl
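For rule 1, a quick shape check can catch a gene name before it reaches InterPro. The sketch below uses the accession pattern published in the UniProt help pages. The helper name looks_like_uniprot_accession is hypothetical, and passing the check only means the string has the right shape, not that the accession exists.

```python
import re

# Accession pattern as published in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_accession(identifier: str) -> bool:
    """True if the identifier has the shape of a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(identifier) is not None
```

With this check, P04637 and A0A096LNN2 pass while TP53 does not, which is the signal to resolve the gene name to an accession first.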

Quick examples

For detailed examples of the invocations and JSON output schemas returned by various endpoints, see the Example Responses Reference. This TSV contains command-line calls, Python equivalents, and the corresponding JSON payload structures.

1. Determining all protein domains

bash
# Fetches InterPro Entries within UniProt protein P04637
# URL equivalent: /entry/interpro/protein/uniprot/P04637
uv run ./scripts/interpro_client.py fetch entry --source_db interpro --linked_endpoint protein --linked_source_db uniprot --linked_accession P04637 --output p04637_domains.jsonl

2. Fetching all PDB structures for an Entry

bash
# URL equivalent: /structure/pdb/entry/interpro/IPR011615
# Only fetch the first 5 structures
uv run ./scripts/interpro_client.py fetch structure --source_db pdb --linked_endpoint entry --linked_source_db interpro --linked_accession IPR011615 --limit 5 --output ipr011615_structures.jsonl