ncbi-sequence-fetch

NCBI Sequence Fetch

Prerequisites

  1. `uv`: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://www.ncbi.nlm.nih.gov/ and https://www.ncbi.nlm.nih.gov/home/about/policies/, then (2) create the file recording the notification text and timestamp.
  3. `.env` file: Make sure the `.env` file exists in your home directory. Create one if it does not exist.
  4. `NCBI_API_KEY` (optional): Raises the NCBI rate limit from 3 to 10 requests/second. The skill works without it, but a key is recommended if the user plans many queries or encounters a 429 error. The user can obtain one for free by registering at https://www.ncbi.nlm.nih.gov/account/settings/. If the variable is missing from `.env`, do NOT ask the user to paste it into the chat (this would leak the key into the agent's context). Instead, give the user this command, substituting `ENV_FILE` with the resolved literal path to the `.env` file:
    bash
    printf "Enter NCBI API key (typing hidden): " && read -s key && echo && echo "NCBI_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
    The scripts load credentials automatically via `dotenv`. NEVER read, print, or inspect the `.env` file or its variables (e.g. no `cat`, `grep`, `echo`, `printenv`, or `os.environ.get` on keys). Credentials must stay out of the agent's context.
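The User Notification step can be sketched as a small idempotent helper. This is only an illustration under assumptions: the function name, the `notify` callback, and the exact notice wording are hypothetical and not part of the skill. Only the file name and the two URLs come from the text above.

```python
from datetime import datetime, timezone
from pathlib import Path

MARKER_NAME = "LICENSE_NOTIFICATION.txt"
# Hypothetical wording; only the two URLs are taken from the prerequisites.
NOTICE = (
    "This skill queries NCBI. Please check the terms at "
    "https://www.ncbi.nlm.nih.gov/ and "
    "https://www.ncbi.nlm.nih.gov/home/about/policies/"
)


def ensure_license_notification(skill_dir, notify=print):
    """Notify once, then record the notice text and a UTC timestamp.

    Returns True if the notification was shown on this call.
    """
    marker = Path(skill_dir) / MARKER_NAME
    if marker.exists():
        return False  # already notified on an earlier run
    notify(NOTICE)  # step (1): prominently notify the user
    stamp = datetime.now(timezone.utc).isoformat()
    # step (2): create the file with the notification text and timestamp
    marker.write_text(f"{stamp}\n{NOTICE}\n", encoding="utf-8")
    return True
```

Calling the helper a second time against the same directory does nothing, which matches the "does not already exist" condition in step 2.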

Core Rules

  • Use the Wrapper: ALWAYS execute the provided helper scripts to query the database rather than accessing the database directly. The scripts automatically enforce the required rate limit gracefully.
  • API Key Support: If `NCBI_API_KEY` is set in the user's environment, the rate limit is automatically raised from 3 to 10 requests/second.
  • Notification: If this skill is used, ensure this is mentioned in the output.
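To make the rate-limit rule concrete, here is a minimal sketch of the kind of throttle a wrapper could apply. It is not the helper script's actual implementation: the class name and the `has_api_key` flag are assumptions, and only the 3 and 10 requests/second figures come from this document. It is illustrative only; queries should still go through the provided scripts.

```python
import time


class RateLimiter:
    """Space successive calls at least 1/rate seconds apart."""

    def __init__(self, has_api_key=False, clock=time.monotonic, sleep=time.sleep):
        # 10 requests/second with a key, 3 without (figures from this document).
        self.min_interval = 1.0 / (10 if has_api_key else 3)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

The caller passes a boolean rather than the key itself, so the credential never has to be read or printed.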

Overview

Wraps NCBI's Entrez E-utilities (efetch, esearch, elink, esummary) for retrieving protein and nucleotide sequences. Provides 10 subcommands covering the full range of sequence retrieval workflows:
  • `fetch-protein`: Direct protein accession lookup (GenPept, RefSeq)
  • `fetch-nucleotide`: Direct nucleotide accession lookup
  • `cds-translate`: Fetch CDS and translate to protein (3 methods)
  • `search`: Free-text search of any NCBI database
  • `elink`: Follow cross-database links (PubMed→Protein, etc.)
  • `gene-protein`: Search protein by gene name + organism
  • `locus-protein`: Search protein by locus tag + organism
  • `pubmed-proteins`: Find proteins linked to a PubMed article
  • `patent-search`: Extract protein sequences from patents
  • `organism-length`: Last-resort search by organism + exact AA length

Utility Scripts

`scripts/ncbi_fetch.py`: Single script with subcommands.
All subcommands write structured JSON output. Use `--output FILE` to save to a file, or omit it to print to stdout. A human-readable summary is always printed to stdout.

1. Fetch Protein by Accession

Fetches protein FASTA from NCBI by accession (XP_, NP_, GenPept, etc.).
bash
uv run scripts/ncbi_fetch.py fetch-protein XP_022033624 -o /tmp/result.json
uv run scripts/ncbi_fetch.py fetch-protein NP_001234567 ABC12345.1

2. Fetch Nucleotide by Accession

Fetches nucleotide FASTA from NCBI by accession.
bash
uv run scripts/ncbi_fetch.py fetch-nucleotide MK034466 -o /tmp/result.json

3. CDS Translate

Fetches a CDS/nucleotide accession and translates to protein sequence. Tries three approaches in order:
  1. NCBI's pre-translated CDS protein (`fasta_cds_aa`)
  2. GenBank XML CDS annotation translations
  3. Raw nucleotide → 6-frame ORF finding
bash
uv run scripts/ncbi_fetch.py cds-translate MK034466 -o /tmp/result.json
uv run scripts/ncbi_fetch.py cds-translate HQ662330 --target-length 1043
If the accession is a genomic record (not mRNA/CDS), the tool will report `is_genomic: true` so you can fall back to a homology-based approach instead.
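The third fallback, 6-frame ORF finding, can be sketched as follows. This is a generic illustration using the standard genetic code, not necessarily how `ncbi_fetch.py` does it. The function names are hypothetical, and `pick_orf` only mirrors the idea behind `--target-length`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna):
    """Collect Met-initiated ORFs from all three frames of both strands.

    A trailing segment with no stop codon is kept as an open ORF.
    """
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    orfs = []
    for strand in (dna, reverse):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                start = segment.find("M")
                if start != -1:
                    orfs.append(segment[start:])
    return orfs


def pick_orf(orfs, target_length=None):
    """Pick the ORF closest to target_length, or the longest if none is given."""
    if not orfs:
        return None
    if target_length is None:
        return max(orfs, key=len)
    return min(orfs, key=lambda p: abs(len(p) - target_length))
```

On genomic records with introns this approach yields fragments rather than the full protein, which is why the `is_genomic: true` fallback matters.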

4. Search Any Database

Free-text search using Entrez query syntax. Supports all NCBI databases.
bash
# Search protein database
uv run scripts/ncbi_fetch.py search "WRR4B[Gene Name] AND Arabidopsis[Organism]" \
  --database protein --retmax 5 --fetch-sequences

# Search nucleotide database
uv run scripts/ncbi_fetch.py search "Rz2[Gene Name] AND Beta vulgaris[Organism]" \
  --database nuccore --retmax 10

# Search with patent filter
uv run scripts/ncbi_fetch.py search "disease resistance AND Solanum[Organism] AND patent[Properties]" \
  --database protein --fetch-sequences

# Search by sequence length
uv run scripts/ncbi_fetch.py search '"Oryza sativa"[Organism] AND 1043[SLEN]' \
  --database protein --fetch-sequences --retmax 50
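Query strings like the ones above can be assembled programmatically. The helper names below are hypothetical. The sketch only covers the `value[Field]` clauses and `AND` joins shown in the examples, not the full Entrez syntax.

```python
def entrez_term(value, field):
    """Format one value[Field] clause, quoting multi-word values."""
    text = f'"{value}"' if " " in value else value
    return f"{text}[{field}]"


def build_query(*clauses):
    """Join clauses (fielded or free text) with AND."""
    return " AND ".join(clauses)


query = build_query(
    entrez_term("Oryza sativa", "Organism"),
    entrez_term("1043", "SLEN"),
)
print(query)  # "Oryza sativa"[Organism] AND 1043[SLEN]
```

The resulting string is what gets passed as the first argument to the `search` subcommand.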

5. Cross-Database Links (elink)

Follow NCBI's cross-database links (e.g., PubMed article → linked proteins).
bash
uv run scripts/ncbi_fetch.py elink 24896089 --dbfrom pubmed --db protein \
  --fetch-sequences -o /tmp/linked.json

6. Gene + Organism Search

Searches NCBI Protein for sequences by gene name and organism, using the `[Gene Name]` and `[Organism]` qualifiers.
bash
uv run scripts/ncbi_fetch.py gene-protein WRR4B --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py gene-protein Pikh-2 --organism "Oryza sativa" \
  --target-length 1043 -o /tmp/result.json

7. Locus Tag Search

Searches by locus tag in both NCBI Protein and Nuccore databases. Extracts CDS translations from GenBank XML when direct protein hits aren't available.
bash
uv run scripts/ncbi_fetch.py locus-protein At1g56540 --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py locus-protein Niben101Scf02422g02015.1 \
  --organism "Nicotiana benthamiana" -o /tmp/result.json

8. PubMed-Linked Proteins

Finds protein sequences linked to a PubMed article. Searches NCBI Protein by PMID, follows elink PubMed→Protein, and extracts CDS translations from linked Nuccore records.
bash
uv run scripts/ncbi_fetch.py pubmed-proteins 30692254 --identifier WRR4B
uv run scripts/ncbi_fetch.py pubmed-proteins 24896089 --identifier "K2" \
  -o /tmp/result.json

9. Patent Sequence Search

Two modes:
By patent number (fetches all protein sequences from a specific patent):
bash
uv run scripts/ncbi_fetch.py patent-search --patent-number US10123456 -o /tmp/patent.json
By keywords (searches NCBI Protein with the `patent[Properties]` filter):
bash
uv run scripts/ncbi_fetch.py patent-search --keywords WRR4B Albugo --organism "Arabidopsis thaliana" -o /tmp/patent.json
[!IMPORTANT] Patent convention: In molecular biology patents, SEQ ID NO: 1 is typically the DNA sequence and SEQ ID NO: 2 is the primary protein. Higher SEQ ID NOs are variants or related sequences. Prefer Sequence 2 when selecting the primary protein of interest.
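The "prefer Sequence 2" convention can be applied to the JSON results with a small filter. This sketch assumes each result's `header` contains text of the form "Sequence N from patent ...", which is how NCBI commonly titles patent-derived protein records. That header shape and the function name are assumptions, so check them against real output before relying on this.

```python
import re

# Assumed header shape: "... Sequence 2 from patent US 10123456"
SEQ_ID = re.compile(r"Sequence\s+(\d+)\s+from\s+patent", re.IGNORECASE)


def pick_patent_protein(results, preferred_seq_id=2):
    """Return the result whose header names the preferred SEQ ID NO.

    Falls back to the first result, or None for an empty list.
    """
    for result in results:
        match = SEQ_ID.search(result.get("header", ""))
        if match and int(match.group(1)) == preferred_seq_id:
            return result
    return results[0] if results else None
```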

10. Organism + Length Search

Last-resort search when only organism and expected protein length are known. Uses NCBI's `[SLEN]` filter for exact length matching.
bash
uv run scripts/ncbi_fetch.py organism-length \
  --organism "Arabidopsis thaliana" --length 1048 --retmax 50 \
  -o /tmp/result.json
[!NOTE] This often returns multiple candidates. Use the JSON output headers to identify the correct protein.

Workflow

Standard Sequence Retrieval Cascade

When trying to find a protein sequence, follow this priority order:
  1. Direct accession: `fetch-protein` with GenPept/RefSeq accession
  2. CDS translation: `cds-translate` with nucleotide/CDS accession
  3. PubMed-linked: `pubmed-proteins` with PMID + gene name
  4. Locus lookup: `locus-protein` with locus tag + organism
  5. Gene + organism: `gene-protein` with gene name + organism
  6. Patent search: `patent-search` with patent number or keywords
  7. Organism + length: `organism-length` as last resort
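The priority order above can be expressed as a planner that turns whatever identifiers are known into an ordered list of wrapper commands. The function name and the keys of the `known` dictionary are hypothetical. The subcommands and flags are the ones shown in the sections above, and the patent step covers only the patent-number mode.

```python
BASE = ["uv", "run", "scripts/ncbi_fetch.py"]


def plan_cascade(known):
    """Return wrapper commands to try, in the documented priority order."""
    k = known
    steps = []
    if k.get("protein_accession"):
        steps.append(["fetch-protein", k["protein_accession"]])
    if k.get("nucleotide_accession"):
        steps.append(["cds-translate", k["nucleotide_accession"]])
    if k.get("pmid") and k.get("gene"):
        steps.append(["pubmed-proteins", k["pmid"], "--identifier", k["gene"]])
    if k.get("locus_tag") and k.get("organism"):
        steps.append(["locus-protein", k["locus_tag"], "--organism", k["organism"]])
    if k.get("gene") and k.get("organism"):
        steps.append(["gene-protein", k["gene"], "--organism", k["organism"]])
    if k.get("patent_number"):
        steps.append(["patent-search", "--patent-number", k["patent_number"]])
    if k.get("organism") and k.get("length"):
        steps.append(
            ["organism-length", "--organism", k["organism"], "--length", str(k["length"])]
        )
    return [BASE + step for step in steps]
```

A caller would run each command in turn (for example with `subprocess.run`) and stop at the first one whose JSON output contains a non-empty `results` array.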

Interpreting Results

  • All subcommands return JSON with a `results` array
  • Each result has `sequence` (AA string), `length`, and `header`/metadata
  • When multiple results are returned, select by:
    • Closest match to expected length (`target_length`)
    • Header relevance (matching gene name, "disease resistance" keywords)
    • Source priority (RefSeq > GenPept > patent)
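The selection criteria above can be combined into a single sort key. This is a sketch under assumptions: it treats the first token of `header` as the accession, recognizes RefSeq by its two-letter-plus-underscore prefix, and recognizes patent records by the word "patent" in the header. The function names are hypothetical, and the fixed ordering of the three criteria is one possible reading of the list.

```python
import re

REFSEQ = re.compile(r"^[A-Z]{2}_\d+")  # e.g. NP_..., XP_...


def source_rank(header):
    """0 = RefSeq, 1 = GenPept (default), 2 = patent."""
    tokens = header.lstrip(">").split()
    accession = tokens[0] if tokens else ""
    if REFSEQ.match(accession):
        return 0
    if "patent" in header.lower():
        return 2
    return 1


def rank_results(results, target_length=None, keywords=()):
    """Sort by length gap, then missing keywords, then source priority."""
    def key(result):
        header = result.get("header", "")
        gap = abs(result.get("length", 0) - target_length) if target_length else 0
        misses = sum(kw.lower() not in header.lower() for kw in keywords)
        return (gap, misses, source_rank(header))
    return sorted(results, key=key)
```

The first element of the returned list is the preferred candidate, and ties keep their original order because `sorted` is stable.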

Reference

  • NCBI E-utilities docs: https://www.ncbi.nlm.nih.gov/books/NBK25499/
  • Entrez search syntax: https://www.ncbi.nlm.nih.gov/books/NBK49540/
  • Database list: protein, nuccore, gene, pubmed, pmc, biosample, etc.
  • Common accession formats:
    • `XP_`/`NP_`: NCBI RefSeq protein
    • `AAA` to `AZZ` + digits: GenPept (translated GenBank)
    • `MK`, `MN`, `HQ`, etc. + digits: GenBank nucleotide
    • `ENSG`, `ENST`, `ENSP`: Ensembl (use `ensembl-database` skill instead)
    • `Q`, `P`, `O` + 5 alphanumeric characters (e.g. `P12345`): UniProt (use `uniprot-database` skill instead)
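The accession formats above lend themselves to a simple regex router that decides which subcommand or skill to reach for. The patterns below are heuristics, not an authoritative grammar: they cover the prefixes listed here plus the common lengths (three letters + digits for GenPept, two letters + digits for GenBank nucleotide). The category names are hypothetical, and real accessions have further variants.

```python
import re

# Order matters: more specific patterns are checked first.
PATTERNS = [
    ("refseq_protein", re.compile(r"^[XN]P_\d+(\.\d+)?$")),
    ("ensembl", re.compile(r"^ENS[A-Z]*[GTP]\d{11}(\.\d+)?$")),
    ("uniprot", re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")),
    ("genpept", re.compile(r"^[A-Z]{3}\d{5,7}(\.\d+)?$")),
    ("genbank_nucleotide", re.compile(r"^[A-Z]{2}\d{6,8}(\.\d+)?$")),
]


def classify_accession(accession):
    """Return a category name for the accession, or 'unknown'."""
    token = accession.strip().upper()
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return "unknown"
```

For example, `refseq_protein` and `genpept` would route to `fetch-protein`, `genbank_nucleotide` to `fetch-nucleotide` or `cds-translate`, and `ensembl` or `uniprot` to the other skills named above.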