ncbi-sequence-fetch

NCBI Sequence Fetch

Prerequisites

- `uv`: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://www.ncbi.nlm.nih.gov/ and https://www.ncbi.nlm.nih.gov/home/about/policies/, then (2) create the file recording the notification text and timestamp.
- `.env` file: Make sure the `.env` file exists in your home directory. Create one if it does not exist.
- `NCBI_API_KEY` (optional): Raises the NCBI rate limit from 3 to 10 requests/second. The skill works without it, but a key is recommended if the user plans many queries or encounters a 429 error. The user can obtain one for free by registering at https://www.ncbi.nlm.nih.gov/account/settings/. If the variable is missing from `.env`, do NOT ask the user to paste it into the chat (this would leak the key into the agent's context). Instead, give the user this command, substituting `ENV_FILE` with the resolved literal path to the `.env` file:

  ```bash
  printf "Enter NCBI API key (typing hidden): " && read -s key && echo && echo "NCBI_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
  ```

The scripts load credentials automatically via `dotenv`. NEVER read, print, or inspect the `.env` file or its variables (e.g. no `cat`, `grep`, `echo`, `printenv`, or `os.environ.get` on keys). Credentials must stay out of the agent's context.
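
The User Notification step can be sketched as a small check-then-create helper. This is only an illustration: the function name `ensure_license_notification` and the wording of `NOTICE` are assumptions, not part of the skill. It touches only LICENSE_NOTIFICATION.txt and never the `.env` file.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical notice wording; only the two URLs come from the skill text.
NOTICE = (
    "Please review the NCBI terms at https://www.ncbi.nlm.nih.gov/ and "
    "https://www.ncbi.nlm.nih.gov/home/about/policies/ before using this skill."
)


def ensure_license_notification(skill_dir, notice=NOTICE):
    """Return True if the marker file was just created (user must be notified)."""
    marker = Path(skill_dir) / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # the user was notified on an earlier run
    timestamp = datetime.now(timezone.utc).isoformat()
    # Record both the timestamp and the exact text shown to the user.
    marker.write_text(f"{timestamp}\n{notice}\n", encoding="utf-8")
    return True
```

When the function returns `True`, the caller would show `notice` prominently to the user. A `False` return means the file already existed and nothing further is needed.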

Core Rules

- Use the Wrapper: ALWAYS execute the provided helper scripts to query the database rather than accessing the database directly. The scripts enforce the required rate limit automatically.
- API Key Support: If `NCBI_API_KEY` is set in the user's environment, the scripts raise the rate limit from 3 to 10 requests/second automatically.
- Notification: If this skill is used, ensure this is mentioned in the output.
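
For intuition about what enforcing the rate limit involves, here is a minimal sketch of an interval-based limiter. It is not the wrapper's actual implementation, which this document does not show. `RateLimiter` and its `has_api_key` flag are hypothetical names, and the flag is a plain boolean so the key itself is never read.

```python
import time


class RateLimiter:
    """Space successive requests at least 1/rate seconds apart."""

    def __init__(self, has_api_key=False, clock=time.monotonic, sleep=time.sleep):
        # 3 requests/second without a key, 10 with one (limits stated above).
        self.min_interval = 1.0 / (10 if has_api_key else 3)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # start time of the previous request

    def wait(self):
        """Block until the next request is allowed, then record its start time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` before each request keeps a single process under the limit. It does not coordinate across several processes, which is one more reason to route every query through the provided scripts.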

Overview

Wraps NCBI's Entrez E-utilities (efetch, esearch, elink, esummary) for retrieving protein and nucleotide sequences. Provides 10 subcommands covering the full range of sequence retrieval workflows:

- `fetch-protein`: Direct protein accession lookup (GenPept, RefSeq)
- `fetch-nucleotide`: Direct nucleotide accession lookup
- `cds-translate`: Fetch CDS and translate to protein (3 methods)
- `search`: Free-text search of any NCBI database
- `elink`: Follow cross-database links (PubMed→Protein, etc.)
- `gene-protein`: Search protein by gene name + organism
- `locus-protein`: Search protein by locus tag + organism
- `pubmed-proteins`: Find proteins linked to a PubMed article
- `patent-search`: Extract protein sequences from patents
- `organism-length`: Last-resort search by organism + exact AA length

Utility Scripts

`scripts/ncbi_fetch.py`: Single script with subcommands.

All subcommands write structured JSON output. Use `--output FILE` (abbreviated to `-o` in the examples) to save to a file, or omit it to print to stdout. A human-readable summary is always printed to stdout.
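
When driving the script from Python, one option is to always pass `--output` and load the saved file, so the JSON never has to be separated from the human-readable summary on stdout. The helper names `build_argv` and `run_ncbi_fetch` below are hypothetical, and the sketch assumes the output file holds a single JSON document.

```python
import json
import subprocess


def build_argv(subcommand, *args, output=None):
    """Assemble the command line for scripts/ncbi_fetch.py."""
    argv = ["uv", "run", "scripts/ncbi_fetch.py", subcommand, *args]
    if output is not None:
        argv += ["--output", output]
    return argv


def run_ncbi_fetch(subcommand, *args, output):
    """Run the wrapper, then load the JSON it saved to `output`."""
    # check=True raises CalledProcessError if the script exits non-zero.
    subprocess.run(build_argv(subcommand, *args, output=output), check=True)
    with open(output, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `build_argv("fetch-protein", "XP_022033624", output="/tmp/result.json")` produces the same invocation as the first example in the next section, with `--output` spelled out in full.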

1. Fetch Protein by Accession

Fetches protein FASTA from NCBI by accession (XP_, NP_, GenPept, etc.).

```bash
uv run scripts/ncbi_fetch.py fetch-protein XP_022033624 -o /tmp/result.json
uv run scripts/ncbi_fetch.py fetch-protein NP_001234567 ABC12345.1
```
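
To illustrate how FASTA text maps onto the `header`, `sequence`, and `length` fields described under Interpreting Results, here is a minimal parser sketch. The name `parse_fasta` is hypothetical, and the wrapper's own parsing may differ.

```python
def parse_fasta(text):
    """Split FASTA text into records with header, sequence, and length."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append(_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across many rows
    if header is not None:
        records.append(_record(header, chunks))
    return records


def _record(header, chunks):
    sequence = "".join(chunks)
    return {"header": header, "sequence": sequence, "length": len(sequence)}
```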

2. Fetch Nucleotide by Accession

Fetches nucleotide FASTA from NCBI by accession.

```bash
uv run scripts/ncbi_fetch.py fetch-nucleotide MK034466 -o /tmp/result.json
```

3. CDS Translate

Fetches a CDS/nucleotide accession and translates it to a protein sequence. Tries three approaches in order:

1. NCBI's pre-translated CDS protein (`fasta_cds_aa`)
2. GenBank XML CDS annotation translations
3. Raw nucleotide → 6-frame ORF finding

```bash
uv run scripts/ncbi_fetch.py cds-translate MK034466 -o /tmp/result.json
uv run scripts/ncbi_fetch.py cds-translate HQ662330 --target-length 1043
```

If the accession is a genomic record (not mRNA/CDS), the tool reports `is_genomic: true` so you can fall back to a homology-based approach instead.
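
The third approach, 6-frame ORF finding, can be sketched in plain Python using the standard genetic code. This is an illustration under stated assumptions, not the script's actual algorithm. An ORF here is any stretch from a methionine to the next stop codon or to the end of the frame. The `target_length` argument only echoes the idea behind `--target-length`, since how the real flag is applied is not documented here.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons (e.g. with N) become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_orfs(dna):
    """Yield every M-initiated stretch up to a stop codon or the frame's end."""
    forward = dna.upper().replace("U", "T")
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    for strand in (forward, reverse):
        for offset in range(3):
            for segment in translate(strand[offset:]).split("*"):
                start = segment.find("M")
                if start != -1:
                    yield segment[start:]


def best_orf(dna, target_length=None):
    """Pick the longest ORF, or the one closest to target_length if given."""
    orfs = list(six_frame_orfs(dna))
    if not orfs:
        return None
    if target_length is None:
        return max(orfs, key=len)
    return min(orfs, key=lambda orf: abs(len(orf) - target_length))
```

For instance, `best_orf("ATGGCCAAATAA")` yields `"MAK"`, and so does its reverse complement `"TTATTTGGCCAT"`, because both strands are scanned.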

4. Search Any Database

Free-text search using Entrez query syntax. Supports all NCBI databases.

```bash
# Search protein database
uv run scripts/ncbi_fetch.py search "WRR4B[Gene Name] AND Arabidopsis[Organism]" \
  --database protein --retmax 5 --fetch-sequences

# Search nucleotide database
uv run scripts/ncbi_fetch.py search "Rz2[Gene Name] AND Beta vulgaris[Organism]" \
  --database nuccore --retmax 10

# Search with patent filter
uv run scripts/ncbi_fetch.py search "disease resistance AND Solanum[Organism] AND patent[Properties]" \
  --database protein --fetch-sequences

# Search by sequence length
uv run scripts/ncbi_fetch.py search '"Oryza sativa"[Organism] AND 1043[SLEN]' \
  --database protein --fetch-sequences --retmax 50
```


5. Cross-Database Links (elink)

Follows NCBI's cross-database links (e.g., PubMed article → linked proteins).

```bash
uv run scripts/ncbi_fetch.py elink 24896089 --dbfrom pubmed --db protein \
  --fetch-sequences -o /tmp/linked.json
```

6. Gene + Organism Search

Searches NCBI Protein for sequences by gene name and organism, using the `[Gene Name]` and `[Organism]` qualifiers.

```bash
uv run scripts/ncbi_fetch.py gene-protein WRR4B --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py gene-protein Pikh-2 --organism "Oryza sativa" \
  --target-length 1043 -o /tmp/result.json
```
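
The qualifier syntax can be sketched as a small query builder. The exact query string that `gene-protein` sends is not shown in this document, so treat the following as an approximation. `entrez_term` and `gene_organism_query` are hypothetical names, and quoting multi-word values follows the `"Oryza sativa"[Organism]` example in the search section.

```python
def entrez_term(value, field):
    """Format one field-qualified term, quoting values that contain spaces."""
    value = value.strip()
    if " " in value:
        value = f'"{value}"'
    return f"{value}[{field}]"


def gene_organism_query(gene, organism, length=None):
    """Join gene, organism, and an optional exact length with AND."""
    terms = [entrez_term(gene, "Gene Name"), entrez_term(organism, "Organism")]
    if length is not None:
        terms.append(f"{int(length)}[SLEN]")  # exact sequence-length filter
    return " AND ".join(terms)
```

A string built this way could be passed to the `search` subcommand when the dedicated `gene-protein` subcommand is not flexible enough.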

7. Locus Tag Search

Searches by locus tag in both the NCBI Protein and Nuccore databases. Extracts CDS translations from GenBank XML when direct protein hits aren't available.

```bash
uv run scripts/ncbi_fetch.py locus-protein At1g56540 --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py locus-protein Niben101Scf02422g02015.1 \
  --organism "Nicotiana benthamiana" -o /tmp/result.json
```

8. PubMed-Linked Proteins

Finds protein sequences linked to a PubMed article. Searches NCBI Protein by PMID, follows elink PubMed→Protein, and extracts CDS translations from linked Nuccore records.

```bash
uv run scripts/ncbi_fetch.py pubmed-proteins 30692254 --identifier WRR4B
uv run scripts/ncbi_fetch.py pubmed-proteins 24896089 --identifier "K2" \
  -o /tmp/result.json
```

9. Patent Sequence Search

Two modes:

By patent number: fetches all protein sequences from a specific patent:

```bash
uv run scripts/ncbi_fetch.py patent-search --patent-number US10123456 -o /tmp/patent.json
```

By keywords: searches NCBI Protein with the `patent[Properties]` filter:

```bash
uv run scripts/ncbi_fetch.py patent-search --keywords WRR4B Albugo --organism "Arabidopsis thaliana" -o /tmp/patent.json
```

[!IMPORTANT] Patent convention: In molecular biology patents, SEQ ID NO: 1 is typically the DNA sequence and SEQ ID NO: 2 is the primary protein. Higher SEQ ID NOs are variants or related sequences. Prefer Sequence 2 when selecting the primary protein of interest.
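
The Sequence 2 preference can be applied programmatically. The sketch below assumes each result's `header` contains text of the form "Sequence N from patent ...", which is how patent-derived protein records are commonly titled. That header format, the function names, and the fallback rule are all assumptions to check against real output.

```python
import re

SEQ_NO = re.compile(r"\bSequence\s+(\d+)\b", re.IGNORECASE)


def seq_id_no(header):
    """Return the SEQ ID NO parsed from a header, or None if absent."""
    match = SEQ_NO.search(header)
    return int(match.group(1)) if match else None


def pick_primary_protein(results):
    """Prefer Sequence 2; otherwise fall back to the lowest-numbered record."""
    numbered = [(seq_id_no(r["header"]), r) for r in results]
    numbered = [(n, r) for n, r in numbered if n is not None]
    if not numbered:
        return None
    for n, r in numbered:
        if n == 2:
            return r
    # Arbitrary fallback for this sketch: take the smallest SEQ ID NO present.
    return min(numbered, key=lambda pair: pair[0])[1]
```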

10. Organism + Length Search

Last-resort search when only organism and expected protein length are known. Uses NCBI's `[SLEN]` filter for exact length matching.

```bash
uv run scripts/ncbi_fetch.py organism-length \
  --organism "Arabidopsis thaliana" --length 1048 --retmax 50 \
  -o /tmp/result.json
```

[!NOTE] This often returns multiple candidates. Use the JSON output headers to identify the correct protein.

Workflow

Standard Sequence Retrieval Cascade

When trying to find a protein sequence, follow this priority order:

1. Direct accession: `fetch-protein` with GenPept/RefSeq accession
2. CDS translation: `cds-translate` with nucleotide/CDS accession
3. PubMed-linked: `pubmed-proteins` with PMID + gene name
4. Locus lookup: `locus-protein` with locus tag + organism
5. Gene + organism: `gene-protein` with gene name + organism
6. Patent search: `patent-search` with patent number or keywords
7. Organism + length: `organism-length` as last resort
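
The cascade can be expressed as a lookup from known inputs to candidate subcommands. This is a planning aid only. The input keys (`protein_accession`, `pmid`, and so on) are hypothetical names chosen for this sketch, and the keyword mode of `patent-search` is left out for brevity.

```python
# (subcommand, inputs it needs), in the priority order listed above.
CASCADE = [
    ("fetch-protein", {"protein_accession"}),
    ("cds-translate", {"nucleotide_accession"}),
    ("pubmed-proteins", {"pmid", "gene"}),
    ("locus-protein", {"locus_tag", "organism"}),
    ("gene-protein", {"gene", "organism"}),
    ("patent-search", {"patent_number"}),
    ("organism-length", {"organism", "length"}),
]


def plan_cascade(known):
    """List the subcommands whose inputs are all available, in priority order."""
    available = {key for key, value in known.items() if value}
    return [name for name, needs in CASCADE if needs <= available]
```

Given only a gene name, an organism, and an expected length, the plan is `["gene-protein", "organism-length"]`. The caller would try each in turn and stop at the first that returns a plausible sequence.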

Interpreting Results

- All subcommands return JSON with a `results` array
- Each result has `sequence` (AA string), `length`, and `header`/metadata
- When multiple results are returned, select by:
  - Closest match to expected length (`target_length`)
  - Header relevance (matching gene name, "disease resistance" keywords)
  - Source priority (RefSeq > GenPept > patent)
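
These selection criteria can be combined into a single sort key. The sketch below assumes the criteria apply in the order listed and that each header begins with the accession. The `source_rank` heuristics (an underscore in the third position for RefSeq, "from patent" for patent records) are assumptions, not documented behaviour of the script.

```python
def source_rank(result):
    """0 = RefSeq, 1 = GenPept, 2 = patent (heuristics on the header text)."""
    header = result["header"]
    if "from patent" in header.lower():
        return 2
    tokens = header.split()
    first_token = tokens[0] if tokens else ""
    # RefSeq accessions carry an underscore after a two-letter prefix (XP_, NP_).
    return 0 if first_token[2:3] == "_" else 1


def pick_best(results, target_length=None, keywords=()):
    """Choose by length distance, then keyword hits, then source priority."""
    def key(result):
        distance = abs(result["length"] - target_length) if target_length else 0
        hits = sum(kw.lower() in result["header"].lower() for kw in keywords)
        return (distance, -hits, source_rank(result))
    return min(results, key=key) if results else None
```

Because the key is a tuple, a length mismatch always outweighs keyword matches and source priority. Reorder the tuple if a different weighting suits the task.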

Reference

- NCBI E-utilities docs: https://www.ncbi.nlm.nih.gov/books/NBK25499/
- Entrez search syntax: https://www.ncbi.nlm.nih.gov/books/NBK49540/
- Database list: protein, nuccore, gene, pubmed, pmc, biosample, etc.
- Common accession formats:
  - `XP_` / `NP_`: NCBI RefSeq protein
  - `AAA` to `AZZ` + digits: GenPept (translated GenBank)
  - `MK`, `MN`, `HQ`, etc. + digits: GenBank nucleotide
  - `ENSG`, `ENST`, `ENSP`: Ensembl (use `ensembl-database` skill instead)
  - `Q`, `P`, `O` + digits: UniProt (use `uniprot-database` skill instead)
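
A rough classifier for these formats can help route an identifier to the right subcommand or skill. The regular expressions below are simplified approximations of the real accession grammars and will misclassify some edge cases. The labels and pattern order are choices made for this sketch.

```python
import re

# Checked in order; the first matching pattern wins.
ACCESSION_PATTERNS = [
    ("refseq_protein", re.compile(r"^[A-Z]P_\d+$")),
    ("ensembl", re.compile(r"^ENS[A-Z]*[GTP]\d{11}$")),
    ("uniprot", re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$")),
    ("genpept", re.compile(r"^[A-Z]{3}\d{5,7}$")),
    ("genbank_nucleotide", re.compile(r"^[A-Z]{1,2}\d{5,8}$")),
]


def classify_accession(accession):
    """Return a coarse label for an accession, ignoring any .version suffix."""
    core = accession.strip().upper().split(".")[0]
    for label, pattern in ACCESSION_PATTERNS:
        if pattern.match(core):
            return label
    return "unknown"
```

UniProt is tested before GenBank nucleotide because an identifier such as `P12345` also fits the one-letter-plus-five-digits nucleotide pattern. Locus tags such as `At1g56540` fall through to `"unknown"` and belong with `locus-protein`.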
