NCBI Sequence Fetch
Prerequisites
- uv: Read the skill that covers uv and follow its Setup instructions to ensure
  uv is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in
  this skill directory, then (1) prominently notify the user to check the terms
  at https://www.ncbi.nlm.nih.gov/ and
  https://www.ncbi.nlm.nih.gov/home/about/policies/, and (2) create the file,
  recording the notification text and timestamp.
- Env file: Make sure the env file (ENV_FILE in the command below) exists in
  your home directory. Create one if it does not exist.
- NCBI_API_KEY (optional): Raises the NCBI rate limit from 3 to 10
  requests/second. The skill works without it, but a key is recommended if the
  user plans many queries or encounters a 429 error. The user can obtain one
  for free by registering at
  https://www.ncbi.nlm.nih.gov/account/settings/.
  If the variable is missing from the env file, do NOT ask the user to paste it
  into the chat (this would leak the key into the agent's context). Instead,
  give the user this command, substituting ENV_FILE with the resolved literal
  path to the file:
```bash
printf "Enter NCBI API key (typing hidden): " && read -s key && echo && echo "NCBI_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
```
The scripts load credentials automatically. NEVER read, print, or inspect the
env file or its variables with any command. Credentials must stay out of the
agent's context.
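The user-notification prerequisite above can be sketched as follows. This is only an illustration: the function name `ensure_license_notification` and the exact wording of the notice are assumptions, and the file layout (timestamp on the first line, notice text after it) is one possible way to record "notification text and timestamp".

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical notice wording; only the two URLs come from the skill text.
NOTICE = (
    "This skill queries NCBI. Please check the terms at "
    "https://www.ncbi.nlm.nih.gov/ and "
    "https://www.ncbi.nlm.nih.gov/home/about/policies/"
)


def ensure_license_notification(skill_dir: Path) -> bool:
    """Notify once per skill directory; return True if a notice was shown."""
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # user was already notified earlier
    print(NOTICE)  # step (1): show the notice to the user
    stamp = datetime.now(timezone.utc).isoformat()
    # step (2): record the timestamp and the notification text
    marker.write_text(f"{stamp}\n{NOTICE}\n", encoding="utf-8")
    return True
```

Calling the function a second time on the same directory returns `False` and prints nothing, which matches the "does not already exist" condition.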
Core Rules
- Use the Wrapper: ALWAYS execute the provided helper scripts to query the
database rather than accessing the database directly. The scripts
automatically enforce the required rate limit gracefully.
- API Key Support: If the user provides an NCBI_API_KEY in their
  environment, the rate limit rises automatically from 3 to 10
  requests/second.
- Notification: If this skill is used, ensure this is mentioned in the
output.
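To make the rate-limit rule concrete, here is a minimal sketch of a minimum-interval throttle using the 3 and 10 requests/second figures stated in the prerequisites. It is not the wrapper's actual implementation; `MinIntervalLimiter` and `limiter_for` are hypothetical names, and the caller passes only a boolean so the key itself is never read.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 1/requests_per_second seconds apart."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


def limiter_for(has_api_key: bool) -> MinIntervalLimiter:
    # 10 requests/second with a key, 3 without (figures from the prerequisites).
    return MinIntervalLimiter(10.0 if has_api_key else 3.0)
```

Injecting `clock` and `sleep` keeps the throttle testable without real waiting. In normal use the helper scripts already handle this, so there is no reason to call NCBI directly.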
Overview
Wraps NCBI's Entrez E-utilities (efetch, esearch, elink, esummary) for
retrieving protein and nucleotide sequences. Provides 10 subcommands covering
the full range of sequence retrieval workflows:
- fetch-protein: Direct protein accession lookup (GenPept, RefSeq)
- fetch-nucleotide: Direct nucleotide accession lookup
- cds-translate: Fetch CDS and translate to protein (3 methods)
- search: Free-text search of any NCBI database
- elink: Follow cross-database links (PubMed→Protein, etc.)
- gene-protein: Search protein by gene name + organism
- locus-protein: Search protein by locus tag + organism
- pubmed-proteins: Find proteins linked to a PubMed article
- patent-search: Extract protein sequences from patents
- organism-length: Last-resort search by organism + exact AA length
Utility Scripts
scripts/ncbi_fetch.py: Single script with subcommands.
All subcommands write structured JSON output. Use -o to save it to a file, or
omit it to print to stdout. A human-readable summary is always printed to
stdout.
1. Fetch Protein by Accession
Fetches protein FASTA from NCBI by accession (XP_, NP_, GenPept, etc.)
```bash
uv run scripts/ncbi_fetch.py fetch-protein XP_022033624 -o /tmp/result.json
uv run scripts/ncbi_fetch.py fetch-protein NP_001234567 ABC12345.1
```
2. Fetch Nucleotide by Accession
Fetches nucleotide FASTA from NCBI by accession.
```bash
uv run scripts/ncbi_fetch.py fetch-nucleotide MK034466 -o /tmp/result.json
```
3. CDS Translate
Fetches a CDS/nucleotide accession and translates it to a protein sequence.
Tries three approaches in order:
1. NCBI's pre-translated CDS protein
2. GenBank XML CDS annotation translations
3. Raw nucleotide → 6-frame ORF finding
```bash
uv run scripts/ncbi_fetch.py cds-translate MK034466 -o /tmp/result.json
uv run scripts/ncbi_fetch.py cds-translate HQ662330 --target-length 1043
```
If the accession is a genomic record (not mRNA/CDS), the tool reports this so
you can fall back to a homology-based approach instead.
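The third fallback (raw nucleotide to 6-frame ORF finding) can be illustrated with a small standalone sketch. It assumes the standard genetic code and treats an ORF as a run from the first methionine to the next stop codon. The real script may define ORFs differently, and the function names here are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate codon by codon; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna: str):
    """Yield (strand, frame, protein) for each M...stop ORF in all six frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse)):
        for frame in range(3):
            protein = translate(seq[frame:])
            # The last piece has no terminating stop codon, so it is skipped.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1:
                    yield strand, frame, segment[start:]


def best_orf(dna: str, target_length=None):
    """Pick the longest ORF, or the one closest to target_length if given."""
    orfs = [protein for _, _, protein in six_frame_orfs(dna)]
    if not orfs:
        return None
    if target_length is None:
        return max(orfs, key=len)
    return min(orfs, key=lambda p: abs(len(p) - target_length))
```

For example, `best_orf("CCATGAAATTTTAGGG")` finds the single ORF in forward frame 2 and returns `"MKF"`. Passing a target length mirrors the role of `--target-length` in the commands above, although how the script uses that flag internally is not documented here.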
4. Search Any Database
Free-text search using Entrez query syntax. Supports all NCBI databases.
```bash
# Search protein database
uv run scripts/ncbi_fetch.py search "WRR4B[Gene Name] AND Arabidopsis[Organism]" \
  --database protein --retmax 5 --fetch-sequences
# Search nucleotide database
uv run scripts/ncbi_fetch.py search "Rz2[Gene Name] AND Beta vulgaris[Organism]" \
  --database nuccore --retmax 10
# Search with patent filter
uv run scripts/ncbi_fetch.py search "disease resistance AND Solanum[Organism] AND patent[Properties]" \
  --database protein --fetch-sequences
# Search by sequence length
uv run scripts/ncbi_fetch.py search '"Oryza sativa"[Organism] AND 1043[SLEN]' \
  --database protein --fetch-sequences --retmax 50
```
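When queries are assembled programmatically, quoting and field tags are easy to get wrong. The sketch below builds Entrez terms like the ones shown above and the matching argument list for the `search` subcommand. The helper names (`field`, `build_term`, `search_argv`) are hypothetical; the field tags and flags are the ones used in the examples.

```python
def field(value: str, tag: str) -> str:
    """Attach an Entrez field tag, quoting values that contain spaces."""
    if " " in value:
        value = f'"{value}"'
    return f"{value}[{tag}]"


def build_term(gene=None, organism=None, length=None, patent_only=False) -> str:
    """Combine the supplied criteria with AND."""
    parts = []
    if gene:
        parts.append(field(gene, "Gene Name"))
    if organism:
        parts.append(field(organism, "Organism"))
    if length is not None:
        parts.append(f"{length}[SLEN]")
    if patent_only:
        parts.append("patent[Properties]")
    return " AND ".join(parts)


def search_argv(term, database="protein", retmax=5, fetch_sequences=False):
    """Argument list suitable for subprocess.run(); no shell quoting needed."""
    argv = [
        "uv", "run", "scripts/ncbi_fetch.py", "search", term,
        "--database", database, "--retmax", str(retmax),
    ]
    if fetch_sequences:
        argv.append("--fetch-sequences")
    return argv
```

Passing the term as a single list element avoids the nested shell quoting needed for queries such as `'"Oryza sativa"[Organism] AND 1043[SLEN]'`.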
5. Cross-Database Links (elink)
Follow NCBI's cross-database links (e.g., PubMed article → linked proteins).
```bash
uv run scripts/ncbi_fetch.py elink 24896089 --dbfrom pubmed --db protein \
  --fetch-sequences -o /tmp/linked.json
```
6. Gene + Organism Search
Searches for protein sequences by gene name and organism. Searches NCBI Protein
with [Gene Name] and [Organism] qualifiers.
```bash
uv run scripts/ncbi_fetch.py gene-protein WRR4B --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py gene-protein Pikh-2 --organism "Oryza sativa" \
  --target-length 1043 -o /tmp/result.json
```
7. Locus Tag Search
Searches by locus tag in both NCBI Protein and Nuccore databases. Extracts CDS
translations from GenBank XML when direct protein hits aren't available.
```bash
uv run scripts/ncbi_fetch.py locus-protein At1g56540 --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py locus-protein Niben101Scf02422g02015.1 \
  --organism "Nicotiana benthamiana" -o /tmp/result.json
```
8. PubMed-Linked Proteins
Finds protein sequences linked to a PubMed article. Searches NCBI Protein by
PMID, follows elink PubMed→Protein, and extracts CDS translations from linked
Nuccore records.
```bash
uv run scripts/ncbi_fetch.py pubmed-proteins 30692254 --identifier WRR4B
uv run scripts/ncbi_fetch.py pubmed-proteins 24896089 --identifier "K2" \
  -o /tmp/result.json
```
9. Patent Sequence Search
Two modes:

By patent number: fetches all protein sequences from a specific patent:

```bash
uv run scripts/ncbi_fetch.py patent-search --patent-number US10123456 -o /tmp/patent.json
```

By keywords: searches NCBI Protein with the patent[Properties] filter:

```bash
uv run scripts/ncbi_fetch.py patent-search --keywords WRR4B Albugo --organism "Arabidopsis thaliana" -o /tmp/patent.json
```
[!IMPORTANT] Patent convention: In molecular biology patents, SEQ ID NO: 1
is typically the DNA sequence and SEQ ID NO: 2 is the primary protein. Higher
SEQ ID NOs are variants or related sequences. Prefer Sequence 2 when selecting
the primary protein of interest.
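The "prefer Sequence 2" convention can be applied mechanically once the patent records are loaded. The sketch below assumes each record is a dict with a `header` key and that patent headers contain text of the form "Sequence N from patent ..."; both are assumptions about the output, so check a real result file before relying on them.

```python
import re

# Assumed header shape, e.g. "Sequence 2 from patent US 10123456".
SEQ_NO = re.compile(r"\bSequence (\d+) from patent\b", re.IGNORECASE)


def pick_primary_protein(records):
    """Return the record for Sequence 2, else the lowest-numbered one, else None."""
    numbered = []
    for rec in records:
        match = SEQ_NO.search(rec.get("header", ""))
        if match:
            numbered.append((int(match.group(1)), rec))
    if not numbered:
        return None
    for number, rec in numbered:
        if number == 2:
            return rec
    # No Sequence 2 present: fall back to the lowest sequence number.
    return min(numbered, key=lambda pair: pair[0])[1]
```

The fallback to the lowest-numbered sequence is a design choice of this sketch, not part of the stated convention; a human check of the patent is still advisable when Sequence 2 is absent.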
10. Organism + Length Search
Last-resort search when only organism and expected protein length are known.
Uses NCBI's [SLEN] filter for exact length matching.
```bash
uv run scripts/ncbi_fetch.py organism-length \
  --organism "Arabidopsis thaliana" --length 1048 --retmax 50 \
  -o /tmp/result.json
```
[!NOTE] This often returns multiple candidates. Use the JSON output headers to
identify the correct protein.
Workflow
Standard Sequence Retrieval Cascade
When trying to find a protein sequence, follow this priority order:
- Direct accession: fetch-protein with GenPept/RefSeq accession
- CDS translation: cds-translate with nucleotide/CDS accession
- PubMed-linked: pubmed-proteins with PMID + gene name
- Locus lookup: locus-protein with locus tag + organism
- Gene + organism: gene-protein with gene name + organism
- Patent search: patent-search with patent number or keywords
- Organism + length: organism-length as last resort
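The cascade above can be expressed as a small planner that turns whatever identifiers are available into an ordered list of commands. This is a sketch of the priority order only: `plan_lookups` is a hypothetical helper, it covers the patent-number mode but not the keyword mode, and it does not run anything.

```python
def plan_lookups(protein_accession=None, nucleotide_accession=None, pmid=None,
                 gene=None, locus_tag=None, organism=None,
                 patent_number=None, length=None):
    """Return command argument lists in the cascade's priority order."""
    base = ["uv", "run", "scripts/ncbi_fetch.py"]
    plan = []
    if protein_accession:
        plan.append(base + ["fetch-protein", protein_accession])
    if nucleotide_accession:
        plan.append(base + ["cds-translate", nucleotide_accession])
    if pmid and gene:
        plan.append(base + ["pubmed-proteins", str(pmid), "--identifier", gene])
    if locus_tag and organism:
        plan.append(base + ["locus-protein", locus_tag, "--organism", organism])
    if gene and organism:
        plan.append(base + ["gene-protein", gene, "--organism", organism])
    if patent_number:
        plan.append(base + ["patent-search", "--patent-number", patent_number])
    if organism and length is not None:
        plan.append(base + ["organism-length", "--organism", organism,
                            "--length", str(length)])
    return plan
```

A caller would try the commands in order and stop at the first one that yields a usable sequence.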
Interpreting Results
- All subcommands return JSON containing an array of results
- Each result carries the sequence (AA string) along with its header and
  metadata
- When multiple results are returned, select by:
  - Closest match to expected length (--target-length)
  - Header relevance (matching gene name, "disease resistance" keywords)
  - Source priority (RefSeq > GenPept > patent)
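One way to apply these selection criteria is to sort candidates with a composite key. The sketch below treats the three criteria as successive tie-breakers, which is one interpretation of the list. The record keys `accession`, `header`, and `sequence` are hypothetical, as is the rule used to infer the source, so adapt them to the actual JSON.

```python
SOURCE_RANK = {"refseq": 0, "genpept": 1, "patent": 2}


def source_of(rec) -> str:
    """Rough source guess from the accession prefix and header text."""
    accession = rec.get("accession", "").upper()
    if accession.startswith(("NP_", "XP_")):
        return "refseq"
    if "patent" in rec.get("header", "").lower():
        return "patent"
    return "genpept"


def rank_candidates(records, target_length=None, keywords=()):
    """Sort by length gap, then keyword hits in the header, then source."""
    def key(rec):
        gap = 0
        if target_length is not None:
            gap = abs(len(rec["sequence"]) - target_length)
        header = rec.get("header", "").lower()
        hits = sum(k.lower() in header for k in keywords)
        return (gap, -hits, SOURCE_RANK[source_of(rec)])

    return sorted(records, key=key)
```

The first element of the returned list is the preferred candidate. Because `sorted` is stable, records that tie on all three criteria keep their original order.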
Reference
- NCBI E-utilities docs: https://www.ncbi.nlm.nih.gov/books/NBK25499/
- Entrez search syntax: https://www.ncbi.nlm.nih.gov/books/NBK49540/
- Database list: protein, nuccore, gene, pubmed, pmc, biosample, etc.
- Common accession formats:
  - NP_ / XP_ + digits: NCBI RefSeq protein
  - Three letters + digits (e.g. ABC12345.1): GenPept (translated GenBank)
  - Two-letter prefix (e.g. MK, HQ) + digits: GenBank nucleotide
  - Ensembl identifiers: handled by a separate skill, not this one
  - UniProt accessions: handled by a separate skill, not this one
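A quick format check can route an identifier to the right subcommand before any request is made. The patterns below are deliberately simplified assumptions covering only the formats seen in this document's examples (NP_/XP_ RefSeq proteins, three-letter plus five-digit GenPept, two-letter plus six-digit GenBank nucleotide); everything else, including Ensembl and UniProt identifiers, falls through to "unknown". `classify_accession` is a hypothetical helper.

```python
import re

# Simplified patterns; an optional ".N" version suffix is accepted.
PATTERNS = [
    ("refseq_protein", re.compile(r"^[NX]P_\d+(\.\d+)?$")),
    ("genpept", re.compile(r"^[A-Z]{3}\d{5}(\.\d+)?$")),
    ("genbank_nucleotide", re.compile(r"^[A-Z]{2}\d{6}(\.\d+)?$")),
]

# Which subcommand a given class would normally be sent to first.
SUBCOMMAND = {
    "refseq_protein": "fetch-protein",
    "genpept": "fetch-protein",
    "genbank_nucleotide": "cds-translate",
}


def classify_accession(accession: str) -> str:
    """Return the first matching class name, or 'unknown'."""
    accession = accession.strip().upper()
    for name, pattern in PATTERNS:
        if pattern.match(accession):
            return name
    return "unknown"
```

For instance, `classify_accession("MK034466")` gives `"genbank_nucleotide"`, which maps to `cds-translate` when a protein is wanted (or `fetch-nucleotide` for the raw record).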