NCBI Sequence Fetch

Prerequisites

uv
: Read the
```
uv
```
skill and follow its Setup instructions to ensure
```
uv
```
is installed and on PATH.
User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://www.ncbi.nlm.nih.gov/ and https://www.ncbi.nlm.nih.gov/home/about/policies/, then (2) create the file recording the notification text and timestamp.
.env
file: Make sure the
```
.env
```
file exists in your home directory. Create one if it does not exist.
NCBI_API_KEY
(optional): Raises the NCBI rate limit from 3 to 10 requests/second. The skill works without it, but a key is recommended if the user plans many queries or encounters a 429 error. The user can obtain one for free by registering at https://www.ncbi.nlm.nih.gov/account/settings/. If the variable is missing from
```
.env
```
, do NOT ask the user to paste it into the chat (this would leak the key into the agent's context). Instead, give the user this command — substituting
ENV_FILE
with the resolved literal path to the
.env
file:
bash
```
printf "Enter NCBI API key (typing hidden): " && read -s key && echo && echo "NCBI_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
```
The scripts load credentials automatically via
```
dotenv
```
. NEVER read, print, or inspect the
```
.env
```
file or its variables (e.g. no
```
cat
```
,
```
grep
```
,
```
echo
```
,
```
printenv
```
, or
```
os.environ.get
```
on keys). Credentials must stay out of the agent's context.

Core Rules

Use the Wrapper: ALWAYS execute the provided helper scripts to query the database rather than accessing the database directly. The scripts automatically enforce the required rate limit gracefully.
API Key Support: If the user provides an
```
NCBI_API_KEY
```
in their environment, the query speed limits are automatically increased significantly.
Notification: If this skill is used, ensure this is mentioned in the output.

Overview

Wraps NCBI's Entrez E-utilities (efetch, esearch, elink, esummary) for retrieving protein and nucleotide sequences. Provides 10 subcommands covering the full range of sequence retrieval workflows:

```
fetch-protein
```
— Direct protein accession lookup (GenPept, RefSeq)
```
fetch-nucleotide
```
— Direct nucleotide accession lookup
```
cds-translate
```
— Fetch CDS and translate to protein (3 methods)
```
search
```
— Free-text search of any NCBI database
```
elink
```
— Follow cross-database links (PubMed→Protein, etc.)
```
gene-protein
```
— Search protein by gene name + organism
```
locus-protein
```
— Search protein by locus tag + organism
```
pubmed-proteins
```
— Find proteins linked to a PubMed article
```
patent-search
```
— Extract protein sequences from patents
```
organism-length
```
— Last-resort search by organism + exact AA length

Utility Scripts

scripts/ncbi_fetch.py
— Single script with subcommands.

All subcommands write structured JSON output. Use

--output FILE

to save to a file, or omit it to print to stdout. A human-readable summary is always printed to stdout.

1. Fetch Protein by Accession

Fetches protein FASTA from NCBI by accession (XP_, NP_, GenPept, etc.)

bash

uv run scripts/ncbi_fetch.py fetch-protein XP_022033624 -o /tmp/result.json
uv run scripts/ncbi_fetch.py fetch-protein NP_001234567 ABC12345.1

2. Fetch Nucleotide by Accession

Fetches nucleotide FASTA from NCBI by accession.

bash

uv run scripts/ncbi_fetch.py fetch-nucleotide MK034466 -o /tmp/result.json

3. CDS Translate

Fetches a CDS/nucleotide accession and translates to protein sequence. Tries three approaches in order: 1. NCBI's pre-translated CDS protein (

fasta_cds_aa

) 2. GenBank XML CDS annotation translations 3. Raw nucleotide → 6-frame ORF finding

bash

uv run scripts/ncbi_fetch.py cds-translate MK034466 -o /tmp/result.json
uv run scripts/ncbi_fetch.py cds-translate HQ662330 --target-length 1043

If the accession is a genomic record (not mRNA/CDS), the tool will report

is_genomic: true

so you can fall back to a homology-based approach instead.

4. Search Any Database

Free-text search using Entrez query syntax. Supports all NCBI databases.

bash

# Search protein database
uv run scripts/ncbi_fetch.py search "WRR4B[Gene Name] AND Arabidopsis[Organism]" \
  --database protein --retmax 5 --fetch-sequences

# Search nucleotide database
uv run scripts/ncbi_fetch.py search "Rz2[Gene Name] AND Beta vulgaris[Organism]" \
  --database nuccore --retmax 10

# Search with patent filter
uv run scripts/ncbi_fetch.py search "disease resistance AND Solanum[Organism] AND patent[Properties]" \
  --database protein --fetch-sequences

# Search by sequence length
uv run scripts/ncbi_fetch.py search '"Oryza sativa"[Organism] AND 1043[SLEN]' \
  --database protein --fetch-sequences --retmax 50

5. Cross-Database Links (elink)

Follow NCBI's cross-database links (e.g., PubMed article → linked proteins).

bash

uv run scripts/ncbi_fetch.py elink 24896089 --dbfrom pubmed --db protein \
  --fetch-sequences -o /tmp/linked.json

6. Gene + Organism Search

Searches for protein sequences by gene name and organism. Searches NCBI Protein with

[Gene Name]

and

[Organism]

qualifiers.

bash

uv run scripts/ncbi_fetch.py gene-protein WRR4B --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py gene-protein Pikh-2 --organism "Oryza sativa" \
  --target-length 1043 -o /tmp/result.json

7. Locus Tag Search

Searches by locus tag in both NCBI Protein and Nuccore databases. Extracts CDS translations from GenBank XML when direct protein hits aren't available.

bash

uv run scripts/ncbi_fetch.py locus-protein At1g56540 --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py locus-protein Niben101Scf02422g02015.1 \
  --organism "Nicotiana benthamiana" -o /tmp/result.json

8. PubMed-Linked Proteins

Finds protein sequences linked to a PubMed article. Searches NCBI Protein by PMID, follows elink PubMed→Protein, and extracts CDS translations from linked Nuccore records.

bash

uv run scripts/ncbi_fetch.py pubmed-proteins 30692254 --identifier WRR4B
uv run scripts/ncbi_fetch.py pubmed-proteins 24896089 --identifier "K2" \
  -o /tmp/result.json

9. Patent Sequence Search

Two modes:

By patent number — fetches all protein sequences from a specific patent:

bash uv run scripts/ncbi_fetch.py patent-search --patent-number US10123456 -o /tmp/patent.json

By keywords — searches NCBI Protein with

patent[Properties]

filter:

bash uv run scripts/ncbi_fetch.py patent-search --keywords WRR4B Albugo --organism "Arabidopsis thaliana" -o /tmp/patent.json

[!IMPORTANT] Patent convention: In molecular biology patents, SEQ ID NO: 1 is typically the DNA sequence and SEQ ID NO: 2 is the primary protein. Higher SEQ ID NOs are variants or related sequences. Prefer Sequence 2 when selecting the primary protein of interest.

10. Organism + Length Search

Last-resort search when only organism and expected protein length are known. Uses NCBI's

[SLEN]

filter for exact length matching.

bash

uv run scripts/ncbi_fetch.py organism-length \
  --organism "Arabidopsis thaliana" --length 1048 --retmax 50 \
  -o /tmp/result.json

[!NOTE] This often returns multiple candidates. Use the JSON output headers to identify the correct protein.

Workflow

Standard Sequence Retrieval Cascade

When trying to find a protein sequence, follow this priority order:

Direct accession —
```
fetch-protein
```
with GenPept/RefSeq accession
CDS translation —
```
cds-translate
```
with nucleotide/CDS accession
PubMed-linked —
```
pubmed-proteins
```
with PMID + gene name
Locus lookup —
```
locus-protein
```
with locus tag + organism
Gene + organism —
```
gene-protein
```
with gene name + organism
Patent search —
```
patent-search
```
with patent number or keywords
Organism + length —
```
organism-length
```
as last resort

Interpreting Results

All subcommands return JSON with a
```
results
```
array
Each result has
```
sequence
```
(AA string),
```
length
```
, and
```
header
```
/metadata
When multiple results are returned, select by:
- Closest match to expected length (
```
target_length
```
  )
- Header relevance (matching gene name, "disease resistance" keywords)
- Source priority (RefSeq > GenPept > patent)

Reference

NCBI E-utilities docs: https://www.ncbi.nlm.nih.gov/books/NBK25499/
Entrez search syntax: https://www.ncbi.nlm.nih.gov/books/NBK49540/
Database list: protein, nuccore, gene, pubmed, pmc, biosample, etc.
Common accession formats:
- ```
XP_
```
  /
```
NP_
```
  — NCBI RefSeq protein
- ```
AAA
```
  to
```
AZZ
```
  + digits — GenPept (translated GenBank)
- ```
MK
```
  ,
```
MN
```
  ,
```
HQ
```
  , etc. + digits — GenBank nucleotide
- ```
ENSG
```
  ,
```
ENST
```
  ,
```
ENSP
```
  — Ensembl (use
```
ensembl-database
```
  skill instead)
- ```
Q
```
  ,
```
P
```
  ,
```
O
```
  + digits — UniProt (use
```
uniprot-database
```
  skill instead)

ncbi-sequence-fetch

NPX Install

Tags

SKILL.md Content

NCBI Sequence Fetch

Prerequisites

Core Rules

Overview

Utility Scripts

1. Fetch Protein by Accession

2. Fetch Nucleotide by Accession

3. CDS Translate

4. Search Any Database

5. Cross-Database Links (elink)

6. Gene + Organism Search

7. Locus Tag Search

8. PubMed-Linked Proteins

9. Patent Sequence Search

10. Organism + Length Search

Workflow

Standard Sequence Retrieval Cascade

Interpreting Results

Reference