dbSNP Database Integration
Prerequisites
- uv: Read the relevant setup skill and follow its Setup instructions to ensure
uv is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in
this skill directory, then (1) prominently notify the user to check the terms
at https://www.ncbi.nlm.nih.gov/snp/, and (2) create the file, recording the
notification text and a timestamp.
- Environment file: Make sure the environment file that stores credentials
exists in your home directory. Create one if it does not exist.
- NCBI_API_KEY (optional): Raises the NCBI rate limit from 3 to 10
requests/second. The skill works without it, but a key is recommended if the
user plans many queries or encounters a 429 error. The user can obtain one
for free by registering at
https://www.ncbi.nlm.nih.gov/account/settings/.
If the variable is missing from the environment file, do NOT ask the user to
paste it into the chat (this would leak the key into the agent's context).
Instead, give the user this command, substituting ENV_FILE with the resolved
literal path to the file:
bash
printf "Enter NCBI API key (typing hidden): " && read -s key && echo && echo "NCBI_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
The scripts load credentials automatically. NEVER read, print, or inspect the
environment file or its variables (for example, never run a command that would
display a key). Credentials must stay out of the agent's context. See the
NCBI API Key and Rate Limiting section for more details.
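The User Notification prerequisite above can be sketched as a small helper. This is only an illustration: the function name, the notice wording, and the file layout (timestamp on the first line) are assumptions, not something the skill prescribes.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Hypothetical notice wording; adapt it to whatever is actually shown to the user.
NOTICE = (
    "This skill queries NCBI dbSNP. Please check the terms at "
    "https://www.ncbi.nlm.nih.gov/snp/ before relying on the results."
)


def ensure_license_notification(skill_dir: Path) -> Optional[str]:
    """Return the notice to display, or None if it was already recorded."""
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return None
    # Record the notification text together with a UTC timestamp.
    stamp = datetime.now(timezone.utc).isoformat()
    marker.write_text(f"{stamp}\n{NOTICE}\n", encoding="utf-8")
    return NOTICE
```

A caller would show the returned text prominently when it is not None and do nothing otherwise.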
Core Rules
- Use the Wrapper: ALWAYS execute the provided wrapper script
scripts/dbsnp_cli.py to query the database rather than constructing custom
HTTP or curl requests. The script automatically handles rate limiting,
retries, and JSON parsing.
- Command Choice: Do NOT use search-region to find the rsID of a
specific variant; use resolve-variant instead.
- Output Size: Avoid requesting the complete raw payload from get-variant
unless specifically needed, as raw payloads can exceed 1 MB.
- Shell Safety: Always wrap HGVS strings in single quotes to prevent shell
expansion errors.
- Notification: If this skill is used, ensure this is mentioned in the
output.
When to Use
Use this skill when you need to:
- Map a genomic variant to its canonical rsID (from VCF coordinates or HGVS
notation).
- Retrieve summary data for an rsID: variant type, gene associations, clinical
significance, and population allele frequencies.
- Convert an rsID back to genomic coordinates on a specific assembly.
- Find all known variants within a chromosomal region.
Do NOT use when you need to:
- Obtain clinical pathogenicity classifications with submitter rationales (use
clinvar-database).
- Get precise population-level allele frequencies stratified by ancestry (use
gnomad-database).
- Predict the functional effect of a novel mutation (use
alphagenome-single-variant-analysis).
- View 3D protein structures affected by a variant (use
alphafold-database-fetch-and-analyze / pdb-database).
Command Selection Guide
Pick the right command on the first try. Match the user's input to the
correct subcommand below — one command call is almost always sufficient.
- User gives you…: Run this command
- An rsID (e.g. rs7412, 268): get-variant
- Genomic coordinates: chrom pos ref alt (e.g. 8 19962213 C T): resolve-variant
- An HGVS string (e.g. NC_000008.11:g.19962213del): resolve-hgvs
- An rsID and they want coordinates back: resolve-rsid
- A chromosomal region (chrom start end): search-region
[!CAUTION]
Do NOT use search-region to find the rsID of a specific
variant. If the user provides a chromosome, position, reference allele, and
alternate allele (four values), use resolve-variant; it is a direct,
single-API-call lookup. search-region is only for surveying all variants
within a positional range and can return hundreds or thousands of results.
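The selection guide above can be read as a simple dispatch on the shape of the input. The following is a rough sketch under that reading; `choose_subcommand` and its token-counting heuristics are hypothetical and not part of the CLI.

```python
import re

# An rsID is accepted with or without the "rs" prefix (e.g. rs268 or 268).
RSID_PATTERN = re.compile(r"^(rs)?\d+$", re.IGNORECASE)


def choose_subcommand(tokens, want_coordinates=False):
    """Map whitespace-separated user input to a dbsnp_cli.py subcommand."""
    if len(tokens) == 1:
        value = tokens[0]
        if RSID_PATTERN.match(value):
            # Same input, different intent: details vs. coordinates.
            return "resolve-rsid" if want_coordinates else "get-variant"
        if ":" in value:
            # HGVS expressions look like 'accession:description'.
            return "resolve-hgvs"
    elif len(tokens) == 4:
        # chrom pos ref alt
        return "resolve-variant"
    elif len(tokens) == 3:
        # chrom start end
        return "search-region"
    raise ValueError(f"cannot infer a subcommand from {tokens!r}")
```

Four values always map to resolve-variant here, which mirrors the caution above about not using search-region for a single variant.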
Quick Start
bash
# Look up variant rs7412: type, gene, clinical significance, MAF
uv run scripts/dbsnp_cli.py get-variant rs7412 --output /tmp/rs7412.json
# Find the rsID for a variant at chr8:19962213 C>T
uv run scripts/dbsnp_cli.py resolve-variant 8 19962213 C T \
--output /tmp/resolve.json
All subcommands write JSON to disk. Always save output in the /tmp
directory. The --output flag is required.
Commands
1. get-variant: Fetch Variant Record
Retrieve the RefSNP record for one rsID. By default the output is abbreviated to
the most useful fields. Both rs268 and 268 are accepted.
bash
uv run scripts/dbsnp_cli.py get-variant rs268 --output /tmp/rs268.json
uv run scripts/dbsnp_cli.py get-variant 268 --assembly GCF_000001405.40 \
--output /tmp/rs268.json
Arguments:
- rsID (positional, required): The RefSNP identifier.
- --assembly: RefSeq assembly accession (default: GCF_000001405.40 =
GRCh38).
- Raw-output flag: Return the complete raw JSON payload (see warning below).
- --output: Output file path.
Abbreviated output fields:
- Numeric rsID
- Variant type
- Genes: sorted list of gene symbols (locus names)
- Clinical significance: list of clinical significance labels
- Allele frequencies: study name, allele count, total count
- Placements: genomic placements for the requested assembly
[!WARNING]
About the raw-output flag: The raw RefSNP payload is typically 50–500 KB
and can exceed 1 MB for clinically significant variants with many submissions.
Only use it when you specifically need data absent from the abbreviated
output, for example:
- The complete HGVS nomenclature across every transcript and protein
isoform.
- Full submission history with individual submitter details and timestamps.
- Population-level allele frequency breakdowns by sub-population within a
study (e.g. per-population gnomAD counts).
- The full set of genomic placements across multiple assemblies (GRCh37 and
GRCh38 simultaneously).
- Merge history showing which older rsIDs were merged into this one.
2. resolve-variant: Genomic Coordinates → rsID
Determine the rsID(s) for a variant given its genomic coordinates (chromosome,
position, reference allele, alternate allele).
This is the command to use when the user provides a variant as
space-separated coordinates like 8 19962213 C T.
bash
uv run scripts/dbsnp_cli.py resolve-variant 8 19962213 C T \
--output /tmp/resolve.json
Arguments:
- chrom (positional): Chromosome number (e.g. 8) or RefSeq sequence
accession (e.g. NC_000008.11). Chromosomes X and Y must be passed as
their numeric equivalents, not as letters.
- pos (positional): 1-based genomic position.
- ref (positional): Reference allele (e.g. C).
- alt (positional): Alternate allele(s), comma-separated.
- --assembly: RefSeq assembly accession.
- --output: Output file path.
Output: {"rsids": ["12345", "67890"]}
3. resolve-rsid: rsID → Genomic Coordinates
Get the genomic placement (sequence ID and allele details) for a known rsID on a
specific assembly.
bash
uv run scripts/dbsnp_cli.py resolve-rsid rs7412 --output /tmp/coords.json
Arguments:
- rsID (positional): The RefSNP identifier.
- --assembly: RefSeq assembly accession.
- --output: Output file path.
Output: {"rsid": "7412", "assembly": "...", "placements": [...]}
4. resolve-hgvs: HGVS → rsID
Find the rsID(s) corresponding to an HGVS expression.
bash
uv run scripts/dbsnp_cli.py resolve-hgvs 'NC_000008.11:g.19962213del' \
--output /tmp/hgvs.json
Arguments:
- HGVS expression (positional): The HGVS string.
- --assembly: RefSeq assembly accession.
- --output: Output file path.
[!TIP] HGVS strings often contain characters that shells interpret, most
notably the greater-than sign in substitutions such as T>C, which the shell
treats as output redirection. Always wrap them in single quotes so the string
reaches the script unchanged.
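When the command line is assembled programmatically, the quoting advice above can be delegated to Python's standard `shlex` module. This is a minimal sketch; `build_hgvs_command` is a hypothetical helper, and only the CLI path and flags shown in this document are used.

```python
import shlex


def build_hgvs_command(hgvs, output_path):
    """Return (argv, shell_string) for a resolve-hgvs call."""
    argv = [
        "uv", "run", "scripts/dbsnp_cli.py",
        "resolve-hgvs", hgvs,
        "--output", output_path,
    ]
    # shlex.join single-quotes any argument containing shell-special
    # characters, such as the '>' in a substitution.
    return argv, shlex.join(argv)


argv, shell_string = build_hgvs_command(
    "NC_000019.10:g.44908684T>C", "/tmp/hgvs.json"
)
print(shell_string)
```

Passing `argv` as a list to `subprocess.run` avoids the shell entirely; the quoted string is only needed when the command is shown to a user or pasted into a terminal.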
5. search-region: Regional Variant Search
Find all rsIDs within a bounded chromosomal region.
bash
uv run scripts/dbsnp_cli.py search-region 7 117100000 117300000 \
--output /tmp/region.json
Arguments:
- chrom (positional): Chromosome (e.g. 7). Use the numeric equivalents for
chromosomes X and Y.
- start (positional): Start position.
- end (positional): End position.
- --retmax: Maximum rsIDs to return (default: 500, ceiling: 5000).
- --output: Output file path.
Output:
json
{
"rsids": ["12345", "67890", "..."],
"returned": 500,
"total_available": 1423,
"truncated": true,
"note": "Only 500 of 1423 variants returned. Increase --retmax ..."
}
When total_available exceeds the returned count, the output includes a
truncated flag and a note. Increase --retmax to retrieve more (up to 5000).
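A short sketch of how the truncation fields might be checked after a search-region call. It relies only on the output keys shown above and on the documented ceiling of 5000; the helper name `suggest_retmax` is hypothetical.

```python
RETMAX_CEILING = 5000  # documented upper bound for --retmax


def suggest_retmax(result):
    """Return a larger --retmax worth retrying with, or None."""
    if not result.get("truncated"):
        return None
    returned = result["returned"]
    # Ask for everything available, but never more than the ceiling.
    target = min(result["total_available"], RETMAX_CEILING)
    return target if target > returned else None


example = {
    "rsids": ["12345", "67890"],
    "returned": 500,
    "total_available": 1423,
    "truncated": True,
}
print(suggest_retmax(example))
```

If the function returns None while the result is still truncated, the region holds more variants than one call can return, and narrowing the region is the remaining option.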
Typical Workflows
Identify a known variant from coordinates
bash
# Step 1: Map VCF coordinates to rsID
uv run scripts/dbsnp_cli.py resolve-variant 19 44908684 T C \
--output /tmp/step1.json
# Step 2: Get the full details for the resolved rsID
uv run scripts/dbsnp_cli.py get-variant <rsid_from_step1> \
--output /tmp/step2.json
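The hand-off between the two steps can be sketched as follows, assuming step 1 produced the {"rsids": [...]} shape documented for resolve-variant. The helper name and the per-rsID output file naming are illustrative assumptions.

```python
import json
from pathlib import Path

CLI = ["uv", "run", "scripts/dbsnp_cli.py"]


def get_variant_commands(step1_path, out_dir="/tmp"):
    """Build one get-variant argv per rsID found in the step 1 output."""
    rsids = json.loads(Path(step1_path).read_text(encoding="utf-8"))["rsids"]
    return [
        CLI + ["get-variant", rsid, "--output", f"{out_dir}/rs{rsid}.json"]
        for rsid in rsids
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. An empty list means step 1 found no rsID, so there is nothing to fetch.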
Survey variants in a gene region
bash
# Step 1: Find all variants in a region spanning the CFTR gene
uv run scripts/dbsnp_cli.py search-region 7 117100000 117300000 \
--retmax 1000 --output /tmp/region.json
# Step 2: Retrieve details on individual rsIDs of interest
uv run scripts/dbsnp_cli.py get-variant <rsid> --output /tmp/detail.json
Translate HGVS notation to genomic coordinates
bash
# Step 1: Get the rsID for an HGVS expression
uv run scripts/dbsnp_cli.py resolve-hgvs 'NC_000019.10:g.44908684T>C' \
--output /tmp/hgvs.json
# Step 2: Resolve that rsID to VCF-style coordinates
uv run scripts/dbsnp_cli.py resolve-rsid <rsid> --output /tmp/coords.json
Assembly Defaults and Automatic Fallback
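The Variation Services endpoints (used by get-variant, resolve-variant,
resolve-rsid, and resolve-hgvs) expect a RefSeq assembly accession. The
RefSeq accession for GRCh38 is GCF_000001405.40, and for GRCh37 it is
GCF_000001405.25.
The search-region subcommand always searches GRCh38 positions.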
[!IMPORTANT]
Automatic assembly fallback: The resolve-variant and resolve-hgvs
commands automatically try GRCh38 first. If no rsIDs are found,
they retry with GRCh37 before reporting failure. When a fallback occurs the
output JSON includes a field explaining which assembly succeeded.
You do NOT need to manually retry with a different assembly; the script
handles this transparently.
You only need to override --assembly when you specifically want to
restrict the lookup to one assembly (e.g. because the user's coordinates are
known to be GRCh37).
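To make the fallback behaviour concrete, here is a rough sketch of the logic as described above. It is not the script's actual code: `resolve_with_fallback` and the injected `lookup` callable are hypothetical stand-ins for whatever performs the real request.

```python
from typing import Callable, List, Optional, Tuple

GRCH38 = "GCF_000001405.40"
GRCH37 = "GCF_000001405.25"


def resolve_with_fallback(
    lookup: Callable[[str], List[str]],
) -> Tuple[List[str], Optional[str]]:
    """Try GRCh38, then GRCh37; return (rsids, assembly_that_succeeded)."""
    for assembly in (GRCH38, GRCH37):
        rsids = lookup(assembly)
        if rsids:
            return rsids, assembly
    # Neither assembly produced an rsID.
    return [], None
```

Because the wrapper already behaves this way, a caller only needs the second element of the result (or the corresponding field in the output JSON) to learn which assembly matched.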
NCBI API Key and Rate Limiting
Without an API key the script is limited to 3 requests per second. With a
key this increases to 10 requests per second.
bash
uv run scripts/dbsnp_cli.py get-variant rs268 --output out.json
If a rate-limit (HTTP 429) error is raised, pause execution and follow the
Prerequisites instructions to help the user add NCBI_API_KEY to the
environment file.
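The wrapper already enforces these limits, so nothing below is required to use the skill. Purely as an illustration of what 3 versus 10 requests per second means for request spacing, here is a minimal throttle sketch; the `Throttle` class is hypothetical, and the boolean is supplied by the caller so that no credential is ever read.

```python
import time


class Throttle:
    """Space calls so they never exceed the documented request rate."""

    def __init__(self, has_api_key, clock=time.monotonic, sleep=time.sleep):
        # 10 requests/second with a key, 3 requests/second without.
        self.interval = 1.0 / (10 if has_api_key else 3)
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        start = max(now, self._next_allowed)
        if start > now:
            self._sleep(start - now)
        self._next_allowed = start + self.interval
```

Calling `wait()` before each request yields roughly 333 ms between requests without a key and 100 ms with one.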
Troubleshooting HTTP 500 Errors
Reference Allele Mismatch
If you receive an HTTP 500 error with a message detailing that the asserted
reference allele is not equal to the reference sequence:
What it means: The coordinate position is likely valid, but the reference
allele (ref) you provided does not match the base at that position in the
requested assembly.
Action:
1. DO NOT RETRY the exact same query mechanically.
2. Check the assembly: Coordinates are assembly-specific.
3. Switch assembly: If you were querying GRCh37, try GRCh38 (using
--assembly GCF_000001405.40), or if querying GRCh38, try GRCh37 (using
--assembly GCF_000001405.25).
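The "switch assembly" action can be sketched as a helper that rebuilds the resolve-variant call against the other assembly. The function names are hypothetical; the accessions and flags are the ones given in this document.

```python
GRCH38 = "GCF_000001405.40"
GRCH37 = "GCF_000001405.25"


def alternate_assembly(accession):
    """Return the other supported assembly accession."""
    if accession == GRCH38:
        return GRCH37
    if accession == GRCH37:
        return GRCH38
    raise ValueError(f"unknown assembly accession: {accession}")


def retry_on_other_assembly(chrom, pos, ref, alt, failed_assembly, output_path):
    """Build a resolve-variant argv pinned to the assembly not yet tried."""
    return [
        "uv", "run", "scripts/dbsnp_cli.py", "resolve-variant",
        str(chrom), str(pos), ref, alt,
        "--assembly", alternate_assembly(failed_assembly),
        "--output", output_path,
    ]
```

This changes the assembly rather than repeating the identical query, which is the point of step 1 in the action list.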
Common Mistakes
- Mistake: Forgetting to quote HGVS strings
  Fix: Wrap in single quotes: 'NC_000008.11:g.19962213del'
- Mistake: Passing a chromosome name to resolve-variant instead of a
  sequence accession
  Fix: Use the numeric chromosome ID (e.g. 8) or a RefSeq accession like
  NC_000008.11
- Mistake: Requesting the complete raw payload from get-variant without
  needing it
  Fix: The abbreviated output covers most use cases; the raw payload is
  50–500 KB or more of JSON
- Mistake: Expecting search-region to return all results by default
  Fix: The default --retmax is 500; check truncated in the output to see if
  results were truncated
- Mistake: Using GRCh37 coordinates with search-region
  Fix: search-region always uses GRCh38 positions; lift over coordinates
  first if starting from GRCh37
- Mistake: Manually retrying resolve-variant or resolve-hgvs with a
  different --assembly when the first call fails
  Fix: The script automatically tries GRCh38 then GRCh37; a single call is
  sufficient
- Mistake: Passing X or Y as the chromosome value
  Fix: Use the numeric equivalents for chromosomes X and Y. The CLI treats
  chromosomes numerically by default.