ClinVar Database
Prerequisites
-
: Read the
skill and follow its Setup instructions to ensure
is installed and on PATH.
-
User Notification: If LICENSE_NOTIFICATION.txt does not already exist in
this skill directory then (1) prominently notify the user to check the terms
at
https://www.ncbi.nlm.nih.gov/clinvar/, then (2) create the file recording
the notification text and timestamp.
-
file: Make sure the
file exists in your home directory.
Create one if it does not exist.
-
(optional): Raises the NCBI rate limit from 3 to 10
requests/second. The skill works without it, but a key is recommended if the
user plans many queries or encounters a 429 error. The user can obtain one
for free by registering at
https://www.ncbi.nlm.nih.gov/account/settings/.
If the variable is missing from
, do NOT ask the user to paste it into
the chat (this would leak the key into the agent's context). Instead, give
the user this command —
substituting with the resolved literal
path to the file:
bash
printf "Enter NCBI API key (typing hidden): " && read -s key && echo && echo "NCBI_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
The scripts load credentials automatically via
.
NEVER read,
print, or inspect the
file or its variables (e.g. no
,
,
,
, or
on keys). Credentials must stay out
of the agent's context. See the
API Key section for more details.
Overview
ClinVar is the primary consensus record for clinical classifications of human
genomic variations. It provides the "clinical ground truth" for pathogenicity
labels (Pathogenic, Likely Pathogenic, Benign, VUS) based on assertions from
global laboratories.
When to Use
Use when you need to:
- Find the current clinical significance and star rating (review status) for a
specific variant.
- Fetch clinician notes, assertion criteria, or rationales for previous
clinical laboratory classifications.
- Retrieve the preferred condition name and associated HPO terms for a
specific variant.
- Find a list of variant controls (e.g., "Find all Pathogenic variants in the
HBB gene within 50bp of a signal").
- Check for conflicting interpretations for a given variant and identify the
organizations submitting each classification.
Do NOT use when you need to:
- Find specific allele frequencies in global populations (use gnomAD).
- Describe the normal biological role of a protein and typical inheritance
patterns (use OMIM).
- Predict mechanistic effects of novel mutations, like frameshifts or exon
skipping (use AlphaGenome).
- Find recommended surveillance schedules for patients with a pathogenic
variant (use GeneReviews).
- Generate or view 3D structural models of affected proteins (use PDB /
AlphaFold).
Quick Start
ClinVar queries are executed via a robust Python wrapper script to handle strict
rate limiting and XML/JSON parsing.
Example: Search for BRCA1 variants
bash
uv run scripts/clinvar_api.py search --query "BRCA1[gene]" --output results.json
Core Rules
- Retmax Constraint: The search command defaults to . For
any "List all" or gene-wide request, you MUST explicitly set
higher (e.g., 1000) to ensure data completeness.
- Use the Wrapper: Prefer the wrapper script for standard queries. It
handles rate limiting, retries, and the complex XML parsing for you. If the
script's parsed output does not contain the specific fields you need, you
may modify the script or query the NCBI E-utilities API directly — but be
aware that the raw XML schemas are complex and vary between record types.
- If the rate limit is hit, the script will throw a clear error. Follow the
prerequisite instructions above to help the user add to the
file.
- Notification: If this skill is used, ensure this is mentioned in the
output.
Utility Scripts
1. — Count Matching Variants
Purpose: Check how many variants match a query without fetching IDs. Use to
decide whether a full
is warranted.
Arguments:
- : (Required) NCBI Entrez search query string.
- : (Required) Output JSON file path.
Example: uv run scripts/clinvar_api.py count \ --query "TP53[gene] AND \"uncertain significance\"[clinsig]" \ --output count.json
Output:
2. — Search Variants
Purpose: Identify variants based on genomic location, gene symbols, or
clinical attributes using NCBI Entrez search syntax. The search command
automatically paginates through all matching results to ensure complete,
deterministic retrieval.
bash
# Fetch ALL matching variants (default behavior)
uv run scripts/clinvar_api.py search \
--query "BRCA1[gene]" --output results.json
# Search by Chromosome and Position Range
uv run scripts/clinvar_api.py search \
--query "11[chr] AND 5225000:5226000[chrpos]" --output results.json
# Combine terms using Entrez syntax
uv run scripts/clinvar_api.py search \
--query "HBB[gene] AND pathogenic[clinsig]" --output results.json
# Cap results at 50
uv run scripts/clinvar_api.py search \
--query "TP53[gene]" --retmax 50 --output results.json
Arguments:
- : (Required) NCBI Entrez search query string.
- : Maximum total number of variant IDs to return. Default is 0,
which means "fetch all matching results." Set to a positive integer to cap
the result set.
- : Number of IDs to fetch per API request (default: 500, max:
10000 per NCBI limits).
- : (Required) Output JSON file path.
Output: A JSON object containing:
- — Total number of matching variants in ClinVar.
- — Number of IDs actually retrieved.
- — List of ClinVar Variation ID strings.
3. — Get Interpretation Summary
Purpose: Retrieve top-line clinical significance labels, star ratings
(review status), and basic phenotype data for rapid variant screening.
bash
# Get summary for one or more Variation IDs
uv run scripts/clinvar_api.py summary \
--variant_ids 12345 67890 --output summary.json
Arguments:
- : (Required) One or more ClinVar Variation IDs.
- : (Required) Output JSON file path.
Output: A JSON list of summary objects, each containing:
- , , , ,
,
- — list of
{gene_id, symbol, strand}
- — e.g., single nucleotide variant, Deletion, Insertion
- — list of strings (e.g., ["missense variant",
"nonsense"])
4. — Get Clinical Evidence
Purpose: Fetch the full clinical record for a single variant, including
free-text clinician rationales, assertion methods, and specific submitter notes.
bash
# Get full evidence for a single Variation ID
uv run scripts/clinvar_api.py evidence \
--variant_id 12345 --output evidence.json
Arguments:
- : (Required) A single ClinVar Variation ID.
- : (Required) Output JSON file path.
Output: A JSON object containing:
- —
{chromosome, position_start, position_stop, reference_allele, alternate_allele, cytogenetic_band, dbsnp_rsid}
(GRCh38
preferred)
- — list of
{name, medgen_cui, omim_id, orphanet_id, hpo_terms}
- — list of
{value, sequence_ontology_id}
structural_variant_details
— {outer_start, inner_start, inner_stop, outer_stop, copy_number}
(present only for CNVs, otherwise null)
- — list of PubMed IDs cited in the global "Citations"
section
- — list of per-submitter records, each containing:
- , , ,
- — when the submitter last reviewed the
classification
Typical Workflows
Count-First Workflow (Recommended)
For large or unknown result sets, use
first to decide whether to
proceed, then
(which auto-paginates and returns
/
), then
to screen.
bash
# Step 1: Gauge size (optional — search also returns total_count)
uv run scripts/clinvar_api.py count \
--query "HBB[gene] AND pathogenic[clinsig]" --output count.json
# Step 2: Fetch all variant IDs (auto-paginates)
uv run scripts/clinvar_api.py search \
--query "HBB[gene] AND pathogenic[clinsig]" --output ids.json
# Step 3: Get summaries (extract variant_ids from search output)
uv run scripts/clinvar_api.py summary \
--variant_ids 12345 67890 --output summary.json
Deep Dive: search → evidence
When you need the full clinical picture for a specific variant — including
submitter rationales, PubMed citations, ontology-linked conditions, and allele
coordinates — use
.
bash
uv run scripts/clinvar_api.py evidence \
--variant_id 12345 --output evidence.json
Workflow: Robust Variant Discovery (Triangulation)
ClinVar metadata is inconsistent. To fulfill "List all" requests, do not rely on
a single filter. Perform the following in a single turn and merge results:
- Search by exact label (e.g.,
"3 prime UTR variant"[molecular_consequence]
).
- Search by HGVS nomenclature pattern (e.g., ).
- Search by genomic coordinate range (using ).
This "triangulation" ensures structural variants with missing labels are not
overlooked.
Verifying Coding vs. Non-Coding Status via HGVS
alone can be ambiguous (e.g.,
appears in both coding and non-coding contexts). Always cross-check the
field for HGVS patterns:
- — 5' UTR (non-coding)
- — 3' UTR (non-coding)
- / — intronic (non-coding)
- etc. — protein effect (coding)
A variant with UTR/intronic HGVS and no
annotation is non-coding, even with
splicing labels. Conversely, any
annotation indicates a coding effect.
ClinVar Metadata Reference
- 3' UTR
- Search String:
"3 prime UTR variant"[mol_consequence]
- HGVS:
- 5' UTR
- Search String:
"5 prime UTR variant"[mol_consequence]
- HGVS:
- To find "high-confidence" variants or expert-reviewed consensus, use the
filter. This is the most efficient way to distinguish
between single-laboratory assertions and panel-reviewed ground truth.
When to Use Which Fields
- Quick pathogenicity label — Use →
- Gene symbol and strand — Use →
- Variant type (SNV, del, etc.) — Use →
- Protein-level effect — Use →
- Genomic coordinates (GRCh38) — Use →
- Linked conditions (ontology) — Use →
- SO functional consequence — Use →
- CNV breakpoints/copy number — Use →
structural_variant_details
- PubMed references — Use →
- Date of last lab review — Use both →
- Clinician rationales — Use →
submissions[].curator_notes
Retrieving Genomic Coordinates (Default HG38/GRCh38)
To get precise genomic coordinates in the format
<chrom>:<pos>:<ref>><alt>
(e.g.,
), you must use the
command, as these
details are not available in the
output.
You MUST always include genomic coordinates in the format
<chrom>:<pos>:<ref>><alt>
when listing or presenting variants, even if not
explicitly requested by the user. If coordinates are missing from the summary,
use the command or dbSNP fallback to retrieve them.
- Fetch Evidence: Use
uv run scripts/clinvar_api.py evidence --variant_id <ID> --output evidence.json
.
- Extract VCF Attributes: The command parses the XML. Extract:
- Chromosome:
- Position: (or )
- Ref: (or )
- Alt: (or ) from the
element with .
Fallback for Imprecise Coordinates (Gene Range): ClinVar often returns the
full gene range for non-coding variants. If the extracted coordinates correspond
to the gene range instead of a specific position, use the
skill
to resolve the precise coordinates using the
or HGVS title: 1.Check
for
in the
output. 2. Run
uv run scripts/dbsnp_cli.py resolve-rsid {rsid}
to get precise GRCh38 coordinates. 3. Format as
<chrom>:<pos>:<ref>><alt>
using the SPDI or HGVS data from dbSNP.
Structural Variant Note
The
structural_variant_details
field is
only populated for copy number
variants (CNVs). For standard SNVs and small indels this field will be
.
Use the
fields (
,
,
,
) instead.
CNV / Large Deletion Note
Large copy-number variants (CNVs) frequently have empty
. If a variant title mentions "del" and coordinates
overlap your target region, it is relevant regardless of missing labels.
Obtaining and Using an API Key
To increase the rate limit to 10 requests per second, you need to obtain an NCBI
API key and add it to the
file. You can obtain a key by following the
instructions at
NCBI ClinVar API docs
Once you have a key, follow the prerequisite instructions to add it to the
file.
bash
uv run scripts/clinvar_api.py search --query "BRCA1[gene]" --output results.json
If a
is encountered, follow the prerequisite instructions to
help the user add
to the
file, providing the
NCBI ClinVar API docs URL for instructions on how to obtain one.
Best Practices
- Always use to execute .
- If is unavailable pivot immediately to using Python one-liners for
processing JSON (e.g.,
uv run python3 -c "import json; ..."
).
- Use before to understand the result set size.
- The command fetches all results by default and includes
and in the output — always verify these match
to confirm complete retrieval.
- Entrez results are unsorted. To order by date, fetch all results and
sort locally by .
Common Mistakes
- Attempting to parse the E-utilities XML yourself — Always use the
provided client which handles the unpredictable XML schemas
robustly.
- Getting HTTP 429 Too Many Requests — The client throws an exception
telling you to pause. Follow the prerequisite instructions to help the user
add to the file, then retry.
- Sending raw DNA sequences to the API — The API expects HGVS
nomenclature, RS IDs, or proper Entrez coordinate syntax (), not raw ATCG strings.
- For synonymous or non-coding variants — HGVS nomenclature (e.g., CAPN3
AND "c.551C>T") is more reliable than coordinate searches ([chrpos]), as
many ClinVar records for these types lack precise genomic mappings.
- Case sensitivity in molecular consequences — ClinVar returns mixed-case
strings. Always use case-insensitive matching () when filtering.
- Parsing output as a bare list — returns a JSON object
with , , and — not a bare list.