ENCODE Database Skill
This skill allows you to query the ENCODE Registry of cCREs (candidate
cis-Regulatory Elements) via the SCREEN GraphQL API. It helps identify
functional non-coding DNA elements (like Promoters, Enhancers, and insulators)
by analyzing biochemical signatures (DNase, H3K4me3, H3K27ac, CTCF).
Prerequisites
- : Read the skill and follow its Setup instructions to ensure
is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in
this skill directory then (1) prominently notify the user to check the terms
at https://www.encodeproject.org/help/rest-api/, then (2) create the file
recording the notification text and timestamp.
Core Rules
- Use the Wrapper: ALWAYS execute the provided helper scripts to query the
database rather than accessing the database directly. The scripts
automatically enforce the required rate limit gracefully.
- Parsing Output: Do NOT use to read the entire JSON output file
into context, as it can be extremely large. You MUST use to efficiently
parse and extract relevant fields.
- Notification: If this skill is used, ensure this is mentioned in the
output.
Quick Start
bash
# Search cCREs by coordinates
uv run scripts/screen_api.py search --chromosome chr11 \
--start 5205263 --end 5207263 \
--output /tmp/search.json
# Get details for a specific cCRE
uv run scripts/screen_api.py details EH38E2941922 \
--output /tmp/details.json
All subcommands write JSON to disk. Always save output in a temporary location
like
.
Identifying High-Confidence ("Type A") Biosamples
Biosamples in ENCODE are often categorized by their data completeness. "Type
A" (or high-confidence) biosamples are those that have experimental data for
all four core epigenetic markers: DNase, H3K4me3, H3K27ac, and CTCF.
The
and
commands automatically enrich their output with
an
boolean flag for each biosample.
Example: Finding high-confidence cell types
bash
uv run scripts/screen_api.py biosamples --output /tmp/biosamples.json
# Use jq to filter for Type A biosamples
jq '.data.ccREBiosampleQuery.biosamples[] | select(.is_type_a == true) | .displayname' /tmp/biosamples.json
Parsing Output (CRITICAL)
Do NOT use to read the entire JSON output file into context, as it
can be extremely large. Instead, you MUST use
to efficiently parse and
extract the relevant fields from the JSON file saved by the script. If
is
not available on the system, write your own Python filtering code (e.g.,
python3 -c "import json..."
) to extract the necessary data.
For a complete reference of the JSON structure returned by eachmcommand (so you
know which fields to query with
), read
references/json_output_structure.md
.
Available Commands
-
: Search cCREs by coordinates, accessions, or epigenetic signals.
bash
uv run scripts/screen_api.py search \
--chromosome chr11 --start 5205263 --end 5207263 \
--output /tmp/search.json
-
: Find nearby genes for given cCRE accessions.
bash
uv run scripts/screen_api.py nearby-genes \
EH38E1516972 --output /tmp/nearby.json
-
: Get detailed information and biosample-specific max Z-scores for
a specific cCRE.
bash
uv run scripts/screen_api.py details EH38E2941922 \
--output /tmp/details.json
-
: Get biosample metadata for an assembly.
bash
uv run scripts/screen_api.py biosamples \
--output /tmp/biosamples.json
-
: Get orthologous cCREs in another assembly.
bash
uv run scripts/screen_api.py orthologs EH38E2941922 \
--output /tmp/orthologs.json
-
: Find linked genes via methods like HiC or eQTLs.
bash
uv run scripts/screen_api.py linked-genes \
EH38E1516972 --output /tmp/linked.json
-
: Get gene expression (TPM) across all biosamples for a
named gene. Internally resolves the gene symbol to an Ensembl gene ID, then
queries per-biosample RNA-seq quantifications.
bash
uv run scripts/screen_api.py gene-expression GAPDH \
--output /tmp/gene_expr.json
-
: Get ENTEx data for a cCRE or genomic region.
bash
uv run scripts/screen_api.py entex \
--accession EH38E1310345 \
--output /tmp/entex.json
bash
uv run scripts/screen_api.py entex \
--region chr1:1000068:1000409 \
--output /tmp/entex.json
-
: Query genome-wide association studies, SNPs, or enrichment data.
bash
uv run scripts/screen_api.py gwas studies \
--output /tmp/gwas.json
bash
uv run scripts/screen_api.py gwas snps --study \
Ahola-Olli_AV-27989323-Eotaxin_levels \
--output /tmp/gwas_snps.json
You can supply the
or
flag to explicitly
request a specific assembly for most commands. By default, the script targets
but will automatically fall back to
if no results are found or
if the query fails.
ENCODE Portal REST API (Direct Access)
For accessing raw experiments, ChIP-seq peaks, or other datasets that are not
represented as cCREs in SCREEN, use the
scripts/encode_portal_api.py
script.
It allows custom queries to the ENCODE Portal REST API.
Usage
bash
uv run scripts/encode_portal_api.py search "type=Experiment&target.label=ZNF549" --output /tmp/znf549_experiments.json
Data Analysis Tips
When analyzing
or
files downloaded from ENCODE, standard
bioinformatics tools are highly recommended for finding overlaps (e.g., between
gene promoters and peaks):
- : For fast mathematical operations on genomic intervals.
- : For converting binary BigBed files to readable BED
format.
- : A Python wrapper for .
Write custom logic if these tools are not pre-installed.
Custom Queries (SCREEN GraphQL)
If you need to make a complex GraphQL query that the script does not support,
read
references/graphql_schema.md
for a reference of available queries,
arguments, and return fields in the SCREEN GraphQL API.