encode-ccres-database

ENCODE Database Skill

This skill allows you to query the ENCODE Registry of cCREs (candidate cis-Regulatory Elements) via the SCREEN GraphQL API. It helps identify functional non-coding DNA elements (such as promoters, enhancers, and insulators) by analyzing biochemical signatures (DNase, H3K4me3, H3K27ac, CTCF).

Prerequisites

  1. uv: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://www.encodeproject.org/help/rest-api/, then (2) create the file recording the notification text and timestamp.
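The notify-once check in step 2 can be sketched in Python as follows. This is a minimal illustration, not part of the skill itself: the function name `ensure_license_notification`, the notice wording, and the file layout (timestamp on the first line, notice text on the second) are assumptions; only the file name and the URL come from the prerequisite above.

```python
from datetime import datetime, timezone
from pathlib import Path

# URL taken from the prerequisite above.
TERMS_URL = "https://www.encodeproject.org/help/rest-api/"


def ensure_license_notification(skill_dir: Path) -> bool:
    """Notify once per skill directory; return True if a notice was shown.

    Hypothetical helper: the name and the file layout are assumptions.
    """
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # the user was already notified earlier

    notice = f"NOTICE: Please check the ENCODE terms of use at {TERMS_URL}"
    print(notice)  # step (1): prominent notification

    # Step (2): record the notification text and a UTC timestamp.
    timestamp = datetime.now(timezone.utc).isoformat()
    marker.write_text(f"{timestamp}\n{notice}\n", encoding="utf-8")
    return True
```

Calling `ensure_license_notification(Path("."))` from the skill directory prints the notice on the first run and does nothing on later runs.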

Core Rules

  • Use the Wrapper: ALWAYS execute the provided helper scripts to query the database rather than accessing it directly. The scripts automatically enforce the required rate limit.
  • Parsing Output: Do NOT use `cat` to read the entire JSON output file into context, as it can be extremely large. You MUST use `jq` to parse the file and extract only the relevant fields.
  • Notification: If this skill is used, ensure this is mentioned in the output.

Quick Start

bash
# Search cCREs by coordinates
uv run scripts/screen_api.py search --chromosome chr11 \
    --start 5205263 --end 5207263 \
    --output /tmp/search.json

# Get details for a specific cCRE
uv run scripts/screen_api.py details EH38E2941922 \
    --output /tmp/details.json

All subcommands write JSON to disk. Always save output in a temporary location like `/tmp/`.

Identifying High-Confidence ("Type A") Biosamples

Biosamples in ENCODE are often categorized by their data completeness. "Type A" (or high-confidence) biosamples are those that have experimental data for all four core epigenetic markers: DNase, H3K4me3, H3K27ac, and CTCF.
The `biosamples` and `details` commands automatically enrich their output with an `is_type_a` boolean flag for each biosample.
Example: Finding high-confidence cell types
bash
uv run scripts/screen_api.py biosamples --output /tmp/biosamples.json

# Use jq to filter for Type A biosamples
jq '.data.ccREBiosampleQuery.biosamples[] | select(.is_type_a == true) | .displayname' /tmp/biosamples.json

Parsing Output (CRITICAL)

Do NOT use `cat` to read the entire JSON output file into context, as it can be extremely large. Instead, you MUST use `jq` to efficiently parse and extract the relevant fields from the JSON file saved by the script. If `jq` is not available on the system, write your own Python filtering code (e.g., `python3 -c "import json..."`) to extract the necessary data.
For a complete reference of the JSON structure returned by each command (so you know which fields to query with `jq`), read `references/json_output_structure.md`.
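When `jq` is unavailable, the Python fallback mentioned above might look like the sketch below. It mirrors the jq filter for Type A biosamples and assumes the same JSON path (`data.ccREBiosampleQuery.biosamples`); the function name `type_a_display_names` is hypothetical.

```python
import json
from pathlib import Path


def type_a_display_names(json_path: str) -> list[str]:
    """Python counterpart of the jq Type A filter (hypothetical helper).

    Assumes the layout implied by the jq expression:
    data -> ccREBiosampleQuery -> biosamples -> [{displayname, is_type_a}, ...]
    """
    with Path(json_path).open(encoding="utf-8") as fh:
        payload = json.load(fh)
    biosamples = payload["data"]["ccREBiosampleQuery"]["biosamples"]
    # Keep only entries whose flag is exactly true, like `select(.is_type_a == true)`.
    return [b["displayname"] for b in biosamples if b.get("is_type_a") is True]
```

For example, `print(type_a_display_names("/tmp/biosamples.json")[:20])` prints at most 20 names. The file is still parsed in full inside the Python process, but only the extracted names reach the output.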

Available Commands

  • search: Search cCREs by coordinates, accessions, or epigenetic signals.
    bash
    uv run scripts/screen_api.py search \
        --chromosome chr11 --start 5205263 --end 5207263 \
        --output /tmp/search.json
  • nearby-genes: Find nearby genes for given cCRE accessions.
    bash
    uv run scripts/screen_api.py nearby-genes \
        EH38E1516972 --output /tmp/nearby.json
  • details: Get detailed information and biosample-specific max Z-scores for a specific cCRE.
    bash
    uv run scripts/screen_api.py details EH38E2941922 \
        --output /tmp/details.json
  • biosamples: Get biosample metadata for an assembly.
    bash
    uv run scripts/screen_api.py biosamples \
        --output /tmp/biosamples.json
  • orthologs: Get orthologous cCREs in another assembly.
    bash
    uv run scripts/screen_api.py orthologs EH38E2941922 \
        --output /tmp/orthologs.json
  • linked-genes: Find linked genes via methods like Hi-C or eQTLs.
    bash
    uv run scripts/screen_api.py linked-genes \
        EH38E1516972 --output /tmp/linked.json
  • gene-expression: Get gene expression (TPM) across all biosamples for a named gene. Internally resolves the gene symbol to an Ensembl gene ID, then queries per-biosample RNA-seq quantifications.
    bash
    uv run scripts/screen_api.py gene-expression GAPDH \
        --output /tmp/gene_expr.json
  • entex: Get ENTEx data for a cCRE or genomic region.
    bash
    uv run scripts/screen_api.py entex \
        --accession EH38E1310345 \
        --output /tmp/entex.json
    bash
    uv run scripts/screen_api.py entex \
        --region chr1:1000068:1000409 \
        --output /tmp/entex.json
  • gwas: Query genome-wide association studies, SNPs, or enrichment data.
    bash
    uv run scripts/screen_api.py gwas studies \
        --output /tmp/gwas.json
    bash
    uv run scripts/screen_api.py gwas snps --study \
        Ahola-Olli_AV-27989323-Eotaxin_levels \
        --output /tmp/gwas_snps.json
You can supply the `--assembly mm10` or `--assembly grch38` flag to explicitly request a specific assembly for most commands. By default, the script targets `grch38` but will automatically fall back to `mm10` if no results are found or if the query fails.
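If the wrapper is driven from Python rather than typed by hand, pinning the assembly explicitly avoids relying on the automatic fallback. The sketch below only assembles the argument list for the `search` subcommand. `build_search_command` is a hypothetical helper, and it assumes `search` is one of the commands that accepts `--assembly`.

```python
import shlex
from typing import Optional

# Assembly names taken from the paragraph above.
KNOWN_ASSEMBLIES = {"grch38", "mm10"}


def build_search_command(chromosome: str, start: int, end: int, output: str,
                         assembly: Optional[str] = None) -> list[str]:
    """Build the argv for a coordinate search (hypothetical helper)."""
    cmd = ["uv", "run", "scripts/screen_api.py", "search",
           "--chromosome", chromosome,
           "--start", str(start), "--end", str(end)]
    if assembly is not None:
        if assembly not in KNOWN_ASSEMBLIES:
            raise ValueError(f"unknown assembly: {assembly!r}")
        # Assumption: the search subcommand accepts --assembly.
        cmd += ["--assembly", assembly]
    cmd += ["--output", output]
    return cmd


cmd = build_search_command("chr11", 5205263, 5207263, "/tmp/search.json",
                           assembly="grch38")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the skill directory. Passing a list rather than a single string avoids shell quoting issues.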

ENCODE Portal REST API (Direct Access)

To access raw experiments, ChIP-seq peaks, or other datasets that are not represented as cCREs in SCREEN, use the `scripts/encode_portal_api.py` script. It allows custom queries to the ENCODE Portal REST API.

Usage

bash
uv run scripts/encode_portal_api.py search "type=Experiment&target.label=ZNF549" --output /tmp/znf549_experiments.json
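The query argument in the command above is an ordinary URL query string. When filter values contain spaces or reserved characters, building the string programmatically is less error-prone than writing it by hand. A minimal sketch using only the standard library follows; `build_portal_query` is a hypothetical helper, and the filters are the ones from the example above.

```python
from urllib.parse import urlencode


def build_portal_query(filters: dict[str, str]) -> str:
    """Encode search filters as a URL query string (hypothetical helper)."""
    # urlencode escapes reserved characters and keeps insertion order.
    return urlencode(filters)


query = build_portal_query({"type": "Experiment", "target.label": "ZNF549"})
print(query)  # type=Experiment&target.label=ZNF549
```

The result can be passed as the quoted argument to `encode_portal_api.py search`.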

Data Analysis Tips

When analyzing `.bed` or `.bigBed` files downloaded from ENCODE, standard bioinformatics tools are highly recommended for finding overlaps (e.g., between gene promoters and peaks):
  • bedtools: For fast mathematical operations on genomic intervals.
  • bigBedToBed: For converting binary BigBed files to readable BED format.
  • pybedtools: A Python wrapper for bedtools.
Write custom logic if these tools are not pre-installed.
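As one possible shape for such custom logic, the sketch below finds overlaps between two sets of BED intervals in pure Python. It assumes standard BED coordinates (0-based start, exclusive end) and reads only the first three columns. The names `Interval`, `parse_bed`, and `find_overlaps` are illustrative, and the pairwise comparison is only suitable for small inputs.

```python
from typing import Iterable, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive


def parse_bed(lines: Iterable[str]) -> list[Interval]:
    """Parse the first three columns of BED lines, skipping headers and blanks."""
    intervals = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append(Interval(chrom, int(start), int(end)))
    return intervals


def find_overlaps(a: list[Interval],
                  b: list[Interval]) -> list[tuple[Interval, Interval]]:
    """Return (a_interval, b_interval) pairs that share at least one base."""
    by_chrom: dict[str, list[Interval]] = {}
    for iv in b:
        by_chrom.setdefault(iv.chrom, []).append(iv)
    pairs = []
    for x in a:
        for y in by_chrom.get(x.chrom, []):
            # Half-open intervals overlap when each starts before the other ends.
            if x.start < y.end and y.start < x.end:
                pairs.append((x, y))
    return pairs
```

With real data, pass open file handles to `parse_bed`, for example one for promoter regions and one for peaks. Intervals that merely touch (one ends where the other starts) are not reported as overlapping.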

Custom Queries (SCREEN GraphQL)

If you need to make a complex GraphQL query that the script does not support, read `references/graphql_schema.md` for a reference of available queries, arguments, and return fields in the SCREEN GraphQL API.