ucsc-conservation-and-tfbs

Conservation Scores & TFBS Lookup (UCSC)

This skill provides access to evolutionary constraint scores and conserved elements from the UCSC Genome Browser. It retrieves scores from the PHAST package, specifically phastCons (identifying functional blocks) and phyloP (measuring individual site constraint), calculated from multiple alignments.
Use this skill to determine whether a non-coding variant hits a site that has not changed since a common ancestor (evidence of functional constraint, and a strong supporting signal for pathogenicity) or to find conservation peaks across a regulatory element.

Prerequisites

  1. uv: Read the uv skill and follow its Setup instructions to ensure uv is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://genome.ucsc.edu/conditions.html and https://genome.ucsc.edu/goldenPath/help/api.html, then (2) create the file recording the notification text and timestamp.
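The notification step can be sketched in Python as follows. This is a minimal illustration, not part of the skill: the function name ensure_license_notification, the wording of the notice, and the layout of the file (timestamp on the first line, notice text after it) are all assumptions. Only the file name and the two URLs come from the rule above.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical wording; only the two URLs are taken from the prerequisite.
NOTICE = (
    "Before using UCSC data, please review the terms at "
    "https://genome.ucsc.edu/conditions.html and "
    "https://genome.ucsc.edu/goldenPath/help/api.html"
)


def ensure_license_notification(skill_dir: Path) -> bool:
    """Show the notice once per skill directory.

    Returns True if the notice was shown on this call, False if the
    marker file already existed.
    """
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False
    # (1) Notify the user prominently.
    print("=" * 72)
    print(NOTICE)
    print("=" * 72)
    # (2) Record the notification text and a UTC timestamp.
    timestamp = datetime.now(timezone.utc).isoformat()
    marker.write_text(f"{timestamp}\n{NOTICE}\n", encoding="utf-8")
    return True
```

Calling the function a second time on the same directory returns False and prints nothing, which matches the "does not already exist" condition.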

Core Rules

  • Use the Wrapper: ALWAYS execute the provided helper scripts to query the database rather than accessing the database directly. The scripts enforce the required rate limit automatically.
  • Large Output Handling: Always pass --output to redirect output to a file. Parse it separately (using jq or your own code).
  • Notification: If this skill is used, ensure this is mentioned in the output.
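As one way to follow the Large Output Handling rule when writing your own code, the sketch below reports the shape of an output file instead of printing its contents. The structure of the scripts' JSON is not documented here, so the function (a hypothetical summarize_json) assumes only that the file is valid JSON.

```python
import json
from pathlib import Path


def summarize_json(path, max_keys: int = 10) -> str:
    """Return a one-line structural summary of a JSON file without dumping it."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Show only the first few keys so the summary stays short.
        keys = list(data)[:max_keys]
        return f"object with {len(data)} keys; first keys: {keys}"
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"
```

A call such as summarize_json("/tmp/cons_output.json") gives enough information to decide which fields to extract next, without flooding standard out.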

Utility Scripts

This skill includes scripts to query different types of genomic data from UCSC:
  1. scripts/get_conservation.py: For evolutionary conservation scores (phyloP, phastCons).
  2. scripts/get_tfbs.py: For transcription factor binding sites (TFBS).
  3. scripts/list_tracks.py: For listing available tracks by search term or group.
Use the hg38 genome assembly unless the user has specified otherwise.

Fetching Conservation for Specific Variants

Pass one or more single-base coordinates to get the evolutionary constraint at a single base or at a list of specific bases. This mode is best suited to single nucleotide variants (SNVs), and phyloP is the best metric for individual bases.
bash
uv run scripts/get_conservation.py --coordinates "chr1:215867804" "chr1:215867823" --output /tmp/cons_output.json
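When the variants come from a list rather than being typed by hand, the command line above can be assembled programmatically. The helper below is a hypothetical sketch (snv_conservation_command is not part of the skill). It uses only the --coordinates and --output flags shown in the example, and it builds the argument list without running anything.

```python
def snv_conservation_command(variants, output_path):
    """Build the argv for get_conservation.py from (chrom, position) pairs."""
    if not variants:
        raise ValueError("at least one variant is required")
    # Each SNV becomes a "chrom:position" coordinate string.
    coords = [f"{chrom}:{int(pos)}" for chrom, pos in variants]
    return [
        "uv", "run", "scripts/get_conservation.py",
        "--coordinates", *coords,
        "--output", str(output_path),
    ]
```

Passing the returned list to subprocess.run(cmd, check=True) would execute the query. The sketch stops at building the list so that nothing is sent to UCSC by accident.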

Fetching Regions and Conserved Elements

Pass a coordinate range to identify "conservation peaks" across a non-coding regulatory element (such as an enhancer), for example to check whether an ISM-predicted importance peak coincides with evolutionarily constrained sequence. phastCons is best for functional windows because its HMM smooths scores across neighboring bases. The --conserved-elements flag also retrieves predefined blocks under extreme constraint.
bash
uv run scripts/get_conservation.py --coordinates "chr8:11748914-11749085" --conserved-elements --output /tmp/region_cons.json

Lineage-Specific Constraints

You can control the evolutionary depth using the --collection flag. The default (vertebrate) uses the 100-vertebrate Multiz alignment for both hg38 and hg19, matching the UCSC Genome Browser's default comparative genomics tracks.

hg38 Collections

  • vertebrate (default): UCSC 100-vertebrate Multiz alignment. phyloP: phyloP100way, phastCons: phastCons100way.
  • mammal: Hiller Lab 470-way mammalian alignment. phyloP: phyloP470wayBW, phastCons: phastCons470way.
  • primate: UCSC 30-primate Multiz alignment. phyloP: phyloP30way, phastCons: phastCons30way.

hg19 Collections

  • vertebrate (default): UCSC 100-vertebrate Multiz alignment. phyloP: phyloP100way, phastCons: phastCons100way.
  • vertebrate46: UCSC 46-vertebrate Multiz alignment (legacy). phyloP: phyloP46wayAll, phastCons: phastCons46way.
  • mammal: 46-way placental mammal subset. phyloP: phyloP46wayPlacental, phastCons: phastCons46wayPlacental.
  • primate: 46-way primate subset. phyloP: phyloP46wayPrimates, phastCons: phastCons46wayPrimates.
bash
# hg38 mammal (Hiller 470-way)
uv run scripts/get_conservation.py --coordinates "chr5:1045330-1046172" --collection mammal --output /tmp/mammal_cons.json
# hg19 with legacy 46-vertebrate alignment
uv run scripts/get_conservation.py --coordinates "chr5:1045330-1046172" --genome hg19 --collection vertebrate46 --output /tmp/vert46_cons.json
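The collection lists above amount to a lookup from (genome, collection) to a pair of track names. A sketch of that table in Python is shown below. The track names are copied from the lists, while the TRACKS dictionary and the tracks_for function are illustrative names and do not describe how scripts/get_conservation.py is actually implemented.

```python
# (phyloP track, phastCons track) for each genome and collection listed above.
TRACKS = {
    "hg38": {
        "vertebrate": ("phyloP100way", "phastCons100way"),
        "mammal": ("phyloP470wayBW", "phastCons470way"),
        "primate": ("phyloP30way", "phastCons30way"),
    },
    "hg19": {
        "vertebrate": ("phyloP100way", "phastCons100way"),
        "vertebrate46": ("phyloP46wayAll", "phastCons46way"),
        "mammal": ("phyloP46wayPlacental", "phastCons46wayPlacental"),
        "primate": ("phyloP46wayPrimates", "phastCons46wayPrimates"),
    },
}


def tracks_for(genome: str = "hg38", collection: str = "vertebrate"):
    """Return (phyloP_track, phastCons_track) or raise ValueError."""
    collections = TRACKS.get(genome)
    if collections is None:
        raise ValueError(f"unknown genome {genome!r}; expected one of {sorted(TRACKS)}")
    if collection not in collections:
        raise ValueError(
            f"collection {collection!r} is not listed for {genome}; "
            f"expected one of {sorted(collections)}"
        )
    return collections[collection]
```

Note that vertebrate46 is listed only for hg19, so tracks_for("hg38", "vertebrate46") raises an error rather than silently falling back to another alignment.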

Analyzing Evolutionary Acceleration

To analyze whether a specific locus is undergoing evolutionary acceleration (i.e., evolving more rapidly than the neutral drift baseline), use --analyze. This will compute scalar statistics (mean, min, max) for phyloP scores and provide a heuristic boolean is_accelerated to simplify your evaluation.
bash
uv run scripts/get_conservation.py --coordinates "chr5:1045330-1046172" --analyze --output /tmp/accelerated_cons.json
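To make the idea concrete, here is a sketch of what such a summary could look like. In phyloP, positive scores indicate slower-than-neutral evolution (conservation) and negative scores indicate faster-than-neutral evolution (acceleration). The rule used by the script is not documented in this skill, so the choice of the mean and the threshold of 0.0 below are assumptions made purely for illustration.

```python
from statistics import mean


def summarize_phylop(scores, threshold: float = 0.0):
    """Compute mean/min/max of phyloP scores plus a simple acceleration flag.

    Assumption: a region is flagged when its mean phyloP score falls below
    `threshold`. The real --analyze heuristic may differ.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    avg = mean(scores)
    return {
        "mean": avg,
        "min": min(scores),
        "max": max(scores),
        "is_accelerated": avg < threshold,
    }
```

A single boolean hides a lot of detail, so it is worth checking min and max as well: a region with a few strongly negative bases and many mildly positive ones can still have a mean above zero.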

Fetching Transcription Factor Binding Sites (TFBS)

Use scripts/get_tfbs.py to identify transcription factor binding sites in a given genomic interval. This is useful for interpreting non-coding variants that might disrupt TF binding.
Run scripts/get_tfbs.py with --coordinates and --tracks. You can query multiple tracks at once.
bash
uv run scripts/get_tfbs.py --coordinates "chr11:1001000-1010000" --tracks encRegTfbsClustered --output /tmp/tfbs_encode.json
JASPAR tracks may return very large result sets. Use --tf-filter to keep only items whose TFName field contains the given substring (case-insensitive):
bash
uv run scripts/get_tfbs.py --coordinates "chr6:36670000-36690000" --tracks jaspar2024 --tf-filter TP53 --output /tmp/tp53_sites.json
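The filtering rule described above (case-insensitive substring match on TFName) can be sketched as below. The function name filter_by_tf and the assumption that each item is a dictionary are hypothetical. Only the TFName field name and the matching rule come from the text.

```python
def filter_by_tf(items, substring: str):
    """Keep items whose TFName contains `substring`, ignoring case."""
    needle = substring.lower()
    return [
        item for item in items
        # Items without a TFName field are treated as non-matching.
        if needle in str(item.get("TFName", "")).lower()
    ]
```

Because this is a substring match rather than an exact match, a short filter can keep more than one factor: filtering made-up records named TP53, TP63 and CTCF with "tp" keeps both TP53 and TP63.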

Common Verified Tracks (hg38)

  • ENCODE: encRegTfbsClustered (TF Clusters)
  • JASPAR: jaspar2026, jaspar2024 (Predicted TFBS)
  • ReMap: ReMapTFs (ChIP-seq Atlas)
[!CAUTION] Bare track names such as jaspar or ReMap (without a year or other suffix) are often "container" tracks and will fail with a 400 error. Always use the specific subtrack name (e.g., jaspar2026).

Listing Available Tracks

To list available tracks, for example to find the different versions of JASPAR or simply to discover which tracks exist for a particular genome assembly:
bash
uv run scripts/list_tracks.py --search "jaspar" --output /tmp/jaspar_tracks.json
You can also filter by functional group:
bash
uv run scripts/list_tracks.py --group "regulation" --output /tmp/regulation_tracks.json

Anti-Patterns

  • DON'T query mammalian (--collection mammal) constraint if you are explicitly looking for deep evolutionary roots across all vertebrates. Use the default vertebrate collection.
  • DON'T use this skill to reconstruct the ancestral state of a nucleotide (it measures how much sites have changed, not what the ancestral nucleotide was).
  • DON'T assume low conservation strictly means neutral or non-functional sequence; it could also reflect a high local mutation rate, which conservation scores alone cannot distinguish.
  • DON'T print output to standard out or run cat on the output files. The output is too large. Use jq or write your own code to parse the output files.
  • DON'T use hg19 unless the user has explicitly asked for it. hg38 is the default.