tooluniverse-sequence-retrieval

Biological Sequence Retrieval


Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.
IMPORTANT: Always use English terms in tool calls (gene names, organism names, sequence descriptions), even if the user writes in another language. Only try original-language terms as a fallback if English returns no results. Respond in the user's language.

Workflow Overview


Phase 0: Clarify (if needed)
Phase 1: Disambiguate Gene/Organism
Phase 2: Search & Retrieve (Internal)
Phase 3: Report Sequence Profile


Phase 0: Clarification (When Needed)


Ask the user ONLY if:
  • Gene name exists in multiple organisms (e.g., "BRCA1" → human or mouse?)
  • Sequence type unclear (mRNA, genomic, protein?)
  • Strain/isolate matters (e.g., E. coli → K-12, O157:H7, etc.)
Skip clarification for:
  • Specific accession numbers (NC_, NM_, U*, etc.)
  • Clear organism + gene combinations
  • Complete genome requests with organism specified
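The rules above can be sketched as a small pre-check. This is a minimal illustration, not part of ToolUniverse: `needs_clarification` and `ACCESSION_RE` are hypothetical names, the regular expression is only a rough guess at what "looks like an accession", and strain ambiguity is not covered.

```python
import re

# Hypothetical pattern: RefSeq-style (two letters + underscore) or
# GenBank-style (one or two letters + digits), with an optional version.
ACCESSION_RE = re.compile(
    r"^(?:[A-Z]{2}_[A-Z]{0,4}\d+|[A-Z]{1,2}\d{5,8})(?:\.\d+)?$"
)


def needs_clarification(accession=None, organism=None, gene=None, seq_type=None):
    """Return the questions to ask before searching (empty list = skip Phase 0)."""
    if accession and ACCESSION_RE.match(accession.strip().upper()):
        return []  # a specific accession can be retrieved directly
    questions = []
    if not organism:
        questions.append("Which organism (scientific name)?")
    if not seq_type:
        questions.append("Which sequence type (mRNA, genomic, protein)?")
    return questions
```

For example, `needs_clarification(gene="BRCA1")` yields both questions, while a call with organism, gene, and sequence type all set yields none.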


Phase 1: Gene/Organism Disambiguation


1.1 Resolve Identifiers


python
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()

# Strategy depends on input type.
# user_provided_accession, user_provided_gene_and_organism, organism, and gene
# are placeholders for values parsed from the user's request.
if user_provided_accession:
    # Direct retrieval based on accession type
    accession = user_provided_accession
elif user_provided_gene_and_organism:
    # Search NCBI Nucleotide
    result = tu.tools.NCBI_search_nucleotide(
        operation="search",
        organism=organism,
        gene=gene,
        limit=10
    )

1.2 Accession Type Decision Tree


CRITICAL: Accession prefix determines which tools to use.
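
| Prefix | Type | Use With |
|---|---|---|
| NC_* | RefSeq chromosome | NCBI only |
| NM_* | RefSeq mRNA | NCBI only |
| NR_* | RefSeq ncRNA | NCBI only |
| NP_* | RefSeq protein | NCBI only |
| XM_* | RefSeq predicted mRNA | NCBI only |
| NZ_* | RefSeq genome (complete or WGS) | NCBI only |
| U*, M*, K*, X* | GenBank | NCBI or ENA |
| CP* | GenBank genome | NCBI or ENA |
| EMBL format | EMBL | ENA preferred |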
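
As a sketch of how the prefix table might drive tool selection, the helper below returns the databases worth trying for an accession. `allowed_sources` is a hypothetical name. The check relies on the RefSeq accession format (a two-letter prefix followed by an underscore), so it also treats prefixes such as `NZ_` and `XP_` as RefSeq.

```python
def allowed_sources(accession):
    """Return the databases that can serve this accession, in preferred order."""
    acc = accession.strip().upper()
    # RefSeq accessions are two letters plus an underscore (NC_, NM_, NP_, ...).
    is_refseq = len(acc) > 2 and acc[2] == "_"
    if is_refseq:
        return ["NCBI"]
    # GenBank/EMBL accessions are mirrored, so either database can be used.
    return ["NCBI", "ENA"]
```

With this helper, `allowed_sources("NM_007294.4")` gives `["NCBI"]` and `allowed_sources("U00096.3")` gives `["NCBI", "ENA"]`.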

1.3 Identity Resolution Checklist


  • Organism confirmed (scientific name)
  • Gene symbol/name identified
  • Sequence type determined (genomic/mRNA/protein)
  • Strain specified (if relevant)
  • Accession prefix identified → tool selection


Phase 2: Data Retrieval (Internal)


Retrieve silently. Do NOT narrate the search process.

2.1 Search for Sequences


python
# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism=organism,
    gene=gene,
    strain=strain,      # Optional
    keywords=keywords,  # Optional
    seq_type=seq_type,  # "complete_genome", "mrna", or "refseq"
    limit=10
)

# Get accession numbers from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession",
    uids=result["data"]["uids"]
)
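
The search step indexes `result["data"]["uids"]` directly, which fails if the search returns an error payload or no hits. A defensive helper might look like the sketch below. `extract_uids` is a hypothetical name, and the only shape assumed is the `{"data": {"uids": [...]}}` layout used above; what the tool returns on failure is not specified here.

```python
def extract_uids(result):
    """Return the UID list from a search result, or [] when it is missing."""
    if not isinstance(result, dict):
        return []
    data = result.get("data")
    if not isinstance(data, dict):
        return []
    uids = data.get("uids")
    return list(uids) if uids else []
```

An empty list is the cue to broaden the search rather than call `NCBI_fetch_accessions`.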

2.2 Retrieve Sequence Data


python
# Get sequence in desired format
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="fasta"  # or "genbank"
)

# GenBank format for annotations
annotations = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="genbank"
)

2.3 ENA Alternative (for GenBank/EMBL accessions)


python
# Only for non-RefSeq accessions!
# NZ_ and XP_ are also RefSeq prefixes, so they are excluded here as well.
REFSEQ_PREFIXES = ("NC_", "NM_", "NR_", "NP_", "NZ_", "XM_", "XP_", "XR_")

if not accession.startswith(REFSEQ_PREFIXES):
    # ENA entry info
    entry = tu.tools.ena_get_entry(accession=accession)

    # ENA FASTA
    fasta = tu.tools.ena_get_sequence_fasta(accession=accession)

    # ENA summary
    summary = tu.tools.ena_get_entry_summary(accession=accession)

Fallback Chains


| Primary | Fallback | Notes |
|---|---|---|
| NCBI_get_sequence | ENA tools (GenBank/EMBL accessions only) | Use when NCBI is unavailable |
| ena_get_entry | NCBI_get_sequence | Use when ENA lacks the record (ENA has no RefSeq) |
| NCBI_search_nucleotide | Retry with broader keywords | Use when the search returns no results |

Critical Rule: Never try ENA tools with RefSeq accessions (NC_, NM_, etc.); they will return 404 errors.
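
The fallback chain and the critical rule can be combined in one wrapper. The sketch below is illustrative only: `fetch_with_fallback` is a hypothetical name, the two fetchers are passed in as plain callables (for example thin wrappers around `NCBI_get_sequence` and `ena_get_sequence_fasta`), and it assumes failures surface as exceptions, which may not match how the real tools report errors.

```python
def fetch_with_fallback(accession, ncbi_fetch, ena_fetch):
    """Try NCBI first; fall back to ENA only for non-RefSeq accessions.

    Returns a (source, payload) tuple.
    """
    # RefSeq accessions are two letters plus an underscore (NC_, NM_, ...).
    is_refseq = len(accession) > 2 and accession[2] == "_"
    try:
        return "NCBI", ncbi_fetch(accession)
    except Exception:
        if is_refseq:
            raise  # ENA has no RefSeq records, so there is nothing to fall back to
    return "ENA", ena_fetch(accession)
```

Passing the fetchers as arguments keeps the routing rule testable without network access.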


Phase 3: Report Sequence Profile


Output Structure


Present as a Sequence Profile Report. Hide search process.
markdown

Sequence Profile: [Gene/Organism]


Search Summary
  • Query: [gene] in [organism]
  • Database: NCBI Nucleotide
  • Results: [N] sequences found


Primary Sequence


[Accession]: [Definition/Title]


| Attribute | Value |
|---|---|
| Accession | [accession] |
| Type | RefSeq / GenBank |
| Organism | [scientific name] |
| Strain | [strain if applicable] |
| Length | [X,XXX bp / aa] |
| Molecule | DNA / mRNA / Protein |
| Topology | Linear / Circular |

Curation Level: ●●●● RefSeq Reference / ●●●○ RefSeq Predicted / ●●○○ GenBank Validated / ●○○○ GenBank Direct / ○○○○ Third Party

Sequence Statistics


| Statistic | Value |
|---|---|
| Length | [X,XXX] bp |
| GC Content | [XX.X]% |
| Genes | [N] (if genome) |
| CDS | [N] (if annotated) |

Sequence Preview


fasta
>[accession] [definition]
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
... [truncated, full sequence in download]

Annotations Summary (from GenBank format)


| Feature | Count | Examples |
|---|---|---|
| CDS | [N] | [gene names] |
| tRNA | [N] | - |
| rRNA | [N] | 16S, 23S |
| Regulatory | [N] | promoters |

Alternative Sequences


Ranked by relevance and curation level:

| Accession | Type | Length | Description | ENA Compatible |
|---|---|---|---|---|
| NC_000913.3 | RefSeq | 4.6 Mb | E. coli K-12 reference | No |
| U00096.3 | GenBank | 4.6 Mb | E. coli K-12 | Yes |
| CP001509.3 | GenBank | 4.6 Mb | E. coli BL21(DE3) | Yes |

Cross-Database References


| Database | Accession | Link |
|---|---|---|
| RefSeq | [NC_*] | [NCBI link] |
| GenBank | [U*] | [NCBI link] |
| ENA/EMBL | [same as GenBank] | [ENA link] |
| BioProject | [PRJNA*] | [link] |
| BioSample | [SAMN*] | [link] |

Download Options


Formats Available


| Format | Description | Use Case |
|---|---|---|
| FASTA | Sequence only | BLAST, alignment |
| GenBank | Sequence + annotations | Gene analysis |
| GFF3 | Annotations only | Genome browsers |

Direct Commands


python
# FASTA format
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="fasta"
)

# GenBank format (with annotations)
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="genbank"
)

---

Related Sequences


Other Strains/Isolates


| Accession | Strain | Similarity | Notes |
|---|---|---|---|
| [acc1] | [strain1] | 99.9% | [notes] |
| [acc2] | [strain2] | 99.5% | [notes] |

Protein Products (if applicable)


| Protein Accession | Product Name | Length |
|---|---|---|
| [NP_*] | [protein name] | [X] aa |

Retrieved: [date] Database: NCBI Nucleotide

---
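
The Length, GC Content, and Sequence Preview fields in the template above can be derived from FASTA text. The following is a minimal sketch under the assumption that the FASTA output is available as a plain string. `parse_fasta`, `sequence_stats`, and `preview` are hypothetical helper names, not ToolUniverse functions.

```python
def parse_fasta(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    header, chunks, seen_header = "", [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # stop at the second record
            header, seen_header = line[1:], True
        else:
            chunks.append(line)
    return header, "".join(chunks).upper()


def sequence_stats(seq):
    """Compute length and GC content (percent, one decimal place)."""
    length = len(seq)
    gc = seq.count("G") + seq.count("C")
    gc_percent = round(100.0 * gc / length, 1) if length else 0.0
    return {"length": length, "gc_percent": gc_percent}


def preview(seq, width=60, max_lines=2):
    """Return the first lines of the sequence plus a truncation marker."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    shown = lines[:max_lines]
    if len(lines) > max_lines:
        shown.append("... [truncated, full sequence in download]")
    return "\n".join(shown)
```

GC content only makes sense for nucleotide records; for protein accessions the report would carry the length in aa and omit that row.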

Curation Level Tiers


| Tier | Symbol | Accession Prefix | Description |
|---|---|---|---|
| RefSeq Reference | ●●●● | NC_, NM_, NP_ | NCBI-curated, gold standard |
| RefSeq Predicted | ●●●○ | XM_, XP_, XR_ | Computationally predicted |
| GenBank Validated | ●●○○ | Various | Submitted, some curation |
| GenBank Direct | ●○○○ | Various | Direct submission |
| Third Party | ○○○○ | TPA_ | Third-party annotation |

Include in report:
markdown
**Curation Level**: ●●●● RefSeq Reference
- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use
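
A prefix lookup is one way to produce the curation line. The sketch below is an assumption-laden illustration: `curation_line` is a hypothetical name, and because the two GenBank tiers share "Various" prefixes, anything unrecognized defaults to the lower GenBank Direct tier. That default is a conservative guess, not a rule stated in this document.

```python
# (prefixes, symbol, tier name) taken from the tier table.
CURATION_TIERS = [
    (("NC_", "NM_", "NP_"), "●●●●", "RefSeq Reference"),
    (("XM_", "XP_", "XR_"), "●●●○", "RefSeq Predicted"),
    (("TPA_",), "○○○○", "Third Party"),
]


def curation_line(accession):
    """Return the markdown curation line for an accession."""
    acc = accession.strip().upper()
    for prefixes, symbol, name in CURATION_TIERS:
        if acc.startswith(prefixes):
            return f"**Curation Level**: {symbol} {name}"
    # Assumption: GenBank tiers cannot be told apart by prefix alone.
    return "**Curation Level**: ●○○○ GenBank Direct"
```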


Completeness Checklist


Every sequence report MUST include:

Per Sequence (Required)


  • Accession number
  • Organism (scientific name)
  • Sequence type (DNA/RNA/protein)
  • Length
  • Curation level
  • Database source

Search Summary (Required)


  • Query parameters
  • Number of results
  • Ranking rationale

Include Even If Limited


  • Alternative sequences (or "Only one sequence found")
  • Cross-database references (or "No cross-references available")
  • Download instructions
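
If the report is assembled as a dictionary before rendering, the required items can be checked mechanically. The layout below (`search_summary`, `sequences`, and the field names) is entirely hypothetical and chosen only to mirror the checklist. `missing_fields` is not a ToolUniverse function.

```python
REQUIRED_SEQUENCE_FIELDS = (
    "accession", "organism", "sequence_type",
    "length", "curation_level", "database",
)
REQUIRED_SUMMARY_FIELDS = ("query", "result_count", "ranking_rationale")


def missing_fields(report):
    """List the required checklist items that are absent or empty."""
    missing = []
    summary = report.get("search_summary") or {}
    for key in REQUIRED_SUMMARY_FIELDS:
        if summary.get(key) in (None, ""):
            missing.append(f"search_summary.{key}")
    sequences = report.get("sequences") or []
    if not sequences:
        missing.append("sequences")
    for index, seq in enumerate(sequences):
        for key in REQUIRED_SEQUENCE_FIELDS:
            if seq.get(key) in (None, ""):
                missing.append(f"sequences[{index}].{key}")
    return missing
```

A result count of 0 passes the check, since reporting "no results" still satisfies the checklist.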


Common Use Cases


Reference Genome


User: "Get E. coli K-12 complete genome"
python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Escherichia coli",
    strain="K-12",
    seq_type="complete_genome",
    limit=3
)

Return NC_000913.3 (RefSeq reference).

Gene Sequence


User: "Find human BRCA1 mRNA"
python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Homo sapiens",
    gene="BRCA1",
    seq_type="mrna",
    limit=10
)

Specific Accession


User: "Get sequence for NC_045512.2" → Direct retrieval with full metadata

Strain Comparison


User: "Compare E. coli K-12 and O157:H7 genomes" → Search both strains, provide comparison table


Error Handling


| Error | Response |
|---|---|
| "No search criteria provided" | Add organism, gene, or keywords |
| "ENA 404 error" | Accession is likely RefSeq → use NCBI only |
| "No results found" | Broaden search, check spelling, try synonyms |
| "Sequence too large" | Note size, provide download link instead of preview |
| "API rate limit" | Tools auto-retry; if persistent, wait briefly |

Tool Reference


NCBI Tools (All Accessions)

| Tool | Purpose |
|---|---|
| NCBI_search_nucleotide | Search by gene/organism |
| NCBI_fetch_accessions | Convert UIDs to accessions |
| NCBI_get_sequence | Retrieve sequence data |

ENA Tools (GenBank/EMBL Only)

| Tool | Purpose |
|---|---|
| ena_get_entry | Entry metadata |
| ena_get_sequence_fasta | FASTA sequence |
| ena_get_entry_summary | Summary info |

Search Parameters Reference


NCBI_search_nucleotide

| Parameter | Description | Example |
|---|---|---|
| operation | Always "search" | "search" |
| organism | Scientific name | "Homo sapiens" |
| gene | Gene symbol | "BRCA1" |
| strain | Specific strain | "K-12" |
| keywords | Free text | "complete genome" |
| seq_type | Sequence type | "complete_genome", "mrna", "refseq" |
| limit | Max results | 10 |

NCBI_get_sequence

| Parameter | Description | Example |
|---|---|---|
| operation | Always "fetch_sequence" | "fetch_sequence" |
| accession | Accession number | "NC_000913.3" |
| format | Output format | "fasta", "genbank" |
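
The search parameters can be assembled with a small builder that drops unset optional values and catches the "No search criteria provided" case before the call. This is a sketch: `build_search_args` is a hypothetical helper, and it assumes the three `seq_type` values listed above are the only accepted ones, which this document does not confirm.

```python
VALID_SEQ_TYPES = {"complete_genome", "mrna", "refseq"}


def build_search_args(organism=None, gene=None, strain=None,
                      keywords=None, seq_type=None, limit=10):
    """Build keyword arguments for NCBI_search_nucleotide."""
    if not (organism or gene or keywords):
        raise ValueError("No search criteria provided: add organism, gene, or keywords")
    if seq_type is not None and seq_type not in VALID_SEQ_TYPES:
        raise ValueError(f"Unknown seq_type: {seq_type!r}")
    args = {"operation": "search", "limit": limit}
    optional = {
        "organism": organism, "gene": gene, "strain": strain,
        "keywords": keywords, "seq_type": seq_type,
    }
    # Only pass parameters the caller actually set.
    args.update({key: value for key, value in optional.items() if value})
    return args
```

The result would be passed as `tu.tools.NCBI_search_nucleotide(**build_search_args(organism="Homo sapiens", gene="BRCA1", seq_type="mrna"))`.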