tooluniverse-sequence-retrieval

Biological Sequence Retrieval

Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.

IMPORTANT: Always use English terms in tool calls (gene names, organism names, sequence descriptions), even if the user writes in another language. Only try original-language terms as a fallback if English returns no results. Respond in the user's language.

Workflow Overview

Phase 0: Clarify (if needed)
    ↓
Phase 1: Disambiguate Gene/Organism
    ↓
Phase 2: Search & Retrieve (Internal)
    ↓
Phase 3: Report Sequence Profile

Phase 0: Clarification (When Needed)

Ask the user ONLY if:

- Gene name exists in multiple organisms (e.g., "BRCA1" → human or mouse?)
- Sequence type is unclear (mRNA, genomic, protein?)
- Strain/isolate matters (e.g., E. coli → K-12, O157:H7, etc.)

Skip clarification for:

- Specific accession numbers (NC_, NM_, U*, etc.)
- Clear organism + gene combinations
- Complete genome requests with organism specified
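The clarification rules above can be sketched as a small pre-check. This is only an illustration: the regular expression is an assumed approximation of accession formats (RefSeq-style `XX_...` and classic GenBank-style letters plus digits), and `clarification_questions` with its parameters is a hypothetical helper, not part of ToolUniverse.

```python
import re

# Assumed approximation: RefSeq-style (two letters + underscore) or classic
# GenBank-style (1-2 letters + 5-8 digits), with an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^(?:[A-Z]{2}_[A-Z]*\d+|[A-Z]{1,2}\d{5,8})(?:\.\d+)?$")


def looks_like_accession(text: str) -> bool:
    """Return True if the query looks like an accession number."""
    return bool(ACCESSION_RE.match(text.strip().upper()))


def clarification_questions(query, organism=None, seq_type=None,
                            strain=None, strain_matters=False):
    """Return the questions to ask the user; an empty list means skip Phase 0."""
    if looks_like_accession(query):
        return []  # specific accession: no clarification needed
    questions = []
    if organism is None:
        questions.append("Which organism?")
    if seq_type is None:
        questions.append("Which sequence type (mRNA, genomic, protein)?")
    if strain_matters and strain is None:
        questions.append("Which strain or isolate?")
    return questions
```

An empty list corresponds to the "skip clarification" cases; anything else is asked before Phase 1 starts.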

Phase 1: Gene/Organism Disambiguation

1.1 Resolve Identifiers

python

from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()

# Strategy depends on input type.
# user_provided_accession, user_provided_gene_and_organism, organism, and gene
# are placeholders filled in from the parsed user request.
if user_provided_accession:
    # Direct retrieval based on accession type
    accession = user_provided_accession
elif user_provided_gene_and_organism:
    # Search NCBI Nucleotide
    result = tu.tools.NCBI_search_nucleotide(
        operation="search",
        organism=organism,
        gene=gene,
        limit=10
    )

1.2 Accession Type Decision Tree

CRITICAL: Accession prefix determines which tools to use.

Prefix	Type	Use With
NC_*	RefSeq chromosome	NCBI only
NM_*	RefSeq mRNA	NCBI only
NR_*	RefSeq ncRNA	NCBI only
NP_*	RefSeq protein	NCBI only
XM_*	RefSeq predicted mRNA	NCBI only
U, M, K, X	GenBank	NCBI or ENA
CP	GenBank genome	NCBI or ENA
NZ_*	RefSeq genome (RefSeq copy of a GenBank record)	NCBI only
EMBL format	EMBL	ENA preferred
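The decision tree above can be sketched as a routing function. It relies on the RefSeq convention of a two-letter prefix followed by an underscore, so any such accession is treated as NCBI-only; `route_accession` and `REFSEQ_TYPES` are hypothetical names introduced for illustration.

```python
# Labels taken from the prefix table above; other RefSeq prefixes get a generic label.
REFSEQ_TYPES = {
    "NC_": "RefSeq chromosome",
    "NM_": "RefSeq mRNA",
    "NR_": "RefSeq ncRNA",
    "NP_": "RefSeq protein",
    "XM_": "RefSeq predicted mRNA",
}


def route_accession(accession: str):
    """Return (record type, list of databases that can serve the accession)."""
    acc = accession.strip().upper()
    # RefSeq accessions: two letters, then an underscore (e.g. NC_000913.3).
    if len(acc) > 3 and acc[2] == "_":
        return REFSEQ_TYPES.get(acc[:3], "RefSeq (other)"), ["NCBI"]
    # Everything else is treated as a GenBank/EMBL accession.
    return "GenBank/EMBL", ["NCBI", "ENA"]
```

For example, `route_accession("U00096.3")` allows both NCBI and ENA, while `route_accession("NC_000913.3")` restricts retrieval to NCBI.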

1.3 Identity Resolution Checklist

Phase 2: Data Retrieval (Internal)

Retrieve silently. Do NOT narrate the search process.

2.1 Search for Sequences

python

# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism=organism,
    gene=gene,
    strain=strain,        # Optional
    keywords=keywords,    # Optional
    seq_type=seq_type,    # complete_genome, mrna, refseq
    limit=10
)

# Get accession numbers from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession",
    uids=result["data"]["uids"]
)

2.2 Retrieve Sequence Data

python

# Get sequence in desired format
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="fasta"  # or "genbank"
)

# GenBank format for annotations
annotations = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="genbank"
)

2.3 ENA Alternative (for GenBank/EMBL accessions)

python

# Only for non-RefSeq accessions!
# RefSeq prefixes (two letters + underscore) are not served by ENA.
REFSEQ_PREFIXES = ("NC_", "NM_", "NR_", "NP_", "XM_", "XR_", "XP_", "NZ_")

if not accession.startswith(REFSEQ_PREFIXES):
    # ENA entry info
    entry = tu.tools.ena_get_entry(accession=accession)

    # ENA FASTA
    fasta = tu.tools.ena_get_sequence_fasta(accession=accession)

    # ENA summary
    summary = tu.tools.ena_get_entry_summary(accession=accession)

Fallback Chains

Primary	Fallback	Notes
NCBI_get_sequence	ENA tools (GenBank/EMBL accessions only)	NCBI is unavailable
ena_get_entry	NCBI_get_sequence	ENA returns nothing (ENA doesn't have RefSeq)
NCBI_search_nucleotide	Retry with broader keywords	Search returns no results

Critical Rule: Never try ENA tools with RefSeq accessions (NC_, NM_, etc.) - they will return 404 errors.
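A minimal sketch of the NCBI-first and broaden-on-empty rows of the fallback table, written against injected callables so it does not depend on how the tools report failure. It assumes the fetchers raise an exception on failure and that `search` returns a list of UIDs; `fetch_with_fallback` and `search_with_broadening` are hypothetical helpers, not ToolUniverse functions.

```python
def is_refseq(accession: str) -> bool:
    """RefSeq accessions use a two-letter prefix followed by an underscore."""
    acc = accession.strip()
    return len(acc) > 3 and acc[2] == "_"


def fetch_with_fallback(accession, fetch_ncbi, fetch_ena):
    """Try NCBI first; use ENA only if NCBI fails and the accession is not RefSeq."""
    try:
        return "NCBI", fetch_ncbi(accession)
    except Exception:
        if is_refseq(accession):
            raise  # ENA would only answer with a 404 for a RefSeq accession
        return "ENA", fetch_ena(accession)


def search_with_broadening(search, organism, gene=None, strain=None, seq_type=None):
    """Retry the search with fewer filters until it returns something."""
    attempts = [
        {"organism": organism, "gene": gene, "strain": strain, "seq_type": seq_type},
        {"organism": organism, "gene": gene, "strain": strain},
        {"organism": organism, "gene": gene},
    ]
    tried = []
    for params in attempts:
        params = {k: v for k, v in params.items() if v is not None}
        if params in tried:
            continue  # identical to an earlier attempt, skip it
        tried.append(params)
        uids = search(**params)
        if uids:
            return params, uids
    return None, []
```

In practice `fetch_ncbi` could wrap `tu.tools.NCBI_get_sequence(...)` and `fetch_ena` could wrap `tu.tools.ena_get_sequence_fasta(...)`; the wrappers would need to turn an error response into an exception for this sketch to apply.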

Phase 3: Report Sequence Profile

Output Structure

Present as a Sequence Profile Report. Hide search process.

markdown

Sequence Profile: [Gene/Organism]

Search Summary

Query: [gene] in [organism]
Database: NCBI Nucleotide
Results: [N] sequences found

Primary Sequence

[Accession]: [Definition/Title]

Attribute	Value
Accession	[accession]
Type	RefSeq / GenBank
Organism	[scientific name]
Strain	[strain if applicable]
Length	[X,XXX bp / aa]
Molecule	DNA / mRNA / Protein
Topology	Linear / Circular

Curation Level: ●●●● RefSeq Reference / ●●●○ RefSeq Predicted / ●●○○ GenBank Validated / ●○○○ GenBank Direct / ○○○○ Third Party

Sequence Statistics

Statistic	Value
Length	[X,XXX] bp
GC Content	[XX.X]%
Genes	[N] (if genome)
CDS	[N] (if annotated)

Sequence Preview

fasta

>[accession] [definition]
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
... [truncated, full sequence in download]

Annotations Summary (from GenBank format)

Feature	Count	Examples
CDS	[N]	[gene names]
tRNA	[N]	-
rRNA	[N]	16S, 23S
Regulatory	[N]	promoters

Alternative Sequences

Ranked by relevance and curation level:

Accession	Type	Length	Description	ENA Compatible
NC_000913.3	RefSeq	4.6 Mb	E. coli K-12 reference	✗
U00096.3	GenBank	4.6 Mb	E. coli K-12	✓
CP001509.3	GenBank	4.6 Mb	E. coli DH10B	✓

Cross-Database References

Database	Accession	Link
RefSeq	[NC_*]	[NCBI link]
GenBank	[U*]	[NCBI link]
ENA/EMBL	[same as GenBank]	[ENA link]
BioProject	[PRJNA*]	[link]
BioSample	[SAMN*]	[link]

Download Options

Formats Available

Format	Description	Use Case
FASTA	Sequence only	BLAST, alignment
GenBank	Sequence + annotations	Gene analysis
GFF3	Annotations only	Genome browsers

Direct Commands

python

# FASTA format
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="fasta"
)

# GenBank format (with annotations)
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="genbank"
)

---

Related Sequences

Other Strains/Isolates

Accession	Strain	Similarity	Notes
[acc1]	[strain1]	99.9%	[notes]
[acc2]	[strain2]	99.5%	[notes]

Protein Products (if applicable)

Protein Accession	Product Name	Length
[NP_*]	[protein name]	[X] aa

Retrieved: [date] | Database: NCBI Nucleotide

---
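The Length, GC Content, and Sequence Preview fields of the template can be derived from a retrieved FASTA string. The sketch below assumes a single-record nucleotide FASTA held in a plain string; `parse_fasta`, `gc_content`, and `preview` are hypothetical helpers, and the line width and preview length are arbitrary choices.

```python
def parse_fasta(fasta: str):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [ln.strip() for ln in fasta.splitlines() if ln.strip()]
    if lines and lines[0].startswith(">"):
        return lines[0][1:], "".join(lines[1:]).upper()
    return "", "".join(lines).upper()


def gc_content(seq: str) -> float:
    """Percentage of G and C bases, rounded to one decimal place."""
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return round(100.0 * gc / len(seq), 1)


def preview(header: str, seq: str, width: int = 60, max_lines: int = 2) -> str:
    """Build a truncated FASTA preview in the style of the template."""
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    out = [">" + header] + chunks[:max_lines]
    if len(chunks) > max_lines:
        out.append("... [truncated, full sequence in download]")
    return "\n".join(out)
```

GC content is only meaningful for nucleotide records; for protein accessions the statistic would be omitted from the report.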

Curation Level Tiers

Tier	Symbol	Accession Prefix	Description
RefSeq Reference	●●●●	NC_, NM_, NP_	NCBI-curated, gold standard
RefSeq Predicted	●●●○	XM_, XP_, XR_	Computationally predicted
GenBank Validated	●●○○	Various	Submitted, some curation
GenBank Direct	●○○○	Various	Direct submission
Third Party	○○○○	TPA_	Third-party annotation

Include in report:

markdown

**Curation Level**: ●●●● RefSeq Reference
- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use
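The two RefSeq tiers can be assigned from the accession prefix alone, as sketched below. The GenBank tiers are listed with "Various" prefixes, so this hypothetical `curation_line` helper returns `None` for them and leaves that judgment to the record's metadata.

```python
# Prefix-to-tier mapping copied from the first two rows of the tier table.
TIERS = [
    (("NC_", "NM_", "NP_"), "●●●●", "RefSeq Reference"),
    (("XM_", "XP_", "XR_"), "●●●○", "RefSeq Predicted"),
]


def curation_line(accession: str):
    """Return the report line for a RefSeq accession, or None if the prefix is not decisive."""
    acc = accession.strip().upper()
    for prefixes, symbol, label in TIERS:
        if acc.startswith(prefixes):
            return f"**Curation Level**: {symbol} {label}"
    return None
```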

Completeness Checklist

Every sequence report MUST include:

Per Sequence (Required)

Search Summary (Required)

Include Even If Limited

- Alternative sequences (or "Only one sequence found")
- Cross-database references (or "No cross-references available")
- Download instructions

Common Use Cases

Reference Genome

User: "Get E. coli K-12 complete genome"

python

result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Escherichia coli",
    strain="K-12",
    seq_type="complete_genome",
    limit=3
)

# Return NC_000913.3 (RefSeq reference)

Gene Sequence

User: "Find human BRCA1 mRNA"

python

result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Homo sapiens",
    gene="BRCA1",
    seq_type="mrna",
    limit=10
)

Specific Accession

User: "Get sequence for NC_045512.2" → Direct retrieval with full metadata

Strain Comparison

User: "Compare E. coli K-12 and O157:H7 genomes" → Search both strains, provide comparison table
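One way to assemble the comparison table is sketched below. The per-strain lookup is injected as `fetch_summary`, a hypothetical callable that would wrap the search and retrieval steps and return a dict with `accession`, `length`, and `gc` keys; those field names are assumptions made for this example.

```python
def compare_strains(organism, strains, fetch_summary):
    """Build a markdown comparison table with one row per strain."""
    lines = [
        "| Strain | Accession | Length (bp) | GC (%) |",
        "| --- | --- | --- | --- |",
    ]
    for strain in strains:
        row = fetch_summary(organism, strain)  # one lookup per strain
        lines.append(
            f"| {strain} | {row['accession']} | {row['length']:,} | {row['gc']:.1f} |"
        )
    return "\n".join(lines)
```

Keeping the lookup separate from the formatting lets the same table builder serve any number of strains.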

Error Handling

Error	Response
"No search criteria provided"	Add organism, gene, or keywords
"ENA 404 error"	Accession is likely RefSeq → use NCBI only
"No results found"	Broaden search, check spelling, try synonyms
"Sequence too large"	Note size, provide download link instead of preview
"API rate limit"	Tools auto-retry; if persistent, wait briefly

Tool Reference

NCBI Tools (All Accessions)

Tool	Purpose
`NCBI_search_nucleotide`	Search by gene/organism
`NCBI_fetch_accessions`	Convert UIDs to accessions
`NCBI_get_sequence`	Retrieve sequence data

ENA Tools (GenBank/EMBL Only)

Tool	Purpose
`ena_get_entry`	Entry metadata
`ena_get_sequence_fasta`	FASTA sequence
`ena_get_entry_summary`	Summary info

Search Parameters Reference

NCBI_search_nucleotide

Parameter	Description	Example
`operation`	Always "search"	"search"
`organism`	Scientific name	"Homo sapiens"
`gene`	Gene symbol	"BRCA1"
`strain`	Specific strain	"K-12"
`keywords`	Free text	"complete genome"
`seq_type`	Sequence type	"complete_genome", "mrna", "refseq"
`limit`	Max results	10

NCBI_get_sequence

Parameter	Description	Example
`operation`	Always "fetch_sequence"	"fetch_sequence"
`accession`	Accession number	"NC_000913.3"
`format`	Output format	"fasta", "genbank"
