tooluniverse-sequence-retrieval
Biological Sequence Retrieval
Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.
IMPORTANT: Always use English terms in tool calls (gene names, organism names, sequence descriptions), even if the user writes in another language. Only try original-language terms as a fallback if English returns no results. Respond in the user's language.
Workflow Overview
Phase 0: Clarify (if needed)
↓
Phase 1: Disambiguate Gene/Organism
↓
Phase 2: Search & Retrieve (Internal)
↓
Phase 3: Report Sequence Profile
Phase 0: Clarification (When Needed)
Ask the user ONLY if:
- Gene name exists in multiple organisms (e.g., "BRCA1" → human or mouse?)
- Sequence type unclear (mRNA, genomic, protein?)
- Strain/isolate matters (e.g., E. coli → K-12, O157:H7, etc.)
Skip clarification for:
- Specific accession numbers (NC_, NM_, U*, etc.)
- Clear organism + gene combinations
- Complete genome requests with organism specified
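The rules above can be sketched as a small predicate. This is only an illustration and not part of ToolUniverse: the `SequenceRequest` dataclass, its field names, and the `clarification_questions` helper are hypothetical, and the accession pattern is a simplifying assumption that covers common RefSeq and GenBank shapes only.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed pattern: RefSeq (two letters + underscore) or GenBank-style
# (1-2 letters followed by 5-8 digits), each with an optional version suffix.
ACCESSION_RE = re.compile(r"^(?:[A-Z]{2}_[A-Z0-9]+|[A-Z]{1,2}\d{5,8})(?:\.\d+)?$")


@dataclass
class SequenceRequest:
    accession: Optional[str] = None
    gene: Optional[str] = None
    organism: Optional[str] = None
    seq_type: Optional[str] = None   # "genomic", "mrna", or "protein"
    strain: Optional[str] = None
    strain_matters: bool = False     # set by the caller, e.g. for E. coli


def clarification_questions(req: SequenceRequest) -> List[str]:
    """Return the questions to ask; an empty list means Phase 0 is skipped."""
    if req.accession and ACCESSION_RE.match(req.accession.upper()):
        return []  # a specific accession never needs clarification
    questions = []
    if req.gene and not req.organism:
        questions.append(f"Which organism should {req.gene} come from?")
    if req.gene and not req.seq_type:
        questions.append("Which sequence type: genomic, mRNA, or protein?")
    if req.strain_matters and not req.strain:
        questions.append("Which strain or isolate?")
    return questions
```

A request such as `SequenceRequest(gene="BRCA1")` yields two questions, while `SequenceRequest(accession="NC_045512.2")` yields none.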
Phase 1: Gene/Organism Disambiguation
1.1 Resolve Identifiers
```python
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Strategy depends on input type.
# user_provided_accession, user_provided_gene_and_organism, organism, and gene
# are placeholders for values parsed from the user's request.
if user_provided_accession:
    # Direct retrieval based on accession type
    accession = user_provided_accession
elif user_provided_gene_and_organism:
    # Search NCBI Nucleotide
    result = tu.tools.NCBI_search_nucleotide(
        operation="search",
        organism=organism,
        gene=gene,
        limit=10
    )
```
1.2 Accession Type Decision Tree
CRITICAL: Accession prefix determines which tools to use.
| Prefix | Type | Use With |
|---|---|---|
| NC_* | RefSeq chromosome | NCBI only |
| NM_* | RefSeq mRNA | NCBI only |
| NR_* | RefSeq ncRNA | NCBI only |
| NP_* | RefSeq protein | NCBI only |
| XM_* | RefSeq predicted mRNA | NCBI only |
| U*, M*, K*, X* | GenBank | NCBI or ENA |
| CP* | GenBank genome | NCBI or ENA |
| NZ_* | RefSeq genome (complete or WGS) | NCBI only |
| EMBL format | EMBL | ENA preferred |
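The decision tree can be reduced to one check on the accession's shape. The sketch below is an assumption-laden simplification, not a ToolUniverse function: `allowed_sources` is a hypothetical helper, it treats any accession that starts with two letters and an underscore as RefSeq, and it does not model the "ENA preferred" ordering for EMBL-format entries.

```python
import re

# RefSeq accessions share a distinctive shape: two letters, then an underscore.
REFSEQ_RE = re.compile(r"^[A-Z]{2}_")


def allowed_sources(accession: str) -> list:
    """Return the databases to try, in order, for an accession."""
    acc = accession.strip().upper()
    if REFSEQ_RE.match(acc):
        return ["NCBI"]         # RefSeq records: NCBI only
    return ["NCBI", "ENA"]      # GenBank/EMBL records: either database


for example in ("NC_000913.3", "U00096.3"):
    print(example, "->", allowed_sources(example))
```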
1.3 Identity Resolution Checklist
- Organism confirmed (scientific name)
- Gene symbol/name identified
- Sequence type determined (genomic/mRNA/protein)
- Strain specified (if relevant)
- Accession prefix identified → tool selection
Phase 2: Data Retrieval (Internal)
Retrieve silently. Do NOT narrate the search process.
2.1 Search for Sequences
```python
# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism=organism,
    gene=gene,
    strain=strain,        # Optional
    keywords=keywords,    # Optional
    seq_type=seq_type,    # complete_genome, mrna, refseq
    limit=10
)

# Get accession numbers from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession",
    uids=result["data"]["uids"]
)
```
2.2 Retrieve Sequence Data
```python
# Get sequence in desired format
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="fasta"  # or "genbank"
)

# GenBank format for annotations
annotations = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="genbank"
)
```
2.3 ENA Alternative (for GenBank/EMBL accessions)
```python
# Only for non-RefSeq accessions!
# XP_ and NZ_ are also RefSeq prefixes, so they are excluded here as well.
if not accession.startswith(("NC_", "NM_", "NR_", "NP_", "XM_", "XR_", "XP_", "NZ_")):
    # ENA entry info
    entry = tu.tools.ena_get_entry(accession=accession)
    # ENA FASTA
    fasta = tu.tools.ena_get_sequence_fasta(accession=accession)
    # ENA summary
    summary = tu.tools.ena_get_entry_summary(accession=accession)
```
Fallback Chains
| Primary | Fallback | Notes |
|---|---|---|
| NCBI_get_sequence | ENA (if GenBank format) | NCBI unavailable |
| ena_get_entry | NCBI_get_sequence | ENA doesn't have RefSeq |
| NCBI_search_nucleotide | Try broader keywords | No results |
Critical Rule: Never try ENA tools with RefSeq accessions (NC_, NM_, etc.) - they will return 404 errors.
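One way to express the fallback chain and the critical rule together is a loop over ordered fetchers that skips ENA for RefSeq accessions. This is a sketch under stated assumptions: `fetch_with_fallback` and the `(name, fetcher)` pairs are hypothetical, and in real use each fetcher would wrap a tool call such as `tu.tools.NCBI_get_sequence` or `tu.tools.ena_get_sequence_fasta`.

```python
from typing import Callable, Sequence, Tuple

Fetcher = Callable[[str], str]


def fetch_with_fallback(
    accession: str, fetchers: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Try each (name, fetcher) in order; return (source_name, payload)."""
    # RefSeq accessions have an underscore as their third character.
    is_refseq = len(accession) > 2 and accession[2] == "_"
    errors = []
    for name, fetch in fetchers:
        if is_refseq and name == "ENA":
            continue  # never send RefSeq accessions to ENA
        try:
            return name, fetch(accession)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise LookupError(f"All sources failed for {accession}: {'; '.join(errors)}")
```

With a failing NCBI fetcher, a GenBank accession falls through to ENA, while a RefSeq accession raises `LookupError` instead of producing an ENA 404.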
Phase 3: Report Sequence Profile
Output Structure
Present as a Sequence Profile Report. Hide search process.
````markdown
# Sequence Profile: [Gene/Organism]

**Search Summary**
- Query: [gene] in [organism]
- Database: NCBI Nucleotide
- Results: [N] sequences found

## Primary Sequence

### [Accession]: [Definition/Title]

| Attribute | Value |
|---|---|
| Accession | [accession] |
| Type | RefSeq / GenBank |
| Organism | [scientific name] |
| Strain | [strain if applicable] |
| Length | [X,XXX bp / aa] |
| Molecule | DNA / mRNA / Protein |
| Topology | Linear / Circular |

**Curation Level**: ●●●● RefSeq Reference / ●●●○ RefSeq Predicted / ●●○○ GenBank Validated / ●○○○ GenBank Direct / ○○○○ Third Party

## Sequence Statistics

| Statistic | Value |
|---|---|
| Length | [X,XXX] bp |
| GC Content | [XX.X]% |
| Genes | [N] (if genome) |
| CDS | [N] (if annotated) |

## Sequence Preview

```fasta
>[accession] [definition]
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
... [truncated, full sequence in download]
```

## Annotations Summary (from GenBank format)

| Feature | Count | Examples |
|---|---|---|
| CDS | [N] | [gene names] |
| tRNA | [N] | - |
| rRNA | [N] | 16S, 23S |
| Regulatory | [N] | promoters |

## Alternative Sequences

Ranked by relevance and curation level:

| Accession | Type | Length | Description | ENA Compatible |
|---|---|---|---|---|
| NC_000913.3 | RefSeq | 4.6 Mb | E. coli K-12 reference | ✗ |
| U00096.3 | GenBank | 4.6 Mb | E. coli K-12 | ✓ |
| CP001509.3 | GenBank | 4.6 Mb | E. coli DH10B | ✓ |

## Cross-Database References

| Database | Accession | Link |
|---|---|---|
| RefSeq | [NC_*] | [NCBI link] |
| GenBank | [U*] | [NCBI link] |
| ENA/EMBL | [same as GenBank] | [ENA link] |
| BioProject | [PRJNA*] | [link] |
| BioSample | [SAMN*] | [link] |

## Download Options

### Formats Available

| Format | Description | Use Case |
|---|---|---|
| FASTA | Sequence only | BLAST, alignment |
| GenBank | Sequence + annotations | Gene analysis |
| GFF3 | Annotations only | Genome browsers |

### Direct Commands

```python
# FASTA format
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="fasta"
)

# GenBank format (with annotations)
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="genbank"
)
```

---

## Related Sequences

### Other Strains/Isolates

| Accession | Strain | Similarity | Notes |
|---|---|---|---|
| [acc1] | [strain1] | 99.9% | [notes] |
| [acc2] | [strain2] | 99.5% | [notes] |

### Protein Products (if applicable)

| Protein Accession | Product Name | Length |
|---|---|---|
| [NP_*] | [protein name] | [X] aa |

Retrieved: [date]
Database: NCBI Nucleotide
````
Curation Level Tiers
| Tier | Symbol | Accession Prefix | Description |
|---|---|---|---|
| RefSeq Reference | ●●●● | NC_, NM_, NP_ | NCBI-curated, gold standard |
| RefSeq Predicted | ●●●○ | XM_, XP_, XR_ | Computationally predicted |
| GenBank Validated | ●●○○ | Various | Submitted, some curation |
| GenBank Direct | ●○○○ | Various | Direct submission |
| Third Party | ○○○○ | TPA_ | Third-party annotation |
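The tier table can be turned into a lookup that produces the report line. This is a hedged sketch, not part of the skill's tooling: `curation_label`, `is_tpa`, and `validated` are hypothetical names, only the RefSeq rows can be decided from the prefix alone, and prefixes the table does not list (for example NR_) fall through to the GenBank tiers unless the tuples are extended.

```python
TIERS = {
    "RefSeq Reference": "●●●●",
    "RefSeq Predicted": "●●●○",
    "GenBank Validated": "●●○○",
    "GenBank Direct": "●○○○",
    "Third Party": "○○○○",
}
REFERENCE_PREFIXES = ("NC_", "NM_", "NP_")
PREDICTED_PREFIXES = ("XM_", "XP_", "XR_")


def curation_label(accession: str, is_tpa: bool = False, validated: bool = False) -> str:
    """Build the 'Curation Level' report line for an accession."""
    acc = accession.upper()
    if acc.startswith(REFERENCE_PREFIXES):
        tier = "RefSeq Reference"
    elif acc.startswith(PREDICTED_PREFIXES):
        tier = "RefSeq Predicted"
    elif is_tpa:
        tier = "Third Party"          # caller must know the record is TPA
    elif validated:
        tier = "GenBank Validated"    # caller must know some curation occurred
    else:
        tier = "GenBank Direct"
    return f"**Curation Level**: {TIERS[tier]} {tier}"


print(curation_label("NC_000913.3"))
```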
Include in report:
```markdown
**Curation Level**: ●●●● RefSeq Reference
- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use
```
Completeness Checklist
Every sequence report MUST include:
Per Sequence (Required)
- Accession number
- Organism (scientific name)
- Sequence type (DNA/RNA/protein)
- Length
- Curation level
- Database source
Search Summary (Required)
- Query parameters
- Number of results
- Ranking rationale
Include Even If Limited
- Alternative sequences (or "Only one sequence found")
- Cross-database references (or "No cross-references available")
- Download instructions
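The checklist lends itself to a small validation step before a report is rendered. The sketch below is illustrative only: `finalize_report` and the dictionary keys are hypothetical names chosen to mirror the checklist items, not fields returned by any tool.

```python
REQUIRED_FIELDS = (
    "accession", "organism", "sequence_type",
    "length", "curation_level", "database",
)
# Placeholder text used when an optional section has no data.
FALLBACK_TEXT = {
    "alternatives": "Only one sequence found",
    "cross_references": "No cross-references available",
}


def finalize_report(record: dict) -> dict:
    """Reject records missing required fields; fill optional sections."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError(f"Report is missing required fields: {', '.join(missing)}")
    complete = dict(record)  # leave the caller's dict untouched
    for key, text in FALLBACK_TEXT.items():
        if not complete.get(key):
            complete[key] = text
    return complete
```

Filling the placeholders in one place keeps the "include even if limited" sections from being silently dropped.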
Common Use Cases
Reference Genome
User: "Get E. coli K-12 complete genome"
```python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Escherichia coli",
    strain="K-12",
    seq_type="complete_genome",
    limit=3
)
# Return NC_000913.3 (RefSeq reference)
```
Gene Sequence
User: "Find human BRCA1 mRNA"
```python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Homo sapiens",
    gene="BRCA1",
    seq_type="mrna",
    limit=10
)
```
Specific Accession
User: "Get sequence for NC_045512.2"
→ Direct retrieval with full metadata
Strain Comparison
User: "Compare E. coli K-12 and O157:H7 genomes"
→ Search both strains, provide comparison table
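Once both genomes are retrieved as FASTA text, the comparison table can be built from basic statistics. The following is a minimal sketch, assuming single-record FASTA strings as input: `parse_fasta`, `gc_content`, and `comparison_table` are hypothetical helpers, and the table covers only length and GC content, since a similarity figure would require an alignment step that is not shown here.

```python
from typing import Dict, Tuple


def parse_fasta(text: str) -> Tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("Expected a FASTA record starting with '>'")
    return lines[0][1:], "".join(lines[1:]).upper()


def gc_content(sequence: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    counted = [base for base in sequence if base in "ACGT"]
    if not counted:
        return 0.0
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)


def comparison_table(fasta_by_strain: Dict[str, str]) -> str:
    """Render a markdown table with one row per strain."""
    rows = ["| Strain | Accession | Length | GC Content |", "|---|---|---|---|"]
    for strain, fasta in fasta_by_strain.items():
        header, seq = parse_fasta(fasta)
        accession = header.split()[0]  # first token of the header line
        rows.append(
            f"| {strain} | {accession} | {len(seq):,} bp | {gc_content(seq):.1f}% |"
        )
    return "\n".join(rows)
```

For whole genomes the FASTA strings would come from `NCBI_get_sequence` with `format="fasta"`; the toy inputs used in testing are far shorter.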
Error Handling
| Error | Response |
|---|---|
| "No search criteria provided" | Add organism, gene, or keywords |
| "ENA 404 error" | Accession is likely RefSeq → use NCBI only |
| "No results found" | Broaden search, check spelling, try synonyms |
| "Sequence too large" | Note size, provide download link instead of preview |
| "API rate limit" | Tools auto-retry; if persistent, wait briefly |
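The table can be encoded as an ordered lookup from error text to the recommended response. This is a rough sketch: `suggest_response` is a hypothetical helper, and matching on lowercase substrings is an assumption, since the exact error strings returned by the tools are not specified here.

```python
# Ordered (substring, response) pairs mirroring the table above.
ERROR_RESPONSES = (
    ("no search criteria", "Add organism, gene, or keywords"),
    ("404", "Accession is likely RefSeq; use NCBI only"),
    ("no results", "Broaden search, check spelling, try synonyms"),
    ("too large", "Note size, provide download link instead of preview"),
    ("rate limit", "Wait briefly, then retry"),
)


def suggest_response(error_message: str) -> str:
    """Map an error message to the recommended next step."""
    text = error_message.lower()
    for needle, action in ERROR_RESPONSES:
        if needle in text:
            return action
    return "Unrecognized error; report it to the user"
```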
Tool Reference
NCBI Tools (All Accessions)
| Tool | Purpose |
|---|---|
| NCBI_search_nucleotide | Search by gene/organism |
| NCBI_fetch_accessions | Convert UIDs to accessions |
| NCBI_get_sequence | Retrieve sequence data |
ENA Tools (GenBank/EMBL Only)
| Tool | Purpose |
|---|---|
| ena_get_entry | Entry metadata |
| ena_get_sequence_fasta | FASTA sequence |
| ena_get_entry_summary | Summary info |
Search Parameters Reference
NCBI_search_nucleotide
| Parameter | Description | Example |
|---|---|---|
| operation | Always "search" | "search" |
| organism | Scientific name | "Homo sapiens" |
| gene | Gene symbol | "BRCA1" |
| strain | Specific strain | "K-12" |
| keywords | Free text | "complete genome" |
| seq_type | Sequence type | "complete_genome", "mrna", "refseq" |
| limit | Max results | 10 |
NCBI_get_sequence
| Parameter | Description | Example |
|---|---|---|
| operation | Always "fetch_sequence" | "fetch_sequence" |
| accession | Accession number | "NC_000913.3" |
| format | Output format | "fasta", "genbank" |
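The search parameters can be assembled by a helper that drops unset optional values and rejects empty queries before any tool call is made. This is a sketch under assumptions: `build_search_args` is hypothetical, and treating the three listed `seq_type` values as the complete set is an assumption based on the table above.

```python
from typing import Any, Dict, Optional

VALID_SEQ_TYPES = {"complete_genome", "mrna", "refseq"}


def build_search_args(
    organism: Optional[str] = None,
    gene: Optional[str] = None,
    strain: Optional[str] = None,
    keywords: Optional[str] = None,
    seq_type: Optional[str] = None,
    limit: int = 10,
) -> Dict[str, Any]:
    """Build keyword arguments for a nucleotide search, omitting unset values."""
    if not (organism or gene or keywords):
        raise ValueError("No search criteria provided: add organism, gene, or keywords")
    if seq_type is not None and seq_type not in VALID_SEQ_TYPES:
        raise ValueError(f"Unknown seq_type: {seq_type!r}")
    args = {
        "operation": "search",
        "organism": organism,
        "gene": gene,
        "strain": strain,
        "keywords": keywords,
        "seq_type": seq_type,
        "limit": limit,
    }
    return {key: value for key, value in args.items() if value is not None}
```

The result can then be unpacked into the call, for example `tu.tools.NCBI_search_nucleotide(**build_search_args(organism="Homo sapiens", gene="BRCA1", seq_type="mrna"))`.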