# Optimizing ToolUniverse Skills

Best practices for creating high-quality ToolUniverse research skills that produce detailed, evidence-graded reports with proper source attribution.
## When to Use This Skill
Apply when:
- Creating new ToolUniverse research skills
- Reviewing/improving existing skills
- User complains about missing details, noisy results, or unclear reports
- Skill produces process-heavy instead of content-heavy output
- Tools are failing silently or returning empty results
## Core Optimization Principles

### 1. Tool Interface Verification (Pre-flight Check)
Problem: Tool APIs change parameter names over time, or skills are written with incorrect parameter assumptions. This causes silent failures: tools return empty results without errors.
Solution: Verify tool parameters before calling unfamiliar tools:
```python
# Always check tool params to prevent silent failures
tool_info = tu.tools.get_tool_info(tool_name="Reactome_map_uniprot_to_pathways")
# Reveals: takes `id`, not `uniprot_id`
```
Maintain a known corrections table in skills that use many tools:

| Tool | WRONG Parameter | CORRECT Parameter |
|---|---|---|
| Reactome_map_uniprot_to_pathways | `uniprot_id` | `id` |
| GTEx_get_median_gene_expression | unversioned ID only | ID + version |
| | `gencode_id` | `gencodeId` (camelCase) |
Rule: Before calling any tool for the first time in a skill, confirm its params via `get_tool_info` once per tool family, or maintain a vetted param map in the skill.

Why this matters: Retry logic won't help if you're calling a tool with the wrong parameter names; it will consistently return empty. This is different from API flakiness.
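A vetted param map can be enforced mechanically before each call. The sketch below is illustrative, not part of ToolUniverse: the `fix_params` helper and the exact map entries are assumptions mirroring the corrections table above.

```python
# Hypothetical vetted parameter map; entries mirror the corrections table.
PARAM_FIXES = {
    "Reactome_map_uniprot_to_pathways": {"uniprot_id": "id"},
}

def fix_params(tool_name, params):
    """Rename known-wrong parameter names before a tool call."""
    fixes = PARAM_FIXES.get(tool_name, {})
    return {fixes.get(key, key): value for key, value in params.items()}

# A call written with the wrong name is remapped instead of failing silently:
# fix_params("Reactome_map_uniprot_to_pathways", {"uniprot_id": "P12345"})
```

Running the remap once, centrally, keeps every call site honest without sprinkling corrections through the skill.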
### 2. Foundation Data Layer (Path 0)
Problem: Skills query specialized tools for each section independently, missing data that a comprehensive aggregator already has. Results are inconsistent when specialized tools fail.
Solution: Identify if your domain has a comprehensive aggregator and query it FIRST before specialized tools.
Examples by domain:
| Domain | Foundation Source | What It Provides |
|---|---|---|
| Drug targets | Open Targets | Diseases, tractability, safety, drugs, GO, publications, mouse models |
| Chemicals | PubChem | Properties, bioactivity, patents, literature |
| Diseases | Open Targets / OMIM | Genes, drugs, phenotypes, literature |
| Genes | MyGene / Ensembl | Annotations, cross-refs, GO, pathways |
Pattern:
```markdown
## Workflow
Phase 0: Foundation Data (aggregator query)
Phase 1: Disambiguation (ID resolution, collision detection)
Phase 2: Specialized Queries (fill gaps from Phase 0)
Phase 3: Report Synthesis
```
Why this works: The aggregator provides reliable baseline data across multiple sections. Specialized tools then add depth or fill gaps, rather than being the sole source.
### 3. Versioned Identifier Handling

Problem: Some APIs require versioned identifiers (e.g., GTEx needs `ENSG00000123456.12` rather than bare `ENSG00000123456`), while others reject them. Skills fail silently when using the wrong format.
Solution: During ID resolution, capture BOTH versioned and unversioned forms:
```python
ids = {
    'ensembl': 'ENSG00000123456',              # Unversioned (most APIs)
    'ensembl_versioned': 'ENSG00000123456.12'  # Versioned (GTEx, some others)
}
# Get the version from an Ensembl lookup
gene_info = tu.tools.ensembl_lookup_gene(id=ensembl_id, species="human")
if gene_info and gene_info.get('version'):
    ids['ensembl_versioned'] = f"{ensembl_id}.{gene_info['version']}"
```
Fallback strategy:
- Try unversioned first (more portable)
- If empty, try versioned
- Document which format worked
Common versioned ID APIs: GTEx, GENCODE, some Ensembl endpoints
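The fallback strategy above can be sketched as a small helper. Here `fetch` stands in for any tool call that returns an empty result on a miss; the function name and signature are assumptions for illustration:

```python
def query_with_id_fallback(fetch, ids):
    """Try the unversioned ID first (more portable), then the versioned form.

    `fetch` is any callable returning an empty result on a miss. Returns
    (result, format_used) so the report can document which format worked.
    """
    for key in ("ensembl", "ensembl_versioned"):
        gene_id = ids.get(key)
        if not gene_id:
            continue
        result = fetch(gene_id)
        if result:
            return result, key
    return None, None
```

Returning the key that worked makes the "document which format worked" step automatic rather than a thing the skill has to remember.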
### 4. Disambiguation Before Research
Problem: Skills that jump straight to literature search often miss target details or retrieve irrelevant papers due to naming collisions.
Solution: Add a disambiguation phase before any literature search:
```markdown
## Phase 1: Target Disambiguation (Default ON)
### 1.1 Resolve Official Identifiers
- UniProt accession (canonical protein)
- Ensembl gene ID + version (for expression data)
- NCBI Gene ID (for literature)
- ChEMBL target ID (for drug data)
### 1.2 Gather Synonyms and Aliases
- All known gene symbols
- Protein name variants
- Historical names
### 1.3 Detect Naming Collisions
- Search "[SYMBOL]"[Title] - review top 20 results
- If >20% off-topic → identify collision terms
- Build negative filter: NOT [collision1] NOT [collision2]
### 1.4 Get Baseline Profile (from annotation DBs, not literature)
- Protein domains (InterPro)
- Subcellular location (HPA)
- Tissue expression (GTEx)
- GO terms and pathways
```
Why this works: Annotation databases provide reliable baseline data even when literature is sparse or noisy.
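The collision-detection step (1.3) reduces to two small helpers. This is a minimal sketch: the function names and the 20% threshold default mirror the checklist above, and the PubMed-style `NOT` syntax is the only query grammar assumed.

```python
def detect_collisions(titles, on_topic_terms, threshold=0.2):
    """Flag a naming collision when too many top title hits are off-topic."""
    lowered = [term.lower() for term in on_topic_terms]
    off_topic = [t for t in titles if not any(term in t.lower() for term in lowered)]
    return len(off_topic) / max(len(titles), 1) > threshold, off_topic

def build_filtered_query(base_query, collision_terms):
    """Append negative filters (PubMed-style NOT) for known collision terms."""
    return base_query + "".join(f" NOT {term}" for term in collision_terms)
```

The off-topic titles returned alongside the flag are the raw material for picking collision terms by hand.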
### 5. Report-Only Output (Hide Search Process)
Problem: Users don't want to see "searched 8 databases, found 1,247 papers, deduplicated to 892..."
Solution: Output structure:

| File | Content | When |
|---|---|---|
| `[topic]_report.md` | Narrative findings only | Always (default) |
| `[topic]_bibliography.json` | Full deduplicated papers | Always |
| `methods_appendix.md` | Search methodology | Only if requested |
In the report:
- ✅ DO: "The literature reveals three main therapeutic approaches..."
- ❌ DON'T: "I searched PubMed, OpenAlex, and EuropePMC, finding 342 papers..."
### 6. Evidence Grading
Problem: A review article mention is treated the same as a mechanistic study with direct evidence.
Solution: Apply evidence tiers to every claim:
| Tier | Symbol | Criteria |
|---|---|---|
| T1 | ★★★ | Mechanistic study with direct evidence |
| T2 | ★★☆ | Functional study (knockdown, overexpression) |
| T3 | ★☆☆ | Association (screen hit, GWAS, correlation) |
| T4 | ☆☆☆ | Mention (review, text-mined, peripheral) |
In report:
```markdown
ATP6V1A drives lysosomal acidification [★★★: PMID:12345678] and has been
implicated in cancer progression [★☆☆: PMID:23456789, TCGA expression data].
```
Required locations for evidence grades:
- Executive Summary - key disease claims
- Disease Associations - every disease link
- Key Papers table - evidence tier column
- Recommendations - reference evidence quality
Per-section summary:
```markdown
### Theme: Lysosomal Function (47 papers)
**Evidence Quality**: Strong (32 mechanistic, 11 functional, 4 association)
```
### 7. Quantified Completeness (Not Just Categorical)
Problem: "Include PPIs" is aspirational; reports pass the checklist but are data-thin.
Solution: Define numeric minimums for each section:
| Section | Minimum Data | If Not Met |
|---|---|---|
| PPIs | ≥20 interactors | Explain why fewer + which tools failed |
| Expression | Top 10 tissues with values | Note "limited data" with specific gaps |
| Disease | Top 10 associations with scores | Note if fewer available |
| Variants | All 4 constraint scores (pLI, LOEUF, missense Z, pRec) | Note which unavailable |
| Druggability | All modalities assessed | "No drugs/probes" is valid data |
| Literature | Total + 5-year trend + 3-5 key papers | Note if sparse (<50 papers) |
Why this matters: Quantified minimums make completeness auditing objective and mechanical, not subjective.
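Because the minimums are numeric, the audit itself can be a few lines of code. A minimal sketch, assuming per-section counts are collected during the run; the `MINIMUMS` keys and helper name are illustrative, and the numbers mirror the table above:

```python
# Illustrative numeric minimums, mirroring the table above.
MINIMUMS = {"ppis": 20, "expression_tissues": 10, "disease_associations": 10}

def audit_completeness(counts):
    """Return {section: (found, required)} for every section under its minimum."""
    return {
        section: (counts.get(section, 0), required)
        for section, required in MINIMUMS.items()
        if counts.get(section, 0) < required
    }
```

Any section the audit returns must get an explanation in the report ("why fewer + which tools failed") rather than a silent pass.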
### 8. Mandatory Completeness Checklist
Problem: Reports have inconsistent sections; some topics get skipped entirely.
Solution: Define mandatory sections that MUST exist, even if populated with "Limited evidence" or "Unknown":
```markdown
## Completeness Checklist (ALL Required)
### Identity & Context
- [ ] Official identifiers resolved (all 6 types)
- [ ] Synonyms/aliases documented
- [ ] Naming collisions handled (or "none detected")
### Biology
- [ ] Protein architecture (or "N/A for non-protein")
- [ ] Subcellular localization
- [ ] Expression profile (≥10 tissues with values)
- [ ] Pathway involvement (≥10 pathways)
### Mechanism
- [ ] Core function with evidence grades
- [ ] Model organism data (or "none found")
- [ ] Key assays described
### Disease & Clinical
- [ ] Genetic variants (SNVs and CNVs separated)
- [ ] Constraint scores (all 4, with interpretations)
- [ ] Disease links with evidence grades (≥10 or "limited")
### Druggability
- [ ] Tractability for all modalities
- [ ] Known drugs (or "none")
- [ ] Chemical probes (or "none available")
- [ ] Clinical pipeline (or "none")
### Synthesis (CRITICAL)
- [ ] Research themes (≥3 papers each, or "limited")
- [ ] Open questions/gaps
- [ ] Biological model synthesized
- [ ] Testable hypotheses (≥3)
```
### 9. Aggregated Data Gaps Section
Problem: "No data" notes scattered across 14 sections; users can't quickly see what's missing.
Solution: Add a dedicated Data Gaps & Limitations section that consolidates all gaps:
```markdown
## 15. Data Gaps & Limitations
| Section | Expected Data | Found | Reason | Recommendation |
|---------|---------------|--------|--------|-------------------|
| 6. PPIs | ≥20 interactors | 8 | Novel target, limited studies | Literature review needed |
| 7. Expression | GTEx TPM | None | Versioned ID not recognized | See HPA data |
| 9. Probes | Chemical probes | None | No validated probes exist | Consider tool compound dev |
```
**Recommendations for Data Gaps**:
1. For PPIs: Query BioGRID with broader parameters; check yeast-2-hybrid studies
2. For Expression: Query GEO directly for tissue-specific datasets
Why this matters: Users can quickly assess data quality and know where to look for more information.
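Gap records collected during the run can be rendered into this table mechanically. A sketch, assuming each gap is a dict with the five column keys; the helper name and record shape are illustrative:

```python
def render_gaps_table(gaps):
    """Render collected gap records as the aggregated Data Gaps markdown table."""
    rows = [
        "| Section | Expected Data | Found | Reason | Recommendation |",
        "|---|---|---|---|---|",
    ]
    for gap in gaps:
        rows.append(
            "| {section} | {expected} | {found} | {reason} | {recommendation} |".format(**gap)
        )
    return "\n".join(rows)
```

Because every tool failure and audit shortfall appends one record, the section stays complete without manual bookkeeping.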
### 10. Query Strategy Optimization
Problem: Simple keyword searches retrieve too much noise or miss relevant papers.
Solution: Three-step collision-aware query strategy:
```markdown
## Query Strategy
### Step 1: High-Precision Seeds
Build a mechanistic core set (15-30 papers):
- "[GENE_SYMBOL]"[Title] AND mechanism
- "[FULL_PROTEIN_NAME]"[Title]
- "UniProt:ACCESSION"
### Step 2: Citation Network Expansion
From seeds, expand via citations:
- Forward: PubMed_get_cited_by, EuropePMC_get_citations
- Related: PubMed_get_related
- Backward: EuropePMC_get_references
### Step 3: Collision-Filtered Broad
Apply negative filters for known collisions:
- "TRAG" AND immune NOT plasmid NOT conjugation
- "JAK" AND kinase NOT "just another"
```
Citation-first for sparse targets: When keyword search returns <30 papers, prioritize citation expansion from the few good seeds.
### 11. Tool Failure Handling
Problem: NCBI elink and other APIs can be flaky; skills fail silently.
Solution: Automatic retry with fallback chains:
```markdown
## Failure Handling
### Retry Protocol
Attempt 1 → fails → wait 2s → Attempt 2 → fails → wait 5s → Fallback
### Fallback Chains
| Primary | Fallback 1 | Fallback 2 |
|---------|------------|------------|
| PubMed_get_cited_by | EuropePMC_get_citations | OpenAlex citations |
| PubMed_get_related | SemanticScholar | Keyword search |
| GTEx_* | HPA_* | Note as unavailable |
| Unpaywall | EuropePMC OA flag | OpenAlex is_oa |
| ChEMBL_get_target_activities | GtoPdb_get_target_ligands | OpenTargets drugs |
| intact_get_interactions | STRING_get_protein_interactions | OpenTargets interactions |
### Document Failures
In report: "Expression data unavailable (GTEx API timeout after 3 attempts)"
```
Rule: NEVER silently skip failed tools. Always document in the Data Gaps section.
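The retry protocol and fallback chains can be combined in one driver. A sketch, assuming each chain entry is a (source name, zero-argument callable) pair and that empty results count as failures; note that per principle 1, retries cannot fix wrong parameter names.

```python
import time

def call_with_fallbacks(chain, retries=1, waits=(2, 5), sleep=time.sleep):
    """Walk a fallback chain, retrying each source before moving on.

    Returns (result, source, failures) so every failure can be documented in
    the Data Gaps section instead of being silently skipped.
    """
    failures = []
    for name, call in chain:
        for attempt in range(retries + 1):
            try:
                result = call()
                if result:
                    return result, name, failures
                raise ValueError("empty result")
            except Exception as exc:
                if attempt < retries:
                    sleep(waits[min(attempt, len(waits) - 1)])
                else:
                    failures.append(f"{name}: {exc}")
    return None, None, failures
```

Injecting `sleep` keeps the driver testable; in a skill the default `time.sleep` applies the 2s/5s backoff.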
### 12. Scalable Output Structure
Problem: Reports with 500+ papers become unreadable; users can't find what they need.
Solution: Separate narrative from data:
Narrative report (~20-50 pages max):
- Executive summary
- Key findings by theme
- Top 20-50 papers highlighted
- Conclusions and hypotheses
Bibliography files (unlimited):
- `[topic]_bibliography.json` - Full structured data
- `[topic]_bibliography.csv` - Tabular for filtering
JSON structure:
```json
{
  "pmid": "12345678",
  "doi": "10.1038/xxx",
  "title": "...",
  "evidence_tier": "T1",
  "themes": ["lysosomal_function", "autophagy"],
  "is_core_seed": true,
  "oa_status": "gold"
}
```
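Selecting the narrative's top 20-50 highlight papers from the full bibliography is then a simple filter over this structure. A sketch, assuming records shaped like the JSON above; `top_papers` and its tier ordering are illustrative:

```python
import json

# Small illustrative bibliography in the structure shown above.
RECORDS = json.loads("""[
  {"pmid": "12345678", "evidence_tier": "T1", "themes": ["lysosomal_function"]},
  {"pmid": "23456789", "evidence_tier": "T4", "themes": ["review"]},
  {"pmid": "34567890", "evidence_tier": "T2", "themes": ["autophagy"]}
]""")

def top_papers(records, tiers=("T1", "T2"), limit=50):
    """Select highlight papers for the narrative; the full list stays in JSON."""
    keep = [r for r in records if r.get("evidence_tier") in tiers]
    keep.sort(key=lambda r: tiers.index(r["evidence_tier"]))
    return keep[:limit]
```

The same structure supports any other filter the user asks for (by theme, by `is_core_seed`, by `oa_status`) without touching the narrative report.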
### 13. Synthesis Sections
Problem: Reports describe what was found but don't synthesize into actionable insights.
Solution: Require synthesis sections:
```markdown
## Required Synthesis Sections
### Biological Model (3-5 paragraphs)
Integrate all evidence into a coherent model:
- What does the target do?
- How does it connect to disease?
- What's the key uncertainty?
### Testable Hypotheses (≥3)
| # | Hypothesis | Experiment | Measure | Prediction |
|---|------------|--------------|---------|----------|
| 1 | [Hypothesis] | [Experiment] | [Measure] | [Prediction] |
### Suggested Experiments
Brief description of how to test each hypothesis.
```
## Skill Review Checklist

When reviewing a ToolUniverse skill, check:
- Tool Contract
- Report Quality
- Query Strategy
- Tool Usage
- Completeness
- Output Structure
- User Experience
## Common Anti-Patterns to Fix
### 1. "Search Log" Reports
Bad: "Round 1: Searched PubMed (234 papers), OpenAlex (456 papers)..."
Fix: Keep methodology internal; report findings only
### 2. Missing Disambiguation
Bad: Search "JAK" and get kinase + "just another kinase" papers mixed
Fix: Add collision detection; build negative filters
### 3. No Evidence Grading
Bad: "Multiple studies show..." (which studies? what quality?)
Fix: Apply T1-T4 grades; label each claim
### 4. Empty Sections Omitted
Bad: Skip "Pathogen Involvement" because nothing found
Fix: Include section with "None identified in literature search"
### 5. No Synthesis
Bad: Long list of papers organized by theme
Fix: Add biological model + testable hypotheses
### 6. Monolithic Bibliography
Bad: 200 papers embedded in report narrative
Fix: Top 20-50 in report; full list in JSON/CSV
### 7. Silent Failures
Bad: "Expression data: [blank]" (tool failed, user doesn't know)
Fix: "Expression data unavailable (API timeout); see HPA directly"
### 8. Wrong Tool Parameters (NEW)
Bad: `Reactome_map_uniprot_to_pathways(uniprot_id=...)` returns empty
Fix: Verify params via `get_tool_info`; use the correct param (`id`)
### 9. Missing Versioned IDs (NEW)
Bad: GTEx returns empty for `ENSG00000123456`
Fix: Try the versioned ID `ENSG00000123456.12`; document which format worked
### 10. No Foundation Layer (NEW)
Bad: Query 15 specialized tools independently, miss data when some fail
Fix: Query comprehensive aggregator (e.g., Open Targets) first
### 11. Scattered "No Data" Notes (NEW)
Bad: "No data" in 5 different sections; user doesn't know overall gaps
Fix: Aggregate all gaps in dedicated Data Gaps section with recommendations
### 12. Aspirational Completeness (NEW)
Bad: "Include PPIs" ✓ (but only 3 interactors listed)
Fix: "≥20 PPIs OR explanation why fewer"
## Template: Optimized Skill Structure
```markdown
---
name: [domain]-research
description: [What it does]. Creates detailed report with evidence grading
and mandatory completeness. [When to use triggers].
---
# [Domain] Research Strategy
## When to Use
[Trigger scenarios]
## Workflow
Phase -1: Tool Verification → Phase 0: Foundation Data → Phase 1: Disambiguate → Phase 2: Search → Phase 3: Report
## Phase -1: Tool Verification
[Parameter corrections table for tools used in this skill]
## Phase 0: Foundation Data
[Comprehensive aggregator query - e.g., Open Targets for targets]
## Phase 1: Disambiguation (Default ON)
[ID resolution (versioned + unversioned), collision detection, baseline profile]
## Phase 2: Specialized Queries (Internal)
[Query strategy with collision filters, citation expansion, tool fallbacks]
## Phase 3: Report Synthesis
[Progressive writing, evidence grading, mandatory sections]
## Output Files
- `[topic]_report.md` (narrative, always)
- `[topic]_bibliography.json` (data, always)
- `methods_appendix.md` (only if requested)
## Quantified Minimums
[Specific numbers per section - e.g., ≥20 PPIs, top 10 tissues]
## Completeness Checklist
[ALL required sections with checkboxes]
## Data Gaps Section
[Template for aggregating missing data with recommendations]
## Evidence Grading
[T1-T4 definitions with required locations]
## Tool Reference
[Tools by category with fallback chains and parameter notes]
```
## Quick Fixes for Common Complaints
| User Complaint | Root Cause | Fix |
|---|---|---|
| "Report is too short" | Missing annotation data | Add Phase 1 disambiguation + Phase 0 foundation |
| "Too much noise" | No collision filtering | Add negative query filters |
| "Can't tell what's important" | No evidence grading | Add T1-T4 tiers |
| "Missing sections" | No completeness checklist | Add mandatory sections with minimums |
| "Too long/unreadable" | Monolithic output | Separate narrative from JSON |
| "Just a list of papers" | No synthesis | Add biological model + hypotheses |
| "Shows search process" | Wrong output focus | Report-only; methodology in appendix |
| "Tool failed, no data" | No fallback handling | Add retry + fallback chains |
| "Empty results, no error" | Wrong tool parameters | Add Phase -1 param verification |
| "GTEx returns nothing" | Versioned ID needed | Try the versioned ID format |
| "Data seems incomplete" | No foundation layer | Add Phase 0 with aggregator |
| "Can't tell what's missing" | Scattered gaps | Add Data Gaps section |
## Summary
Seven pillars of optimized ToolUniverse skills:
- Verify tool contracts - Check params via `get_tool_info`; maintain a corrections table
- Foundation first - Query comprehensive aggregators before specialized tools
- Disambiguate carefully - Resolve IDs (versioned + unversioned), detect collisions, get baseline from annotation DBs
- Grade evidence - T1-T4 tiers on all claims; summarize quality per section
- Require quantified completeness - Numeric minimums, not just "include X"
- Report content, not process - Methodology in appendix only if asked; aggregate gaps in one section
- Synthesize - Biological models and testable hypotheses, not just paper lists
Apply these principles to any ToolUniverse research skill for better user experience and actionable output.