Use when implementing data analysis pipelines, statistical tests, or bioinformatics workflows in code (Python/R), particularly for genomics, transcriptomics, proteomics, or other -omics data.
npx skill4agent add dangeles/claude bioinformatician

Receive analysis_plan.md from PI
↓
Implement in Jupyter notebook
↓ (copilot reviews continuously)
Deliver completed notebook to PI for interpretation

`principal-investigator`, `copilot`, `scanpy`, `pydeseq2`, `biopython`, `assets/notebook-structure-template.ipynb`

1. Title and Description
- Research question
- Date, author
- Reference to analysis plan
2. Setup
- Imports
- Configuration parameters
- Random seeds for reproducibility
3. Data Loading
- Read data files
- Initial inspection
- Data structure validation
4. Quality Control
- Sample metrics
- Filtering criteria
- QC visualizations
5. Analysis
- Statistical tests
- Transformations
- Model fitting
6. Visualization
- Main figures
- Supplementary plots
7. Export Results
- Save processed data
- Export figures
- Summary statistics
8. Session Info
- Package versions
- Execution time

## Biological Context
Differential expression analysis comparing wild-type and mutant neurons identifies genes affected by loss of transcription factor X. Expected upregulation of target genes based on ChIP-seq data (Smith et al. 2020).

## Biological Context
In this analysis, we will perform differential expression analysis to compare gene expression between wild-type neurons and neurons with a mutation in transcription factor X. Previous research has shown that transcription factor X plays a critical role in neuronal development by binding to the promoters of many developmentally important genes...

1. Title and Scientific Context
- Research question (biological, not just technical)
- Biological hypothesis
- Expected outcome and why it matters
- Relevant background (1-2 sentences)
2. Setup (code)
- Imports, parameters, seeds
3. Data Loading
- Code: Load data
- Biological description of dataset (markdown):
* What organism/tissue/condition
* What genes/features measured
* What biological question dataset addresses
4. Quality Control
- Code: QC metrics, filtering
- Biological interpretation of QC (markdown):
* Are pass rates expected for this data type?
* Do failed samples have biological meaning?
* Red flags from biological perspective?
5. Analysis
- Code: Statistical tests, transformations
- Biological reasoning for each step (markdown):
* Why this method for this question?
* What biological assumption being tested?
* Positive/negative controls?
6. Results
- Code: Generate results
- Biological sanity checks (markdown):
* Do magnitudes make sense?
* Do directions align with biology?
* Any known biology violated?
7. Visualization
- Code: Plots
- Biological interpretation scaffolding (markdown):
* What biological pattern does this show?
* Is this expected or surprising?
* What follow-up questions does this raise?
8. Preliminary Interpretation
- Bioinformatician's biological assessment (markdown):
* Main findings in biological terms
* Caveats and limitations
* Questions for biologist-commentator
9. Handoff to Expert (if needed)
- Structured questions for biologist-commentator (markdown):
* Specific results needing interpretation
* Unexpected findings to validate
* Biological mechanisms to explore
10. Export (code)
- Save data, figures, session info

## Biological Context
Comparing [condition A] vs [condition B] to identify genes involved in [biological process]. Expected upregulation of [pathway X] genes based on [mechanism/literature]. Positive controls: [gene1, gene2]. Expected log2FC range: [X-Y] based on [citation].
## Biological Sanity Checks
- [ ] Known pathway genes show expected direction (e.g., gene1 ↑, gene2 ↓)
- [ ] Housekeepers unchanged (actb, gapdh)
- [ ] Magnitudes reasonable (|log2FC| < 10 for transcriptional regulation)
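
The checklist above can also be evaluated in code before the notebook is handed over. The following is a minimal sketch, assuming results are available as a plain mapping from gene name to log2 fold change. The function name `run_sanity_checks`, the housekeeper tolerance of 1.0, and the example values are hypothetical and only illustrate the idea.

```python
def run_sanity_checks(log2fc, expected_direction, housekeepers,
                      max_abs_lfc=10.0, housekeeper_tolerance=1.0):
    """Return a dict mapping each check name to the list of genes failing it.

    log2fc: dict of gene -> log2 fold change
    expected_direction: dict of gene -> "up" or "down" (positive controls)
    housekeepers: genes expected to stay roughly unchanged
    Genes missing from log2fc are skipped rather than reported.
    """
    def matches(lfc, direction):
        return lfc > 0 if direction == "up" else lfc < 0

    wrong_direction = [
        gene for gene, direction in expected_direction.items()
        if gene in log2fc and not matches(log2fc[gene], direction)
    ]
    changed_housekeepers = [
        gene for gene in housekeepers
        if gene in log2fc and abs(log2fc[gene]) > housekeeper_tolerance
    ]
    extreme_magnitude = [
        gene for gene, lfc in log2fc.items() if abs(lfc) >= max_abs_lfc
    ]
    return {
        "wrong_direction": wrong_direction,
        "changed_housekeepers": changed_housekeepers,
        "extreme_magnitude": extreme_magnitude,
    }


# Hypothetical values for illustration only
results = {"gene1": 2.3, "gene2": -1.7, "actb": 0.1, "gapdh": -0.2, "geneW": -12.0}
report = run_sanity_checks(
    results,
    expected_direction={"gene1": "up", "gene2": "down"},
    housekeepers=["actb", "gapdh"],
)
for check, failing in report.items():
    status = "ok" if not failing else "REVIEW: " + ", ".join(failing)
    print(f"{check}: {status}")
```

Any gene listed under a failing check is a candidate for the handoff section rather than an automatic error.
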
## Preliminary Interpretation
Top hits include [gene X, Y, Z] involved in [biological process], consistent with [hypothesis/literature]. [Gene W] unexpected - requires expert validation.
**Handoff**: Unexpected downregulation of [gene W] contradicts known role in [process]. Biologist-commentator needed for mechanism assessment.

## Biological Context
Clustering [tissue] cells to identify cell types. Expected populations: [celltype1 (markers: a,b,c), celltype2 (markers: d,e,f)]. Reference atlas: [citation if available].
## Cluster Validation
- Cluster 1: [celltype] - markers: [genes] ✓
- Cluster 2: [celltype] - markers: [genes] ✓
- Cluster 3: Novel population - markers: [genes] - needs expert review
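
A first pass at this kind of cluster validation can be automated by comparing each cluster's marker genes with the expected markers. The sketch below is one simple way to do it, not a standard method: it scores a cluster by the fraction of a cell type's reference markers it contains. The function name `match_cluster_to_cell_type`, the 0.5 cutoff, and the placeholder marker names are assumptions.

```python
def match_cluster_to_cell_type(cluster_markers, reference_markers, min_fraction=0.5):
    """Return (cell_type, fraction) for the best-matching reference cell type.

    fraction is the share of that cell type's reference markers found among
    the cluster's markers. cell_type is None when no reference reaches
    min_fraction, which flags the cluster for expert review.
    """
    cluster_set = set(cluster_markers)
    best_type, best_fraction = None, 0.0
    for cell_type, markers in reference_markers.items():
        marker_set = set(markers)
        if not marker_set:
            continue  # nothing to compare against
        fraction = len(cluster_set & marker_set) / len(marker_set)
        if fraction > best_fraction:
            best_type, best_fraction = cell_type, fraction
    if best_fraction < min_fraction:
        return None, best_fraction
    return best_type, best_fraction


# Placeholder marker names, mirroring the template above
REFERENCE = {"celltype1": ["a", "b", "c"], "celltype2": ["d", "e", "f"]}
CLUSTERS = {
    "Cluster 1": ["a", "b", "c", "q"],
    "Cluster 2": ["d", "e", "r"],
    "Cluster 3": ["x", "y", "a"],
}

for name, markers in CLUSTERS.items():
    cell_type, fraction = match_cluster_to_cell_type(markers, REFERENCE)
    label = cell_type or "novel population - needs expert review"
    print(f"{name}: {label} ({fraction:.0%} of reference markers found)")
```

A cluster that falls below the cutoff is not necessarily novel; it only means the automated match is too weak to trust without review.
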
**Handoff**: Cluster 3 shows unexpected marker combination [X+Y+Z-]. Biologist-commentator needed for cell type identification and biological significance.

## Expert Interpretation Needed
**Finding**: [Specific result with statistics]
**Context**: [1-2 sentence background]
**Issue**: [What's unexpected/unclear and why]
**Question**: [Specific question for expert]
**Validation Done**: [Positive controls: ✓/✗, Literature: consistent/contradicts]

## Expert Interpretation Needed
**Finding**: Gene X shows 8-fold upregulation (padj<0.001) in mutant vs WT
**Context**: Gene X is a transcriptional repressor; its targets are expected to be downregulated
**Issue**: Target genes also upregulated (contradicts repressor function)
**Question**: Alternative mechanism? Post-transcriptional regulation? Data artifact?
**Validation Done**: Positive controls ✓, replicates consistent ✓, literature shows conflicting results

Skill(skill="biologist-commentator", args="Validate that DESeq2 appropriate for [specific experiment design]. Confirm controls adequate and confounders addressed.")

Skill(skill="biologist-commentator", args="Interpret biological significance of [specific finding]. Results show [X], which is [expected/unexpected]. Known biology suggests [Y]. Please validate interpretation and suggest mechanisms.")

`assets/analysis-checklist.md`

# %%
# Computational Environment
import sys
import numpy as np
import pandas as pd
import scanpy as sc # or relevant packages
print("=" * 60)
print("COMPUTATIONAL ENVIRONMENT")
print("=" * 60)
print(f"Python: {sys.version}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scanpy: {sc.__version__}") # Replace with your key packages
print("=" * 60)
print("\nFor full environment, see requirements.txt")

# For micromamba users (recommended for bioinformatics):
# Export micromamba packages:
micromamba env export > environment.yml
# Export pip-installed packages separately (micromamba export does not include pip packages):
pip freeze > pip-requirements.txt
# For pip users:
pip freeze > requirements.txt
# Document which file to use in notebook

## Computational Environment
- **Kernel**: Python 3.11 (bio-analysis-env)
- **Environment file**: `environment.yml` (recreate with `micromamba env create -f environment.yml`)
- **Key packages**: scanpy==1.10.0, numpy==1.26.3, pandas==2.2.0, scipy==1.12.0
- **Execution date**: 2026-01-29

# %%
# Random seeds for reproducibility
import numpy as np
import random
RANDOM_SEED = 42 # Document choice (convention, replicating published analysis, etc.)
# Core Python/NumPy
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
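# Scanpy (single-cell analysis): scanpy.settings has no global seed option;
# pass the seed to each stochastic call instead,
# e.g. sc.tl.umap(adata, random_state=RANDOM_SEED)
import scanpy as sc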
# PyTorch (if using deep learning)
import torch
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)
# TensorFlow (if using)
import tensorflow as tf
tf.random.set_seed(RANDOM_SEED)
print(f"Random seed set to {RANDOM_SEED} for reproducibility")

## Stochastic Operations
This analysis uses:
- UMAP (random initialization, seed=42)
- Leiden clustering (randomized optimization, seed=42)
- 1000-iteration permutation test (seed=42)
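
For the permutation test, a local random generator keeps the result reproducible without depending on global seed state. The sketch below uses only the standard library and assumes a two-sided test on the difference in group means; the function name `permutation_test` and the example values are hypothetical.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=1000, seed=42):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)  # local generator: global random state is untouched
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Hypothetical expression values for illustration only
control = [5.1, 4.8, 5.3, 5.0, 4.9]
treated = [6.2, 6.0, 5.9, 6.4, 6.1]

p_first = permutation_test(control, treated)
p_second = permutation_test(control, treated)
print(f"p = {p_first:.4f}; identical on rerun: {p_first == p_second}")
```

Because the generator is created inside the function, rerunning the cell or reordering other cells does not change the p-value.
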
All seeds set to 42 for reproducibility.

# %%
# Session Information for Reproducibility
import session_info
session_info.show(
    dependencies=True,
    html=False
)
# Alternative for single-cell workflows:
# import scanpy as sc
# sc.logging.print_versions()
# Alternative for base Python:
# import sys
# from importlib.metadata import version  # pkg_resources is deprecated
# print(f"Python: {sys.version}")
# for pkg in ['numpy', 'pandas', 'scipy', 'matplotlib', 'seaborn']:
#     print(f"{pkg}: {version(pkg)}")

# %%
from pathlib import Path
# Define all paths at top of notebook
DATA_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")
RESULTS_DIR = Path("results/analysis_2026-01-29")
FIGURES_DIR = RESULTS_DIR / "figures"
# Create output directories
for directory in [PROCESSED_DIR, RESULTS_DIR, FIGURES_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
# Use variables throughout
counts_file = DATA_DIR / "counts_matrix.h5ad"
metadata_file = DATA_DIR / "sample_metadata.csv"
output_file = PROCESSED_DIR / "normalized_counts.h5ad"
figure_file = FIGURES_DIR / "umap_clusters.pdf"
print(f"Data directory: {DATA_DIR.resolve()}")
print(f"Results directory: {RESULTS_DIR.resolve()}")

# ❌ BAD (non-reproducible):
adata = sc.read_h5ad("/Users/yourname/project/data/counts.h5ad")
plt.savefig("/Users/yourname/Desktop/figure.pdf")
# ✅ GOOD (reproducible):
adata = sc.read_h5ad(DATA_DIR / "counts.h5ad")
plt.savefig(FIGURES_DIR / "umap_clusters.pdf")

## Data Sources
### Input Data
- **File**: `data/raw/GSE123456_counts.h5ad`
- **Source**: GEO accession GSE123456
- **Download date**: 2026-01-15
- **Download command**: `wget https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE123456`
- **Original publication**: Smith et al. (2025) Nature 600:123-130
- **Organism**: Homo sapiens
- **Tissue**: Primary cortical neurons
- **n samples**: 50 (25 control, 25 treatment)
- **n features**: 20,000 genes
### Reference Data
- **Genome build**: GRCh38 (hg38)
- **Gene annotations**: GENCODE v42
- **Downloaded**: 2026-01-10 from https://www.gencodegenes.org/

`environment.yml`, `requirements.txt`, `FIGURES_DIR`, `PROCESSED_DIR`, `notebook-writer`

from pathlib import Path
# Use notebook-writer to create template
cells = [
    {'type': 'markdown', 'content': '## Computational Environment\n...'},
    {'type': 'code', 'content': 'import sys\nprint(f"Python: {sys.version}")'},
    {'type': 'markdown', 'content': '## Data Loading\n...'},
    # ... analysis cells ...
    {'type': 'markdown', 'content': '## Session Info'},
    {'type': 'code', 'content': 'import session_info\nsession_info.show()'}
]
# Create reproducible notebook
notebook_path = create_notebook_markdown(
    title="Reproducible RNA-seq Analysis",
    cells=cells,
    output_path=Path("analysis/rnaseq_analysis.md")
)

| Issue | Problem | Fix |
|---|---|---|
| Different results on rerun | No random seed set | Set seeds for numpy, random, scanpy, torch |
| Import errors | Missing package versions | Create |
| File not found | Hardcoded paths | Use Path variables defined at top |
| Old package behavior | Package version mismatch | Document versions with |
| Data source vanished | URL changed or removed | Document download date, accession, mirror sites |
| Genome coordinate mismatch | Different genome build | Specify build (GRCh38 vs GRCh37) in notebook |
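
The "package version mismatch" row can be checked at the top of a notebook rather than discovered after a failed rerun. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `find_version_mismatches` and the pinned versions (copied from the example environment listing) are assumptions, not part of the skill.

```python
from importlib.metadata import PackageNotFoundError, version


def find_version_mismatches(pinned, lookup=version):
    """Return {package: (expected, found)} for every pin that does not match.

    found is None when the package is not installed. lookup is injectable so
    the check can be exercised without the packages being present.
    """
    mismatches = {}
    for package, expected in pinned.items():
        try:
            found = lookup(package)
        except PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


# Hypothetical pins for illustration
PINNED = {"scanpy": "1.10.0", "numpy": "1.26.3", "pandas": "2.2.0"}

for package, (expected, found) in find_version_mismatches(PINNED).items():
    print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

An empty result means the running kernel matches the documented versions; anything printed should be resolved or recorded before the analysis proceeds.
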
# Document in code cell
ORGANISM = "Homo sapiens"
GENOME_BUILD = "GRCh38" # or "mm39" for mouse, "dm6" for fly, etc.
ANNOTATION_VERSION = "GENCODE v42" # or "Ensembl 110"
ANNOTATION_DATE = "2026-01-10"
print("Analysis configuration:")
print(f" Organism: {ORGANISM}")
print(f" Genome: {GENOME_BUILD}")
print(f" Annotations: {ANNOTATION_VERSION} ({ANNOTATION_DATE})")

## External Tools
- **STAR aligner**: v2.7.11a (for read mapping)
- **MACS2**: v2.2.9.1 (for peak calling)
- **bedtools**: v2.31.0 (for interval operations)
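
Tool versions like these can be captured from the running environment instead of typed by hand. The sketch below assumes each tool is on `PATH` under the executable names `STAR`, `macs2`, and `bedtools` and accepts a `--version` flag; the helper name `tool_version` is hypothetical.

```python
import shutil
import subprocess


def tool_version(executable, flag="--version"):
    """Return the first line the tool prints for flag, or None if not on PATH."""
    if shutil.which(executable) is None:
        return None
    completed = subprocess.run(
        [executable, flag], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr rather than stdout
    output = (completed.stdout or completed.stderr).strip()
    return output.splitlines()[0] if output else ""


# Executable names are assumptions; adjust to the local installation
for tool in ["STAR", "macs2", "bedtools"]:
    found = tool_version(tool)
    print(f"{tool}: {found if found is not None else 'not found on PATH'}")
```

Printing "not found on PATH" instead of raising keeps the cell usable on machines where only part of the toolchain is installed.
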
All tools available in micromamba environment (see environment.yml).

# Document all filtering/QC thresholds
QC_PARAMS = {
    'min_genes_per_cell': 200,
    'min_cells_per_gene': 3,
    'max_pct_mt': 15,  # percent mitochondrial reads
    'min_counts': 1000,
    'highly_variable_genes': 2000,
    'n_pcs': 50,  # principal components
    'umap_neighbors': 15,
    'leiden_resolution': 0.8
}
print("Quality control parameters:")
for param, value in QC_PARAMS.items():
    print(f" {param}: {value}")

# 1. Load counts
# 2. Filter low-abundance genes
# 3. Normalize (DESeq2, TMM, or library size)
# 4. Statistical test (DESeq2, edgeR, limma)
# 5. Multiple testing correction
# 6. Volcano plot + heatmap

`pydeseq2`

# 1. Load AnnData object
# 2. QC filtering (cells and genes)
# 3. Normalization and log-transform
# 4. Feature selection (highly variable genes)
# 5. Dimensionality reduction (PCA, UMAP)
# 6. Clustering
# 7. Marker gene identification
# 8. Visualization

`scanpy`

# 1. Read FASTA/FASTQ
# 2. Quality filtering
# 3. Alignment or motif search
# 4. Feature extraction
# 5. Statistical summary

`biopython`

`references/analysis_workflows.md`, `references/data_structures.md`, `references/statistical_methods.md`, `references/visualization_best_practices.md`, `scripts/qc_pipeline.py`, `differential_expression_template.py`, `data_loader_helpers.py`

| Data Type | Primary Skill | When to Use |
|---|---|---|
| Single-cell RNA-seq | | Cell type identification, clustering, trajectory |
| Bulk RNA-seq | | Differential gene expression |
| Sequences | | Alignment, motif search, format conversion |
| Statistical modeling | | Regression, time series, GLMs |
| Pathway analysis | | Gene set enrichment |

`bioinformatician`, `copilot`