bioinformatician
Bioinformatician Skill
Purpose
Implement computational analyses of biological data, including:
- Data loading and quality control
- Statistical analysis
- Bioinformatics pipelines
- Visualization
- Integration with domain-specific tools
When to Use This Skill
Use this skill when you need to:
- Implement an analysis plan in code (from PI)
- Process genomics/transcriptomics/proteomics data
- Perform statistical tests on biological data
- Create publication-quality visualizations
- Build reproducible analysis pipelines
- Integrate multiple bioinformatics tools
Workflow Integration
Primary Pattern: Receive Plan → Implement → Deliver Notebook
Receive analysis_plan.md from PI
↓
Implement in Jupyter notebook
↓ (copilot reviews continuously)
Deliver completed notebook to PI for interpretation
Integration Points:
- RECEIVES: Analysis plan from `principal-investigator`
- WORKS WITH: `copilot` (adversarial code review during implementation)
- CALLS: Domain-specific skills (`scanpy`, `pydeseq2`, `biopython`, etc.)
- OUTPUTS: Jupyter notebooks with analysis code + results
Core Capabilities
1. Data Loading and Validation
- Read common formats (CSV, TSV, HDF5, Parquet, FASTQ, BAM, VCF)
- Validate data integrity and format
- Handle compressed files
- Memory-efficient loading for large datasets
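As one illustration of the loading points above, the following is a minimal sketch of streaming a delimited file row by row, with transparent gzip handling, so the whole table never has to sit in memory. The helper name `iter_table_rows` and the demo file are hypothetical; real analyses would more likely use pandas or anndata readers.

```python
import csv
import gzip
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def iter_table_rows(path: Path, delimiter: str = "\t") -> Iterator[Dict[str, str]]:
    """Yield rows of a delimited text file one at a time (gzip-aware)."""
    # Pick the opener from the file extension; both accept text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter=delimiter)


# Tiny demonstration on a temporary gzip-compressed TSV (made-up values).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "counts.tsv.gz"
    with gzip.open(demo_path, "wt", newline="") as out:
        out.write("gene\tsample_a\tsample_b\n")
        out.write("geneA\t10\t0\n")
        out.write("geneB\t3\t7\n")
    rows = list(iter_table_rows(demo_path))

print(rows)
```

Values arrive as strings; convert them to numbers at the point of use so that malformed entries fail loudly during validation.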
2. Quality Control
- Sample quality metrics
- Outlier detection
- Batch effect assessment
- Positive/negative control validation
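Outlier detection on a per-sample QC metric can be sketched with a median absolute deviation (MAD) rule. This is one common choice, not necessarily the method this skill prescribes; the function name, the 3-MAD default, and the sample values are illustrative assumptions.

```python
from statistics import median
from typing import Dict, List


def flag_outliers_mad(values: Dict[str, float], n_mads: float = 3.0) -> List[str]:
    """Return sample names lying more than n_mads scaled MADs from the median."""
    center = median(values.values())
    mad = median(abs(v - center) for v in values.values())
    if mad == 0:
        return []  # no spread to measure against
    scaled_mad = 1.4826 * mad  # comparable to a standard deviation under normality
    return [name for name, v in values.items() if abs(v - center) > n_mads * scaled_mad]


# Made-up "genes detected per sample" values for illustration.
genes_detected = {"s1": 5200, "s2": 5100, "s3": 4900, "s4": 5000, "s5": 1200}
outliers = flag_outliers_mad(genes_detected)
print(outliers)
```

A flagged sample is a prompt for inspection, not automatic removal: a low-complexity sample may reflect real biology rather than a technical failure.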
3. Statistical Analysis
- Differential expression/abundance
- Enrichment analysis
- Clustering and dimensionality reduction
- Correlation and regression
- Multiple testing correction
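For multiple testing correction, the Benjamini-Hochberg procedure is a widely used default. The sketch below is a dependency-free version for illustration, assuming a plain list of p-values; in practice a library implementation (for example in statsmodels) would normally be used.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(padj)
```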
4. Visualization
- Publication-quality plots (matplotlib, seaborn, plotly)
- Interactive visualizations
- Consistent styling
- Proper labeling and legends
5. Pipeline Development
- Modular, reusable code
- Parameter documentation
- Progress logging
- Error handling
Standard Notebook Structure
Use the template in `assets/notebook-structure-template.ipynb`:
1. Title and Description
- Research question
- Date, author
- Reference to analysis plan
2. Setup
- Imports
- Configuration parameters
- Random seeds for reproducibility
3. Data Loading
- Read data files
- Initial inspection
- Data structure validation
4. Quality Control
- Sample metrics
- Filtering criteria
- QC visualizations
5. Analysis
- Statistical tests
- Transformations
- Model fitting
6. Visualization
- Main figures
- Supplementary plots
7. Export Results
- Save processed data
- Export figures
- Summary statistics
8. Session Info
- Package versions
- Execution time
Biological Literacy Framework
Writing Style for Biological Context
All biological context in notebooks should follow concise scientific prose:
Principles:
- ✅ Brief: 1-3 sentences per section, not paragraphs
- ✅ Clear: Use precise biological terminology
- ✅ Factual: State what/why without excessive detail
- ✅ Publication-ready: Like Methods/Results sections in papers
Example - Good (Concise):
```markdown
## Biological Context
Differential expression analysis comparing wild-type and mutant neurons identifies genes affected by loss of transcription factor X. Expected upregulation of target genes based on ChIP-seq data (Smith et al. 2020).
```
**Example - Avoid (Too Verbose)**:
```markdown
## Biological Context
In this analysis, we will perform differential expression analysis to compare gene expression between wild-type neurons and neurons with a mutation in transcription factor X. Previous research has shown that transcription factor X plays a critical role in neuronal development by binding to the promoters of many developmentally important genes...
```
When to Provide Interpretation vs Handoff
Bioinformatician Handles (routine interpretation):
- Standard results following known biology
- Positive/negative controls behaving as expected
- Results matching literature precedents
- Technical QC assessments with biological implications
- Magnitude/direction sanity checks
Handoff to Biologist-Commentator (expert needed):
- Novel or unexpected findings
- Results contradicting established biology
- Unclear biological mechanisms
- Publication-critical interpretations
- Proposing new hypotheses or models
Enhanced Notebook Structure
Use this structure for biologically-literate notebooks:
1. Title and Scientific Context
- Research question (biological, not just technical)
- Biological hypothesis
- Expected outcome and why it matters
- Relevant background (1-2 sentences)
2. Setup (code)
- Imports, parameters, seeds
3. Data Loading
- Code: Load data
- Biological description of dataset (markdown):
* What organism/tissue/condition
* What genes/features measured
* What biological question dataset addresses
4. Quality Control
- Code: QC metrics, filtering
- Biological interpretation of QC (markdown):
* Are pass rates expected for this data type?
* Do failed samples have biological meaning?
* Red flags from biological perspective?
5. Analysis
- Code: Statistical tests, transformations
- Biological reasoning for each step (markdown):
* Why this method for this question?
* What biological assumption being tested?
* Positive/negative controls?
6. Results
- Code: Generate results
- Biological sanity checks (markdown):
* Do magnitudes make sense?
* Do directions align with biology?
* Any known biology violated?
7. Visualization
- Code: Plots
- Biological interpretation scaffolding (markdown):
* What biological pattern does this show?
* Is this expected or surprising?
* What follow-up questions does this raise?
8. Preliminary Interpretation
- Bioinformatician's biological assessment (markdown):
* Main findings in biological terms
* Caveats and limitations
* Questions for biologist-commentator
9. Handoff to Expert (if needed)
- Structured questions for biologist-commentator (markdown):
* Specific results needing interpretation
* Unexpected findings to validate
* Biological mechanisms to explore
10. Export (code)
- Save data, figures, session info
Biological Sanity Check Framework
Run these checks before accepting results:
Expression/Abundance Checks
- Order of magnitude reasonable? (|log2FC| > 10 is suspicious)
- Direction matches known biology? (check a few known genes)
- Positive controls behave as expected?
- Negative controls show no signal?
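The magnitude and direction checks above can be partly automated. The following is a minimal sketch, assuming results are available as a gene-to-log2FC mapping and that a few control genes with known direction have been chosen in advance; the function name, gene names, and values are hypothetical.

```python
from typing import Dict, List


def check_log2fc(
    log2fc: Dict[str, float],
    expected_direction: Dict[str, str],
    max_abs_log2fc: float = 10.0,
) -> List[str]:
    """Return human-readable warnings; an empty list means all checks passed."""
    warnings = []
    # Magnitude check: very large fold changes often indicate near-zero counts.
    for gene, value in log2fc.items():
        if abs(value) > max_abs_log2fc:
            warnings.append(f"{gene}: |log2FC| = {abs(value):.1f} exceeds {max_abs_log2fc}")
    # Direction check against control genes with known behavior ("up" or "down").
    for gene, direction in expected_direction.items():
        if gene not in log2fc:
            warnings.append(f"{gene}: control gene missing from results")
        elif (log2fc[gene] > 0) != (direction == "up"):
            warnings.append(f"{gene}: expected {direction}, observed log2FC = {log2fc[gene]:.2f}")
    return warnings


# Made-up results for illustration.
results = {"geneA": 2.1, "geneB": -1.4, "geneC": 12.5}
controls = {"geneA": "up", "geneB": "up"}
for message in check_log2fc(results, controls):
    print("WARNING:", message)
```

Any warning belongs in the notebook next to the result it concerns, so the reviewer sees it in context.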
Statistical Checks with Biological Lens
- Top hits include known biology? (literature validation)
- Results robust to threshold changes?
- Batch effects vs real biology separated?
- Multiple testing appropriate for biology? (discovery vs validation)
Genomics-Specific
- Chromosome names consistent? (chr1 vs 1)
- Coordinates sensible? (within chromosome bounds)
- Strand orientation correct for gene features?
- Genome build consistent throughout?
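Two of the genomics checks above, naming consistency and coordinate bounds, lend themselves to a short script. This sketch assumes intervals are 0-based half-open tuples and that chromosome sizes come from the genome build in use; the sizes and intervals here are invented for illustration, not real assembly lengths.

```python
from typing import Dict, Iterable, List, Set, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def naming_styles(chromosomes: Iterable[str]) -> Set[str]:
    """Report which naming styles appear: 'chr-prefixed' and/or 'bare'."""
    return {"chr-prefixed" if c.startswith("chr") else "bare" for c in chromosomes}


def out_of_bounds(intervals: Iterable[Interval], chrom_sizes: Dict[str, int]) -> List[Interval]:
    """Return intervals that are malformed or not contained in their chromosome."""
    bad = []
    for chrom, start, end in intervals:
        size = chrom_sizes.get(chrom)  # None if the chromosome name is unknown
        if size is None or start < 0 or end > size or start >= end:
            bad.append((chrom, start, end))
    return bad


# Invented sizes and intervals for illustration only.
sizes = {"chr1": 1000, "chr2": 800}
peaks = [("chr1", 10, 50), ("chr2", 790, 900), ("1", 5, 20)]
styles = naming_styles(chrom for chrom, _, _ in peaks)
problems = out_of_bounds(peaks, sizes)
print(styles, problems)
```

More than one naming style in a single file usually means inputs from different sources were merged without harmonization.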
Experimental Design
- Sample size adequate for this effect size?
- Replicates biological or technical?
- Confounders identified and addressed?
- Controls appropriate for this experiment type?
If any check fails: Document in notebook, flag for biologist-commentator review
Biological Context Templates
Template: Differential Expression Analysis
```markdown
## Biological Context
Comparing [condition A] vs [condition B] to identify genes involved in [biological process]. Expected upregulation of [pathway X] genes based on [mechanism/literature]. Positive controls: [gene1, gene2]. Expected log2FC range: [X-Y] based on [citation].

## Biological Sanity Checks
- Known pathway genes show expected direction (e.g., gene1 ↑, gene2 ↓)
- Housekeepers unchanged (actb, gapdh)
- Magnitudes reasonable (|log2FC| < 10 for transcriptional regulation)

## Preliminary Interpretation
Top hits include [gene X, Y, Z] involved in [biological process], consistent with [hypothesis/literature]. [Gene W] unexpected - requires expert validation.
Handoff: Unexpected downregulation of [gene W] contradicts known role in [process]. Biologist-commentator needed for mechanism assessment.
```
Template: Single-Cell Clustering
```markdown
## Biological Context
Clustering [tissue] cells to identify cell types. Expected populations: [celltype1 (markers: a,b,c), celltype2 (markers: d,e,f)]. Reference atlas: [citation if available].

## Cluster Validation
- Cluster 1: [celltype] - markers: [genes] ✓
- Cluster 2: [celltype] - markers: [genes] ✓
- Cluster 3: Novel population - markers: [genes] - needs expert review
Handoff: Cluster 3 shows unexpected marker combination [X+Y+Z-]. Biologist-commentator needed for cell type identification and biological significance.
```
Template: Expert Handoff Format
Use this concise format when escalating to biologist-commentator:
```markdown
## Expert Interpretation Needed
Finding: [Specific result with statistics]
Context: [1-2 sentence background]
Issue: [What's unexpected/unclear and why]
Question: [Specific question for expert]
Validation Done: [Positive controls: ✓/✗, Literature: consistent/contradicts]
```
**Example**:
```markdown
## Expert Interpretation Needed
Finding: Gene X shows 8-fold upregulation (padj<0.001) in mutant vs WT
Context: Gene X is a transcriptional repressor, expected downregulation of targets
Issue: Target genes also upregulated (contradicts repressor function)
Question: Alternative mechanism? Post-transcriptional regulation? Data artifact?
Validation Done: Positive controls ✓, replicates consistent ✓, literature shows conflicting results
```
Biologist-Commentator Integration Pattern
When to Invoke Biologist-Commentator
Pre-Analysis (Method Validation):
```python
Skill(skill="biologist-commentator", args="Validate that DESeq2 is appropriate for [specific experiment design]. Confirm controls are adequate and confounders are addressed.")
```
During Analysis (Quick Check):
- Use biological sanity check framework (above)
- Document any red flags
- Continue if checks pass, escalate if fail
Post-Analysis (Expert Interpretation):
```python
Skill(skill="biologist-commentator", args="Interpret biological significance of [specific finding]. Results show [X], which is [expected/unexpected]. Known biology suggests [Y]. Please validate interpretation and suggest mechanisms.")
```
Handoff Workflow
- Bioinformatician: Run analysis, perform sanity checks, document findings
- Handoff: Create structured handoff section in notebook (see template above)
- Biologist-Commentator: Provides expert interpretation, mechanism insights, validation
- Bioinformatician: Incorporate interpretation into notebook, flag needed validations
Pre-Flight Checklist
Before starting implementation, verify:
- Analysis plan clearly defines objectives
- Data files exist and paths are correct
- Required packages installed
- Expected output format understood
- Random seeds set for reproducibility
Use `assets/analysis-checklist.md` for the complete list.
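Two items on the pre-flight list, file existence and package availability, can be verified in a few lines before any analysis code runs. This is a minimal sketch; the function name and the example file and package names are hypothetical, and only top-level package names are supported.

```python
import importlib.util
from pathlib import Path
from typing import Iterable, List


def preflight_problems(data_files: Iterable[Path], packages: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = [f"missing file: {p}" for p in data_files if not Path(p).is_file()]
    # find_spec returns None for a top-level package that cannot be imported.
    problems += [
        f"package not importable: {name}"
        for name in packages
        if importlib.util.find_spec(name) is None
    ]
    return problems


# Hypothetical inputs: one file that does not exist, one real and one fake package.
issues = preflight_problems(
    [Path("definitely_missing_counts.h5ad")],
    ["json", "not_a_real_package_xyz"],
)
for issue in issues:
    print("PRE-FLIGHT:", issue)
```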
Reproducibility Standards
Critical: Every bioinformatics analysis must be fully reproducible. Another researcher should be able to recreate your computational environment and obtain identical results.
Environment Documentation (Mandatory)
Start every notebook with environment documentation:
```python
# %%
# Computational Environment
import sys
import numpy as np
import pandas as pd
import scanpy as sc  # or relevant packages

print("=" * 60)
print("COMPUTATIONAL ENVIRONMENT")
print("=" * 60)
print(f"Python: {sys.version}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scanpy: {sc.__version__}")  # Replace with your key packages
print("=" * 60)
print("\nFor full environment, see requirements.txt")
```
**Create environment files before starting analysis:**
```bash
# For micromamba users (recommended for bioinformatics):
# Export micromamba packages:
micromamba env export > environment.yml
# Export pip-installed packages separately (micromamba export does not include pip packages):
pip freeze > pip-requirements.txt

# For pip users:
pip freeze > requirements.txt

# Document which file to use in notebook
```
**In notebook markdown cell:**
```markdown
## Computational Environment
- Kernel: Python 3.11 (bio-analysis-env)
- Environment file: `environment.yml` (recreate with `micromamba env create -f environment.yml`)
- Key packages: scanpy==1.10.0, numpy==1.26.3, pandas==2.2.0, scipy==1.12.0
- Execution date: 2026-01-29
```
Random Seed Setting (Mandatory for Stochastic Processes)
Set seeds in setup cell:
```python
# %%
# Random seeds for reproducibility
import numpy as np
import random

RANDOM_SEED = 42  # Document choice (convention, replicating published analysis, etc.)

# Core Python/NumPy
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Scanpy (single-cell analysis)
import scanpy as sc
sc.settings.seed = RANDOM_SEED
# Scanpy's stochastic functions also take a random_state argument;
# pass it explicitly, e.g. sc.tl.umap(adata, random_state=RANDOM_SEED)

# PyTorch (if using deep learning)
import torch
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)

# TensorFlow (if using)
import tensorflow as tf
tf.random.set_seed(RANDOM_SEED)

print(f"Random seed set to {RANDOM_SEED} for reproducibility")
```
**Bioinformatics operations requiring seeds:**
- **Dimensionality reduction**: UMAP, t-SNE, PCA with randomized SVD
- **Clustering**: Leiden, Louvain (graph-based)
- **Sampling**: Random subsampling, bootstrap, cross-validation
- **Imputation**: Stochastic imputation methods
- **Simulation**: Monte Carlo, permutation tests
- **Machine learning**: Random forests, neural networks, k-means initialization
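To show why seeding matters for one of the operations listed above, here is a minimal sketch of a permutation test on a difference in group means, using a local seeded generator so that repeated calls return the same p-value. The function name and data are illustrative; production analyses would typically use a vectorized implementation.

```python
import random
from statistics import mean
from typing import List


def permutation_pvalue(a: List[float], b: List[float], n_perm: int = 1000, seed: int = 42) -> float:
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)  # local generator: does not touch global state
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[: len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Made-up expression values for two groups.
control = [5.1, 4.8, 5.3, 5.0, 4.9]
treated = [6.2, 6.0, 5.9, 6.4, 6.1]
p_first = permutation_pvalue(control, treated)
p_second = permutation_pvalue(control, treated)
print(p_first, p_first == p_second)
```

A local `random.Random(seed)` keeps the result independent of whatever other code has done to the global random state, which makes individual cells safe to re-run out of order.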
**Document in notebook:**
```markdown
## Stochastic Operations
This analysis uses:
- UMAP (random initialization, seed=42)
- Leiden clustering (stochastic optimization, seed=42)
- 1000-iteration permutation test (seed=42)

All seeds set to 42 for reproducibility.
```
Session Info Output (Mandatory)
End every notebook with comprehensive session info:
```python
# %%
# Session Information for Reproducibility
import session_info
session_info.show(
    dependencies=True,
    html=False
)

# Alternative for single-cell workflows:
import scanpy as sc
sc.logging.print_versions()

# Alternative for base Python (importlib.metadata replaces the deprecated pkg_resources):
import sys
from importlib.metadata import version

print(f"Python: {sys.version}")
for pkg in ['numpy', 'pandas', 'scipy', 'matplotlib', 'seaborn']:
    print(f"{pkg}: {version(pkg)}")
```
**What this captures:**
- Python version
- Operating system
- All package versions (including dependencies)
- Execution timestamp
**Why this matters:**
- API changes between package versions
- Statistical method implementations evolve
- Bugs get fixed (results may change)
- Reviewers need to verify methods
File Path Best Practices
Use relative paths and variables:
```python
# %%
from pathlib import Path

# Define all paths at top of notebook
DATA_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")
RESULTS_DIR = Path("results/analysis_2026-01-29")
FIGURES_DIR = RESULTS_DIR / "figures"

# Create output directories
for directory in [PROCESSED_DIR, RESULTS_DIR, FIGURES_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

# Use variables throughout
counts_file = DATA_DIR / "counts_matrix.h5ad"
metadata_file = DATA_DIR / "sample_metadata.csv"
output_file = PROCESSED_DIR / "normalized_counts.h5ad"
figure_file = FIGURES_DIR / "umap_clusters.pdf"

print(f"Data directory: {DATA_DIR.resolve()}")
print(f"Results directory: {RESULTS_DIR.resolve()}")
```
**Never use hardcoded absolute paths:**
```python
# ❌ BAD (non-reproducible):
adata = sc.read_h5ad("/Users/yourname/project/data/counts.h5ad")
plt.savefig("/Users/yourname/Desktop/figure.pdf")

# ✅ GOOD (reproducible):
adata = sc.read_h5ad(DATA_DIR / "counts.h5ad")
plt.savefig(FIGURES_DIR / "umap_clusters.pdf")
```
Data Provenance Documentation
Document data sources in notebook:
```markdown
## Data Sources
### Input Data
- File: `data/raw/GSE123456_counts.h5ad`
- Source: GEO accession GSE123456
- Download date: 2026-01-15
- Download command: `wget https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE123456`
- Original publication: Smith et al. (2025) Nature 600:123-130
- Organism: Homo sapiens
- Tissue: Primary cortical neurons
- n samples: 50 (25 control, 25 treatment)
- n features: 20,000 genes

### Reference Data
- Genome build: GRCh38 (hg38)
- Gene annotations: GENCODE v42
- Downloaded: 2026-01-10 from https://www.gencodegenes.org/
```
**Why this matters:**
- Data can be updated or removed from repositories
- Genome builds affect coordinate-based analyses
- Sample metadata clarifies experimental design
- Enables others to download identical data
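Because repository files can change after download, recording a checksum alongside the accession and date lets a later reader confirm they hold the same bytes. The sketch below is one way to do this with the standard library; the helper name and the demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file standing in for a downloaded dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.fasta"
    demo_file.write_bytes(b">seq1\nACGT\n")
    checksum = sha256_of_file(demo_file)

print(f"SHA-256: {checksum}")
```

The printed digest can be pasted into the Data Sources cell next to the file it describes.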
Reproducibility Pre-Flight Checklist
Before starting analysis, verify:
- Environment documented (`environment.yml` or `requirements.txt` exists)
- Environment creation documented in notebook
- Random seeds will be set for all stochastic operations
- File paths use variables (no hardcoded absolute paths)
- Data sources documented (where to download, version, date)
- Genome build / reference database versions specified
- Session info cell will be added at end
Before handoff to PI, verify:
- Notebook runs end-to-end without errors (Restart Kernel & Run All)
- Results reproducible (run twice, identical outputs)
- All figures saved to `FIGURES_DIR` with descriptive names
- All processed data saved to `PROCESSED_DIR`
- Session info cell executed and output visible
- Execution time reasonable (< 2 hours for routine analyses)
Integration with notebook-writer Skill
When creating notebooks programmatically, use the `notebook-writer` skill with reproducibility standards:
```python
from pathlib import Path

# Use notebook-writer to create template
cells = [
    {'type': 'markdown', 'content': '## Computational Environment\n...'},
    {'type': 'code', 'content': 'import sys\nprint(f"Python: {sys.version}")'},
    {'type': 'markdown', 'content': '## Data Loading\n...'},
    # ... analysis cells ...
    {'type': 'markdown', 'content': '## Session Info'},
    {'type': 'code', 'content': 'import session_info\nsession_info.show()'}
]

# Create reproducible notebook
# (create_notebook_markdown comes from the notebook-writer skill; its import is not shown here)
notebook_path = create_notebook_markdown(
    title="Reproducible RNA-seq Analysis",
    cells=cells,
    output_path=Path("analysis/rnaseq_analysis.md")
)
```
Common Reproducibility Failures and Fixes
| Issue | Problem | Fix |
|---|---|---|
| Different results on rerun | No random seed set | Set seeds for numpy, random, scanpy, torch |
| Import errors | Missing package versions | Create an environment file (`environment.yml` or `requirements.txt`) |
| File not found | Hardcoded paths | Use Path variables defined at top |
| Old package behavior | Package version mismatch | Document versions with a session info cell |
| Data source vanished | URL changed or removed | Document download date, accession, mirror sites |
| Genome coordinate mismatch | Different genome build | Specify build (GRCh38 vs GRCh37) in notebook |
Bioinformatics-Specific Reproducibility Considerations
Organism and Reference Versions:
```python
# Document in code cell
ORGANISM = "Homo sapiens"
GENOME_BUILD = "GRCh38"  # or "mm39" for mouse, "dm6" for fly, etc.
ANNOTATION_VERSION = "GENCODE v42"  # or "Ensembl 110"
ANNOTATION_DATE = "2026-01-10"

print("Analysis configuration:")
print(f"  Organism: {ORGANISM}")
print(f"  Genome: {GENOME_BUILD}")
print(f"  Annotations: {ANNOTATION_VERSION} ({ANNOTATION_DATE})")
```
**Bioinformatics Tools (if used):**
```markdown
## External Tools
- STAR aligner: v2.7.11a (for read mapping)
- MACS2: v2.2.9.1 (for peak calling)
- bedtools: v2.31.0 (for interval operations)

All tools available in micromamba environment (see environment.yml).
```
**Data Processing Parameters:**
```python
# Document all filtering/QC thresholds
QC_PARAMS = {
    'min_genes_per_cell': 200,
    'min_cells_per_gene': 3,
    'max_pct_mt': 15,  # percent mitochondrial reads
    'min_counts': 1000,
    'highly_variable_genes': 2000,
    'n_pcs': 50,  # principal components
    'umap_neighbors': 15,
    'leiden_resolution': 0.8
}

print("Quality control parameters:")
for param, value in QC_PARAMS.items():
    print(f"  {param}: {value}")
```
Code Quality Standards
During Implementation
- Copilot reviews continuously - expect adversarial feedback
- Write clear comments explaining biological context
- Use descriptive variable names
- Modularize repeated operations into functions
- Log progress for long-running analyses
Testing
- Validate on small test data first
- Check edge cases (empty data, single sample, all zeros)
- Compare to expected results (positive controls)
- Verify reproducibility (run twice, same results)
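The edge cases named above are easiest to handle when each helper states its behavior for them explicitly. As a small sketch, here is a library-size normalization (counts per million) for a single sample that defines what happens for empty and all-zero input; the function name is illustrative, and returning zeros for an all-zero sample is one possible convention rather than a required one.

```python
from typing import List


def counts_per_million(counts: List[float]) -> List[float]:
    """Scale one sample's counts to counts per million (CPM)."""
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)  # empty or all-zero sample: nothing to scale
    return [c * 1_000_000 / total for c in counts]


# Edge cases first, then a normal case.
print(counts_per_million([]))
print(counts_per_million([0, 0]))
print(counts_per_million([1, 3]))
```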
Common Analysis Patterns
Pattern 1: Differential Expression (RNA-seq)
```python
# 1. Load counts
# 2. Filter low-abundance genes
# 3. Normalize (DESeq2, TMM, or library size)
# 4. Statistical test (DESeq2, edgeR, limma)
# 5. Multiple testing correction
# 6. Volcano plot + heatmap

# → Use `pydeseq2` skill for implementation details
```
Pattern 2: Single-Cell Analysis
```python
# 1. Load AnnData object
# 2. QC filtering (cells and genes)
# 3. Normalization and log-transform
# 4. Feature selection (highly variable genes)
# 5. Dimensionality reduction (PCA, UMAP)
# 6. Clustering
# 7. Marker gene identification
# 8. Visualization

# → Use `scanpy` skill for implementation details
```
Pattern 3: Sequence Analysis
```python
# 1. Read FASTA/FASTQ
# 2. Quality filtering
# 3. Alignment or motif search
# 4. Feature extraction
# 5. Statistical summary

# → Use `biopython` skill for implementation details
```
References
For detailed guidance:
- `references/analysis_workflows.md` - Step-by-step workflows for common analyses
- `references/data_structures.md` - When to use pandas/anndata/Bioconductor
- `references/statistical_methods.md` - Which test for which data
- `references/visualization_best_practices.md` - Plot selection and styling
Helper Scripts
Available in `scripts/`:
- `qc_pipeline.py` - Automated QC for RNA-seq data
- `differential_expression_template.py` - Complete DESeq2 pipeline
- `data_loader_helpers.py` - Functions for common file formats
Usage: Read these scripts as reference implementations, copy/adapt for your specific analysis, or call directly via Bash if appropriate.
Integration with Domain Skills
When analysis requires specialized knowledge:
| Data Type | Primary Skill | When to Use |
|---|---|---|
| Single-cell RNA-seq | | Cell type identification, clustering, trajectory |
| Bulk RNA-seq | | Differential gene expression |
| Sequences | | Alignment, motif search, format conversion |
| Statistical modeling | | Regression, time series, GLMs |
| Pathway analysis | | Gene set enrichment |
Pattern:
- Use `bioinformatician` for overall workflow
- Invoke specialized skill for domain-specific steps
- Integrate results back into main analysis
Copilot Review Integration
During implementation, the `copilot` skill reviews your code:
- Expect critical feedback (adversarial but constructive)
- Fix issues immediately before proceeding
- Iterate until code is robust
- Don't take criticism personally - it catches bugs early
Deliverables
A complete notebook should include:
Technical Components (existing):
1. Code cells: Well-commented, modular analysis
2. Visualizations: Publication-ready figures
3. Statistics: Complete reporting (test, p-value, effect size, n)
4. Exports: Processed data files, figure files
5. Session info: Package versions for reproducibility
Biological Components (new):
6. Biological Context Cells (markdown):
   - Research question in biological terms
   - Hypothesis and expected outcomes
   - Biological description of each analysis step
   - Relevance to biological question
7. Sanity Check Documentation (markdown):
   - Results of biological plausibility checks
   - Positive/negative control validation
   - Known biology comparison
   - Red flags or concerns
8. Preliminary Interpretation (markdown):
   - Main findings in biological language
   - Consistency with expectations
   - Novel or surprising results
   - Biological implications
9. Expert Handoff Section (markdown, if needed):
   - Structured questions for biologist-commentator
   - Specific findings needing interpretation
   - Recommended follow-up analyses
   - Caveats and limitations
Quality Indicator: The notebook should be readable by a biologist who doesn't code.
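For the session-info component listed above, one possible final notebook cell is sketched below. It uses only the standard library (`importlib.metadata`, Python 3.8+); the `session_info` helper and the package list are assumptions for illustration, so list whatever the notebook actually imports.

```python
import platform
import sys
from importlib import metadata


def session_info(packages):
    """Return Python, platform, and package versions as a dict of strings."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of failing the cell.
            info[name] = "not installed"
    return info


# Illustrative package list; replace with the notebook's real dependencies.
for name, version in session_info(["numpy", "pandas", "scanpy", "pydeseq2"]).items():
    print(f"{name}: {version}")
```

Printing the versions in the notebook itself, rather than only in a separate environment file, keeps them attached to the results they produced.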
Quality Indicators
Your notebook is ready when:
Technical Quality:
- All code executes without errors
- Random seed set, results reproducible
- QC checks passed (positive controls work)
- Visualizations properly labeled
- Statistics completely reported
- Copilot approved code (no outstanding critical issues)
Biological Quality:
- Biological context provided for all major sections (concise, 1-3 sentences)
- Biological sanity checks completed and documented
- Positive/negative controls validated against biological expectations
- Preliminary interpretation written in biological terms
- Handoff to biologist-commentator structured (if unexpected findings)
- Notebook readable by non-coding biologist
Integration Ready:
- Ready for PI to expand interpretations for publication
- Clear which findings are routine vs need expert review
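Two of the technical checks above, a fixed random seed and complete statistical reporting, can be shown together in one short sketch. It uses a seeded two-sided permutation test on the difference in means, chosen here only because it needs nothing beyond the standard library; the function names and the measurement values are made up for illustration.

```python
import random
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation.

    Assumes at least two values per group and non-zero pooled variance.
    """
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)  # local seeded RNG so reruns give the same p-value
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (hits + 1) / (n_permutations + 1)


def report(a, b, seed=0):
    """Bundle test name, p-value, effect size, and n for reporting."""
    return {
        "test": "two-sided permutation test (difference in means)",
        "p_value": permutation_test(a, b, seed=seed),
        "effect_size_cohens_d": cohens_d(a, b),
        "n": (len(a), len(b)),
    }


# Made-up measurements for two groups.
treated = [5.1, 4.8, 5.6, 5.3, 4.9]
control = [3.9, 4.2, 4.0, 4.4, 3.8]
result = report(treated, control, seed=0)
```

Returning all four fields from one function makes it harder to report a p-value without its effect size and sample sizes.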