bioinformatician

Bioinformatician Skill

Purpose

Implement computational analyses of biological data, including:

Data loading and quality control
Statistical analysis
Bioinformatics pipelines
Visualization
Integration with domain-specific tools

When to Use This Skill

Use this skill when you need to:

Implement an analysis plan in code (from PI)
Process genomics/transcriptomics/proteomics data
Perform statistical tests on biological data
Create publication-quality visualizations
Build reproducible analysis pipelines
Integrate multiple bioinformatics tools

Workflow Integration

Primary Pattern: Receive Plan → Implement → Deliver Notebook

Receive analysis_plan.md from PI
    ↓
Implement in Jupyter notebook
    ↓  (copilot reviews continuously)
Deliver completed notebook to PI for interpretation

Integration Points:

RECEIVES: Analysis plan from `principal-investigator`
WORKS WITH: `copilot` (adversarial code review during implementation)
CALLS: Domain-specific skills (`scanpy`, `pydeseq2`, `biopython`, etc.)
OUTPUTS: Jupyter notebooks with analysis code + results

Core Capabilities

1. Data Loading and Validation


Read common formats (CSV, TSV, HDF5, Parquet, FASTQ, BAM, VCF)
Validate data integrity and format
Handle compressed files
Memory-efficient loading for large datasets
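
A minimal sketch of memory-efficient loading with validation, using only the standard library so it runs anywhere. The function name `iter_validated_rows`, the required columns, and the demo table are illustrative assumptions; a real analysis would more likely use pandas or a format-specific reader.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def iter_validated_rows(path, required_columns):
    """Stream rows from an (optionally gzip-compressed) TSV, validating as we go.

    Rows are yielded one at a time, so the whole file never sits in memory.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = set(required_columns) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"Missing required columns: {sorted(missing)}")
        for line_number, row in enumerate(reader, start=2):
            # DictReader stores extra fields under the key None and fills
            # missing fields with the value None; both mean a malformed row.
            if None in row or None in row.values():
                raise ValueError(f"Line {line_number}: wrong number of fields")
            yield row


# Demo on a tiny hypothetical counts table written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "counts.tsv.gz"
    with gzip.open(demo_path, "wt", newline="") as handle:
        handle.write("gene_id\tsample_a\tsample_b\nGENE1\t10\t0\nGENE2\t5\t7\n")
    rows = list(iter_validated_rows(demo_path, ["gene_id", "sample_a"]))

print(rows)
```

Because the function is a generator, validation errors surface while iterating, not when the function is first called.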

2. Quality Control

Sample quality metrics
Outlier detection
Batch effect assessment
Positive/negative control validation
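
As one possible approach to outlier detection, the sketch below flags samples that sit far from the median in units of the median absolute deviation (MAD). The function name, the 3-MAD cutoff, and the library sizes are assumptions chosen for illustration, not thresholds prescribed by this skill.

```python
from statistics import median


def mad_outliers(values_by_sample, n_mads=3.0):
    """Return sample names lying more than n_mads scaled MADs from the median."""
    values = list(values_by_sample.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no measurable spread, so this rule cannot flag anything
    scaled_mad = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [
        name
        for name, value in values_by_sample.items()
        if abs(value - center) > n_mads * scaled_mad
    ]


# Hypothetical per-sample library sizes (total reads).
library_sizes = {
    "ctrl_1": 21_000_000,
    "ctrl_2": 19_500_000,
    "ctrl_3": 20_200_000,
    "treat_1": 20_800_000,
    "treat_2": 3_100_000,  # suspiciously shallow
    "treat_3": 19_900_000,
}
print(mad_outliers(library_sizes))
```

A flagged sample is a prompt for inspection, not an automatic exclusion; whether a shallow library reflects a failed prep or real biology is a judgment to document in the notebook.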

3. Statistical Analysis

Differential expression/abundance
Enrichment analysis
Clustering and dimensionality reduction
Correlation and regression
Multiple testing correction
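
To make the multiple testing step concrete, here is a small pure-Python sketch of the Benjamini-Hochberg adjustment. The p-values are made up for illustration. In practice a tested implementation such as `multipletests(..., method="fdr_bh")` from statsmodels is the usual choice; this version only shows the arithmetic.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


p_values = [0.001, 0.008, 0.039, 0.041, 0.60]  # illustrative values
q_values = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in q_values]
print([round(q, 4) for q in q_values], significant)
```

Note that the third and fourth tests have raw p-values below 0.05 but adjusted values just above it, which is exactly the kind of borderline call worth reporting explicitly.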

4. Visualization

Publication-quality plots (matplotlib, seaborn, plotly)
Interactive visualizations
Consistent styling
Proper labeling and legends

5. Pipeline Development

Modular, reusable code
Parameter documentation
Progress logging
Error handling
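
The four points above can be sketched together in a few lines. The helper `run_step`, the step `filter_low_counts`, and its `min_total` parameter are hypothetical names introduced for this example; the pattern, not the names, is the point.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("pipeline")


def run_step(name, func, *args, **kwargs):
    """Run one pipeline step with progress logging and contextual errors."""
    logger.info("Starting step: %s", name)
    try:
        result = func(*args, **kwargs)
    except Exception as exc:
        logger.error("Step %s failed: %s", name, exc)
        # Re-raise with the step name attached, keeping the original cause.
        raise RuntimeError(f"Pipeline step '{name}' failed") from exc
    logger.info("Finished step: %s", name)
    return result


def filter_low_counts(counts, min_total=10):
    """Keep genes whose summed count across samples reaches min_total.

    min_total is a documented parameter rather than a hidden constant.
    """
    return {gene: values for gene, values in counts.items() if sum(values) >= min_total}


counts = {"GENE1": [12, 30], "GENE2": [0, 1], "GENE3": [5, 5]}
filtered = run_step("filter_low_counts", filter_low_counts, counts, min_total=10)
print(sorted(filtered))
```

Wrapping each step this way keeps the analysis functions themselves free of logging code, so they stay easy to reuse and test.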

Standard Notebook Structure

Use the template in `assets/notebook-structure-template.ipynb`:

1. Title and Description
   - Research question
   - Date, author
   - Reference to analysis plan

2. Setup
   - Imports
   - Configuration parameters
   - Random seeds for reproducibility

3. Data Loading
   - Read data files
   - Initial inspection
   - Data structure validation

4. Quality Control
   - Sample metrics
   - Filtering criteria
   - QC visualizations

5. Analysis
   - Statistical tests
   - Transformations
   - Model fitting

6. Visualization
   - Main figures
   - Supplementary plots

7. Export Results
   - Save processed data
   - Export figures
   - Summary statistics

8. Session Info
   - Package versions
   - Execution time

Biological Literacy Framework

Writing Style for Biological Context

All biological context in notebooks should follow concise scientific prose:

Principles:

✅ Brief: 1-3 sentences per section, not paragraphs
✅ Clear: Use precise biological terminology
✅ Factual: State what/why without excessive detail
✅ Publication-ready: Like Methods/Results sections in papers

**Example - Good (Concise)**:
```markdown
## Biological Context

Differential expression analysis comparing wild-type and mutant neurons identifies genes affected by loss of transcription factor X. Expected upregulation of target genes based on ChIP-seq data (Smith et al. 2020).
```

**Example - Avoid (Too Verbose)**:
```markdown
## Biological Context

In this analysis, we will perform differential expression analysis to compare gene expression between wild-type neurons and neurons with a mutation in transcription factor X. Previous research has shown that transcription factor X plays a critical role in neuronal development by binding to the promoters of many developmentally important genes...
```

When to Provide Interpretation vs Handoff


Bioinformatician Handles (routine interpretation):

Standard results following known biology
Positive/negative controls behaving as expected
Results matching literature precedents
Technical QC assessments with biological implications
Magnitude/direction sanity checks

Handoff to Biologist-Commentator (expert needed):

Novel or unexpected findings
Results contradicting established biology
Unclear biological mechanisms
Publication-critical interpretations
Proposing new hypotheses or models

Enhanced Notebook Structure

Use this structure for biologically-literate notebooks:

1. Title and Scientific Context
   - Research question (biological, not just technical)
   - Biological hypothesis
   - Expected outcome and why it matters
   - Relevant background (1-2 sentences)

2. Setup (code)
   - Imports, parameters, seeds

3. Data Loading
   - Code: Load data
   - Biological description of dataset (markdown):
     * What organism/tissue/condition
     * What genes/features measured
     * What biological question dataset addresses

4. Quality Control
   - Code: QC metrics, filtering
   - Biological interpretation of QC (markdown):
     * Are pass rates expected for this data type?
     * Do failed samples have biological meaning?
     * Red flags from biological perspective?

5. Analysis
   - Code: Statistical tests, transformations
   - Biological reasoning for each step (markdown):
     * Why this method for this question?
     * What biological assumption being tested?
     * Positive/negative controls?

6. Results
   - Code: Generate results
   - Biological sanity checks (markdown):
     * Do magnitudes make sense?
     * Do directions align with biology?
     * Any known biology violated?

7. Visualization
   - Code: Plots
   - Biological interpretation scaffolding (markdown):
     * What biological pattern does this show?
     * Is this expected or surprising?
     * What follow-up questions does this raise?

8. Preliminary Interpretation
   - Bioinformatician's biological assessment (markdown):
     * Main findings in biological terms
     * Caveats and limitations
     * Questions for biologist-commentator

9. Handoff to Expert (if needed)
   - Structured questions for biologist-commentator (markdown):
     * Specific results needing interpretation
     * Unexpected findings to validate
     * Biological mechanisms to explore

10. Export (code)
    - Save data, figures, session info

Biological Sanity Check Framework

Run these checks before accepting results:

Expression/Abundance Checks


Order of magnitude reasonable? (log2FC > 10 is suspicious)
Direction matches known biology? (check a few known genes)
Positive controls behave as expected?
Negative controls show no signal?
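
Two of these checks are mechanical enough to script. The sketch below assumes results are available as a gene-to-log2FC mapping; the function name and the housekeeping tolerance of 1.0 are assumptions made for the example, while the |log2FC| > 10 cutoff comes from the checklist above.

```python
def expression_sanity_flags(log2fc_by_gene, housekeeping_genes,
                            max_abs_log2fc=10.0, housekeeping_tolerance=1.0):
    """Return human-readable flags for results failing basic plausibility checks."""
    flags = []
    for gene, log2fc in log2fc_by_gene.items():
        if abs(log2fc) > max_abs_log2fc:
            flags.append(f"{gene}: |log2FC| = {abs(log2fc):.1f} exceeds {max_abs_log2fc}")
    for gene in housekeeping_genes:
        if gene not in log2fc_by_gene:
            flags.append(f"{gene}: housekeeping gene missing from results")
        elif abs(log2fc_by_gene[gene]) > housekeeping_tolerance:
            flags.append(
                f"{gene}: housekeeping gene changed (log2FC = {log2fc_by_gene[gene]:+.1f})"
            )
    return flags


# Hypothetical differential expression results.
results = {"actb": 0.1, "gapdh": -1.8, "gene_x": 12.4, "gene_y": 2.3}
flags = expression_sanity_flags(results, housekeeping_genes=["actb", "gapdh"])
for flag in flags:
    print(flag)
```

An empty list means only that these two automated checks passed; direction and control checks still need a look at specific known genes.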

Statistical Checks with Biological Lens

Top hits include known biology? (literature validation)
Results robust to threshold changes?
Batch effects vs real biology separated?
Multiple testing appropriate for biology? (discovery vs validation)

Genomics-Specific

Chromosome names consistent? (chr1 vs 1)
Coordinates sensible? (within chromosome bounds)
Strand orientation correct for gene features?
Genome build consistent throughout?
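
The chromosome naming check is easy to automate before any interval operation. This is a rough sketch: the function names and the "ucsc"/"ensembl" labels (for names with and without the `chr` prefix) are conventions adopted here for illustration.

```python
def chromosome_style(names):
    """Classify chromosome names as 'ucsc', 'ensembl', 'mixed', or 'empty'."""
    styles = {"ucsc" if name.startswith("chr") else "ensembl" for name in names}
    if len(styles) > 1:
        return "mixed"
    return styles.pop() if styles else "empty"


def check_consistent(**sources):
    """Raise if the named sources disagree on chromosome naming style."""
    observed = {label: chromosome_style(names) for label, names in sources.items()}
    if "mixed" in observed.values() or len(set(observed.values())) > 1:
        raise ValueError(f"Inconsistent chromosome naming: {observed}")
    return observed


# Hypothetical inputs: peaks use 'chr1', the annotation uses '1'.
peaks = ["chr1", "chr2", "chrX"]
annotation = ["1", "2", "X"]
try:
    check_consistent(peaks=peaks, annotation=annotation)
except ValueError as error:
    print(error)
```

Failing loudly here is deliberate: a silent naming mismatch typically produces empty overlaps rather than an error, which is easy to misread as a biological result.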

Experimental Design

Sample size adequate for this effect size?
Replicates biological or technical?
Confounders identified and addressed?
Controls appropriate for this experiment type?

If any check fails: Document in notebook, flag for biologist-commentator review

Biological Context Templates

Template: Differential Expression Analysis

```markdown
## Biological Context

Comparing [condition A] vs [condition B] to identify genes involved in [biological process]. Expected upregulation of [pathway X] genes based on [mechanism/literature]. Positive controls: [gene1, gene2]. Expected log2FC range: [X-Y] based on [citation].

## Biological Sanity Checks

- Known pathway genes show expected direction (e.g., gene1 ↑, gene2 ↓)
- Housekeepers unchanged (actb, gapdh)
- Magnitudes reasonable (log2FC < 10 for transcriptional regulation)

## Preliminary Interpretation

Top hits include [gene X, Y, Z] involved in [biological process], consistent with [hypothesis/literature]. [Gene W] unexpected - requires expert validation.

Handoff: Unexpected downregulation of [gene W] contradicts known role in [process]. Biologist-commentator needed for mechanism assessment.
```

Template: Single-Cell Clustering

```markdown
## Biological Context

Clustering [tissue] cells to identify cell types. Expected populations: [celltype1 (markers: a,b,c), celltype2 (markers: d,e,f)]. Reference atlas: [citation if available].

## Cluster Validation

- Cluster 1: [celltype] - markers: [genes] ✓
- Cluster 2: [celltype] - markers: [genes] ✓
- Cluster 3: Novel population - markers: [genes] - needs expert review

Handoff: Cluster 3 shows unexpected marker combination [X+Y+Z-]. Biologist-commentator needed for cell type identification and biological significance.
```

Template: Expert Handoff Format

Use this concise format when escalating to biologist-commentator:

```markdown
## Expert Interpretation Needed

Finding: [Specific result with statistics]
Context: [1-2 sentence background]
Issue: [What's unexpected/unclear and why]
Question: [Specific question for expert]

Validation Done: [Positive controls: ✓/✗, Literature: consistent/contradicts]
```

**Example**:
```markdown
## Expert Interpretation Needed

Finding: Gene X shows 8-fold upregulation (padj<0.001) in mutant vs WT
Context: Gene X is a transcriptional repressor, so its targets are expected to be downregulated
Issue: Target genes are also upregulated (contradicts repressor function)
Question: Alternative mechanism? Post-transcriptional regulation? Data artifact?

Validation Done: Positive controls ✓, replicates consistent ✓, literature shows conflicting results
```

Biologist-Commentator Integration Pattern

When to Invoke Biologist-Commentator

Pre-Analysis (Method Validation):

Skill(skill="biologist-commentator", args="Validate that DESeq2 is appropriate for [specific experiment design]. Confirm controls are adequate and confounders are addressed.")

During Analysis (Quick Check):

Use biological sanity check framework (above)
Document any red flags
Continue if checks pass, escalate if any fail

Post-Analysis (Expert Interpretation):

Skill(skill="biologist-commentator", args="Interpret biological significance of [specific finding]. Results show [X], which is [expected/unexpected]. Known biology suggests [Y]. Please validate interpretation and suggest mechanisms.")

Handoff Workflow


Bioinformatician: Run analysis, perform sanity checks, document findings
Handoff: Create structured handoff section in notebook (see template above)
Biologist-Commentator: Provides expert interpretation, mechanism insights, validation
Bioinformatician: Incorporate interpretation into notebook, flag needed validations

Pre-Flight Checklist

Before starting implementation, verify:

Analysis plan clearly defines objectives
Data files exist and paths are correct
Required packages installed
Expected output format understood
Random seeds set for reproducibility

Use `assets/analysis-checklist.md` for the complete list.

Reproducibility Standards

Critical: Every bioinformatics analysis must be fully reproducible. Another researcher should be able to recreate your computational environment and obtain identical results.

Environment Documentation (Mandatory)

Start every notebook with environment documentation:

```python
# %%
# Computational Environment
import sys

import numpy as np
import pandas as pd
import scanpy as sc  # or relevant packages

print("=" * 60)
print("COMPUTATIONAL ENVIRONMENT")
print("=" * 60)
print(f"Python: {sys.version}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scanpy: {sc.__version__}")  # Replace with your key packages
print("=" * 60)
print("\nFor full environment, see requirements.txt")
```

**Create environment files before starting analysis:**

```bash
# For micromamba users (recommended for bioinformatics):
# Export micromamba packages:
micromamba env export > environment.yml

# Export pip-installed packages separately (micromamba export does not include pip packages):
pip freeze > pip-requirements.txt

# For pip users:
pip freeze > requirements.txt

# Document which file to use in notebook
```

**In notebook markdown cell:**
```markdown
## Computational Environment

- Kernel: Python 3.11 (bio-analysis-env)
- Environment file: `environment.yml` (recreate with `micromamba env create -f environment.yml`)
- Key packages: scanpy==1.10.0, numpy==1.26.3, pandas==2.2.0, scipy==1.12.0
- Execution date: 2026-01-29
```

Random Seed Setting (Mandatory for Stochastic Processes)


Set seeds in setup cell:

```python
# %%
# Random seeds for reproducibility
import random

import numpy as np

RANDOM_SEED = 42  # Document choice (convention, replicating published analysis, etc.)

# Core Python/NumPy
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Scanpy (single-cell analysis)
# Stochastic scanpy functions such as sc.tl.umap and sc.tl.leiden also accept
# random_state; pass RANDOM_SEED to them explicitly as well.
import scanpy as sc
sc.settings.seed = RANDOM_SEED

# PyTorch (if using deep learning)
import torch
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)

# TensorFlow (if using)
import tensorflow as tf
tf.random.set_seed(RANDOM_SEED)

print(f"Random seed set to {RANDOM_SEED} for reproducibility")
```

**Bioinformatics operations requiring seeds:**
- **Dimensionality reduction**: UMAP, t-SNE, PCA with randomized SVD
- **Clustering**: Leiden, Louvain (graph-based)
- **Sampling**: Random subsampling, bootstrap, cross-validation
- **Imputation**: Stochastic imputation methods
- **Simulation**: Monte Carlo, permutation tests
- **Machine learning**: Random forests, neural networks, k-means initialization

**Document in notebook:**
```markdown
## Stochastic Operations

This analysis uses:

- UMAP (random initialization, seed=42)
- Leiden clustering (stochastic optimization, seed=42)
- 1000-iteration permutation test (seed=42)

All seeds set to 42 for reproducibility.
```
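
To illustrate why seeding matters for one of the operations listed above, here is a minimal permutation test that draws from its own seeded generator. The function name and the expression values are made up for the example; it uses `random.Random` from the standard library rather than any particular bioinformatics package.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=1000, seed=42):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)  # local generator: global random state is untouched
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical normalized expression values for one gene.
control = [5.1, 4.8, 5.3, 5.0, 4.9]
treated = [6.2, 6.0, 6.4, 5.9, 6.1]

first_run = permutation_p_value(control, treated)
second_run = permutation_p_value(control, treated)
print(first_run, first_run == second_run)
```

Passing the seed as a parameter, instead of relying on global state set elsewhere in the notebook, means the result does not change when cells are reordered or re-executed.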

Session Info Output (Mandatory)


End every notebook with comprehensive session info:

```python
# %%
# Session Information for Reproducibility
import session_info

session_info.show(dependencies=True, html=False)

# Alternative for single-cell workflows:
# import scanpy as sc
# sc.logging.print_versions()

# Alternative for base Python (importlib.metadata replaces the deprecated pkg_resources):
# import sys
# from importlib.metadata import version
# print(f"Python: {sys.version}")
# for pkg in ['numpy', 'pandas', 'scipy', 'matplotlib', 'seaborn']:
#     print(f"{pkg}: {version(pkg)}")
```

**What this captures:**
- Python version
- Operating system
- All package versions (including dependencies)
- Execution timestamp

**Why this matters:**
- API changes between package versions
- Statistical method implementations evolve
- Bugs get fixed (results may change)
- Reviewers need to verify methods

File Path Best Practices

Use relative paths and variables:

```python
# %%
from pathlib import Path

# Define all paths at top of notebook
DATA_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")
RESULTS_DIR = Path("results/analysis_2026-01-29")
FIGURES_DIR = RESULTS_DIR / "figures"

# Create output directories
for directory in [PROCESSED_DIR, RESULTS_DIR, FIGURES_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

# Use variables throughout
counts_file = DATA_DIR / "counts_matrix.h5ad"
metadata_file = DATA_DIR / "sample_metadata.csv"
output_file = PROCESSED_DIR / "normalized_counts.h5ad"
figure_file = FIGURES_DIR / "umap_clusters.pdf"

print(f"Data directory: {DATA_DIR.resolve()}")
print(f"Results directory: {RESULTS_DIR.resolve()}")
```

**Never use hardcoded absolute paths:**
```python
# ❌ BAD (non-reproducible):
adata = sc.read_h5ad("/Users/yourname/project/data/counts.h5ad")
plt.savefig("/Users/yourname/Desktop/figure.pdf")

# ✅ GOOD (reproducible):
adata = sc.read_h5ad(DATA_DIR / "counts.h5ad")
plt.savefig(FIGURES_DIR / "umap_clusters.pdf")
```

Data Provenance Documentation


Document data sources in notebook:

```markdown
## Data Sources

### Input Data

- File: `data/raw/GSE123456_counts.h5ad`
- Source: GEO accession GSE123456
- Download date: 2026-01-15
- Download command: `wget "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE123456"`
- Original publication: Smith et al. (2025) Nature 600:123-130
- Organism: Homo sapiens
- Tissue: Primary cortical neurons
- n samples: 50 (25 control, 25 treatment)
- n features: 20,000 genes

### Reference Data

- Genome build: GRCh38 (hg38)
- Gene annotations: GENCODE v42
- Downloaded: 2026-01-10 from https://www.gencodegenes.org/
```

**Why this matters:**
- Data can be updated or removed from repositories
- Genome builds affect coordinate-based analyses
- Sample metadata clarifies experimental design
- Enables others to download identical data
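
One way to make "identical data" checkable is to record a checksum next to the download details. The sketch below is an optional addition, not something this skill mandates; the function name and the demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file standing in for a downloaded dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo_counts.tsv"
    demo_file.write_bytes(b"gene_id\tsample_a\nGENE1\t10\n")
    checksum = sha256_of_file(demo_file)

print(f"demo_counts.tsv sha256: {checksum}")
```

Pasting the printed digest into the Data Sources cell lets anyone who re-downloads the file confirm they have the same bytes before rerunning the analysis.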

Reproducibility Pre-Flight Checklist


Before starting analysis, verify:

Environment documented (`environment.yml` or `requirements.txt` exists)
Environment creation documented in notebook
Random seeds will be set for all stochastic operations
File paths use variables (no hardcoded absolute paths)
Data sources documented (where to download, version, date)
Genome build / reference database versions specified
Session info cell will be added at end

Before handoff to PI, verify:

Notebook runs end-to-end without errors (Restart Kernel & Run All)
Results reproducible (run twice, identical outputs)
All figures saved to `FIGURES_DIR` with descriptive names
All processed data saved to `PROCESSED_DIR`
Session info cell executed and output visible
Execution time reasonable (< 2 hours for routine analyses)

Integration with notebook-writer Skill

When creating notebooks programmatically, use the `notebook-writer` skill with reproducibility standards:

```python
from pathlib import Path

# Use notebook-writer to create template
cells = [
    {'type': 'markdown', 'content': '## Computational Environment\n...'},
    {'type': 'code', 'content': 'import sys\nprint(f"Python: {sys.version}")'},
    {'type': 'markdown', 'content': '## Data Loading\n...'},
    # ... analysis cells ...
    {'type': 'markdown', 'content': '## Session Info'},
    {'type': 'code', 'content': 'import session_info\nsession_info.show()'}
]

# Create reproducible notebook
# (create_notebook_markdown is assumed to be provided by the notebook-writer skill)
notebook_path = create_notebook_markdown(
    title="Reproducible RNA-seq Analysis",
    cells=cells,
    output_path=Path("analysis/rnaseq_analysis.md")
)
```

Common Reproducibility Failures and Fixes


Issue	Problem	Fix
Different results on rerun	No random seed set	Set seeds for numpy, random, scanpy, torch
Import errors	Missing package versions	Create `requirements.txt` or `environment.yml`
File not found	Hardcoded paths	Use Path variables defined at top
Old package behavior	Package version mismatch	Document versions with `session_info.show()`
Data source vanished	URL changed or removed	Document download date, accession, mirror sites
Genome coordinate mismatch	Different genome build	Specify build (GRCh38 vs GRCh37) in notebook

Bioinformatics-Specific Reproducibility Considerations

Organism and Reference Versions:

```python
# Document in code cell
ORGANISM = "Homo sapiens"
GENOME_BUILD = "GRCh38"  # or "mm39" for mouse, "dm6" for fly, etc.
ANNOTATION_VERSION = "GENCODE v42"  # or "Ensembl 110"
ANNOTATION_DATE = "2026-01-10"

print("Analysis configuration:")
print(f"  Organism: {ORGANISM}")
print(f"  Genome: {GENOME_BUILD}")
print(f"  Annotations: {ANNOTATION_VERSION} ({ANNOTATION_DATE})")
```

**Bioinformatics Tools (if used):**
```markdown
## External Tools

- STAR aligner: v2.7.11a (for read mapping)
- MACS2: v2.2.9.1 (for peak calling)
- bedtools: v2.31.0 (for interval operations)

All tools available in micromamba environment (see environment.yml).
```

**Data Processing Parameters:**
```python
# Document all filtering/QC thresholds
QC_PARAMS = {
    'min_genes_per_cell': 200,
    'min_cells_per_gene': 3,
    'max_pct_mt': 15,  # percent mitochondrial reads
    'min_counts': 1000,
    'highly_variable_genes': 2000,
    'n_pcs': 50,  # principal components
    'umap_neighbors': 15,
    'leiden_resolution': 0.8
}

print("Quality control parameters:")
for param, value in QC_PARAMS.items():
    print(f"  {param}: {value}")
```

Code Quality Standards

During Implementation

Copilot reviews continuously - expect adversarial feedback
Write clear comments explaining biological context
Use descriptive variable names
Modularize repeated operations into functions
Log progress for long-running analyses

Testing

Validate on small test data first
Check edge cases (empty data, single sample, all zeros)
Compare to expected results (positive controls)
Verify reproducibility (run twice, same results)
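
As a small illustration of these testing points, the sketch below exercises a toy normalization function against the edge cases listed above. The function `counts_per_million` and its handling of all-zero input are assumptions made for the example, not a normalization method prescribed here.

```python
def counts_per_million(counts):
    """Scale one sample's raw counts so they sum to one million."""
    if not counts:
        raise ValueError("No counts supplied")
    total = sum(counts.values())
    if total == 0:
        # An all-zero sample cannot be scaled; return zeros instead of dividing by zero.
        return {gene: 0.0 for gene in counts}
    return {gene: value * 1_000_000 / total for gene, value in counts.items()}


# Edge case: all zeros.
assert counts_per_million({"GENE1": 0, "GENE2": 0}) == {"GENE1": 0.0, "GENE2": 0.0}
# Edge case: a single feature.
assert counts_per_million({"GENE1": 7}) == {"GENE1": 1_000_000.0}
# Edge case: empty data should be rejected, not silently accepted.
try:
    counts_per_million({})
except ValueError:
    pass
else:
    raise AssertionError("empty input should be rejected")

# Reproducibility: the same input gives the same output on a second run.
sample = {"GENE1": 30, "GENE2": 70}
assert counts_per_million(sample) == counts_per_million(sample)
print(counts_per_million(sample))
```

Keeping checks like these in a cell near the function definition means they rerun every time the notebook is executed from a fresh kernel.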

Common Analysis Patterns

Pattern 1: Differential Expression (RNA-seq)

1. Load counts
2. Filter low-abundance genes
3. Normalize (DESeq2, TMM, or library size)
4. Statistical test (DESeq2, edgeR, limma)
5. Multiple testing correction
6. Volcano plot + heatmap

→ Use `pydeseq2` skill for implementation details

Pattern 2: Single-Cell Analysis

1. Load AnnData object
2. QC filtering (cells and genes)
3. Normalization and log-transform
4. Feature selection (highly variable genes)
5. Dimensionality reduction (PCA, UMAP)
6. Clustering
7. Marker gene identification
8. Visualization

→ Use `scanpy` skill for implementation details

Pattern 3: Sequence Analysis

1. Read FASTA/FASTQ
2. Quality filtering
3. Alignment or motif search
4. Feature extraction
5. Statistical summary

→ Use `biopython` skill for implementation details
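
For orientation, the first and last steps of the sequence analysis pattern can be sketched with the standard library alone. The parser, the `gc_fraction` helper, and the demo records are illustrative assumptions; for real data the `biopython` skill (for example `Bio.SeqIO.parse`) is the intended route.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_fraction(sequence):
    """Fraction of G/C among unambiguous A/C/G/T bases (0.0 if there are none)."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    total = gc + sequence.count("A") + sequence.count("T")
    return gc / total if total else 0.0


# Two made-up records; the second contains ambiguous bases (N).
demo = io.StringIO(">seq1 demo\nATGC\nGGCC\n>seq2\nATATNN\n")
summary = {header: (len(seq), gc_fraction(seq)) for header, seq in read_fasta(demo)}
print(summary)
```

Ambiguous bases are excluded from the denominator here; whichever convention is chosen, it should be stated in the notebook because it changes the reported GC content.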

References


For detailed guidance:

`references/analysis_workflows.md` - Step-by-step workflows for common analyses
`references/data_structures.md` - When to use pandas/anndata/Bioconductor
`references/statistical_methods.md` - Which test for which data
`references/visualization_best_practices.md` - Plot selection and styling

Helper Scripts


Available in `scripts/`:

`qc_pipeline.py` - Automated QC for RNA-seq data
`differential_expression_template.py` - Complete DESeq2 pipeline
`data_loader_helpers.py` - Functions for common file formats

Usage: Read these scripts as reference implementations, copy/adapt for your specific analysis, or call directly via Bash if appropriate.

Integration with Domain Skills


When analysis requires specialized knowledge:

Data Type	Primary Skill	When to Use
Single-cell RNA-seq	`scanpy`	Cell type identification, clustering, trajectory
Bulk RNA-seq	`pydeseq2`	Differential gene expression
Sequences	`biopython`	Alignment, motif search, format conversion
Statistical modeling	`statsmodels`	Regression, time series, GLMs
Pathway analysis	`gseapy` or manual	Gene set enrichment

Pattern:

Use `bioinformatician` for overall workflow
Invoke specialized skill for domain-specific steps
Integrate results back into main analysis

Copilot Review Integration

During implementation, the `copilot` skill reviews your code:

Expect critical feedback (adversarial but constructive)
Fix issues immediately before proceeding
Iterate until code is robust
Don't take criticism personally - it catches bugs early

Deliverables

Complete notebook should include:

Technical Components (existing):

1. Code cells: Well-commented, modular analysis
2. Visualizations: Publication-ready figures
3. Statistics: Complete reporting (test, p-value, effect size, n)
4. Exports: Processed data files, figure files
5. Session info: Package versions for reproducibility

Biological Components (new):

6. Biological Context Cells (markdown):
   - Research question in biological terms
   - Hypothesis and expected outcomes
   - Biological description of each analysis step
   - Relevance to biological question

7. Sanity Check Documentation (markdown):
   - Results of biological plausibility checks
   - Positive/negative control validation
   - Known biology comparison
   - Red flags or concerns

8. Preliminary Interpretation (markdown):
   - Main findings in biological language
   - Consistency with expectations
   - Novel or surprising results
   - Biological implications

9. Expert Handoff Section (markdown, if needed):
   - Structured questions for biologist-commentator
   - Specific findings needing interpretation
   - Recommended follow-up analyses
   - Caveats and limitations

Quality Indicator: The notebook should be readable by a biologist who doesn't code
