jupyter-notebook-analysis


Jupyter Notebook Analysis Patterns


Expert knowledge for creating comprehensive, statistically rigorous Jupyter notebook analyses.

When to Use This Skill


- Creating multi-cell Jupyter notebooks for data analysis
- Adding correlation analyses with statistical testing
- Implementing outlier removal strategies
- Building series of related visualizations (10+ figures)
- Analyzing large datasets with multiple characteristics

Common Pitfalls


Variable Shadowing in Loops


Problem: Using common variable names like `data` as loop variables overwrites global variables:

```python
# BAD - Shadows global 'data' variable
for i, (sp, data) in enumerate(species_by_gc_content[:10], 1):
    val = data['gc_content']
    print(f'{sp}: {val}')
```

After this loop, `data` is no longer your dataset list - it's the last species dict!

**Solution**: Use descriptive loop variable names:

```python
# GOOD - Uses specific name
for i, (sp, sp_data) in enumerate(species_by_gc_content[:10], 1):
    val = sp_data['gc_content']
    print(f'{sp}: {val}')
```

**Detection**: If you see errors like "Type: <class 'dict'>" when expecting a list, check for variable shadowing in recent cells.

**Prevention**:
- Never use generic names (`data`, `item`, `value`) as loop variables
- Use prefixed names (`sp_data`, `row_data`, `inv_data`)
- Add validation cells that check variable types
- Run "Restart & Run All" regularly to catch issues early
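A validation cell along these lines catches shadowing early (a minimal sketch; the `data` structure shown is a hypothetical dataset, not one from the original notebooks):

```python
# Hypothetical dataset loaded in an earlier cell
data = [{'species': 'A', 'gc_content': 41.2}, {'species': 'B', 'gc_content': 38.7}]

# ... analysis cells run in between ...

# Validation cell: fail fast if a later loop variable shadowed the dataset
assert isinstance(data, list), f"Expected list, got {type(data)} - check for shadowing"
assert all(isinstance(rec, dict) for rec in data), "Each record should be a dict"
print(f"OK: 'data' is a list of {len(data)} records")
```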

**Common shadowing patterns to avoid**:
```python
for data in dataset:            # Shadows 'data'
for i, data in enumerate(...):  # Shadows 'data'
for key, data in d.items():     # Shadows 'data'
```


Verify Column Names Before Processing


Problem: Assuming column names without checking actual DataFrame structure leads to immediate failures. Column names may use different capitalization, spacing, or naming conventions than expected.
Example error:

```python
# Assumed column name
df_filtered = df[df['scientific_name'] == target]  # KeyError!
# Actual column name was 'Scientific Name' (capitalized with space)
```


**Solution**: Always check actual columns first:

```python
import pandas as pd

df = pd.read_csv('data.csv')

# ALWAYS print columns before processing
print("Available columns:")
print(df.columns.tolist())

# Then write filtering code with correct names
df_filtered = df[df['Scientific Name'] == target_species]  # Correct
```

**Best practice for data processing scripts:**

```python
import sys

def verify_required_columns(df, required_cols):
    """Verify DataFrame has required columns."""
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        print(f"ERROR: Missing columns: {missing}")
        print(f"Available columns: {df.columns.tolist()}")
        sys.exit(1)

# Use it
required = ['Scientific Name', 'tolid', 'accession']
verify_required_columns(df, required)
```

**Common column name variations to watch for:**
- `scientific_name` vs `Scientific Name` vs `ScientificName`
- `species_id` vs `species` vs `Species ID`
- `genome_size` vs `Genome size` vs `GenomeSize`
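Another defensive option is to normalize column names up front so only one convention survives (a sketch; `normalize_columns` is a helper name introduced here, not part of the original scripts):

```python
import pandas as pd

def normalize_columns(df):
    """Return a copy with stripped, lowercased, snake_case column names."""
    out = df.copy()
    out.columns = (out.columns.str.strip()
                              .str.lower()
                              .str.replace(r'\s+', '_', regex=True))
    return out

df = pd.DataFrame({'Scientific Name': ['Mus musculus'], 'Genome size': [2.7]})
df = normalize_columns(df)
print(df.columns.tolist())  # ['scientific_name', 'genome_size']
```

The trade-off is losing the original display names, so printing the actual columns first remains the safer default.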

**Debugging tip**: Include column listing in all data processing scripts:

```python
# Add at script start for easy debugging
if '--debug' in sys.argv or len(df.columns) < 10:
    print(f"Columns ({len(df.columns)}): {df.columns.tolist()}")
```

Outlier Handling Best Practices


Two-Stage Outlier Removal


For analyses correlating characteristics across aggregated entities (e.g., species-level summaries):

1. **Stage 1: Count-based outliers (IQR method)**
   - Remove entities with abnormally high sample counts
   - Prevents over-represented entities from skewing correlations
   - Apply BEFORE other analyses

   ```python
   import numpy as np

   workflow_counts = [entity_data[id]['workflow_count'] for id in entity_data.keys()]
   q1 = np.percentile(workflow_counts, 25)
   q3 = np.percentile(workflow_counts, 75)
   iqr = q3 - q1
   upper_bound = q3 + 1.5 * iqr

   outliers = [id for id in entity_data.keys()
               if entity_data[id]['workflow_count'] > upper_bound]
   for id in outliers:
       del entity_data[id]
   ```

2. **Stage 2: Value-based outliers (percentile)**
   - Remove extreme values for visualization clarity
   - Apply ONLY to visualization data, not statistics
   - Typically top 5% for highly skewed distributions

   ```python
   values = [entity_data[id]['metric'] for id in entity_data.keys()]
   threshold = np.percentile(values, 95)
   viz_entities = [id for id in entity_data.keys()
                   if entity_data[id]['metric'] <= threshold]

   # Use viz_entities for plotting
   # Use full entity_data.keys() for statistics
   ```

Characteristic-Specific Outlier Removal


When analyzing genome characteristics vs metrics, remove outliers for the characteristic being analyzed:

```python
# After removing workflow count outliers, also remove heterozygosity outliers
heterozygosity_values = [species_data[sp]['heterozygosity'] for sp in species_data.keys()]
het_q1 = np.percentile(heterozygosity_values, 25)
het_q3 = np.percentile(heterozygosity_values, 75)
het_iqr = het_q3 - het_q1
het_upper_bound = het_q3 + 1.5 * het_iqr

het_outliers = [sp for sp in species_data.keys()
                if species_data[sp]['heterozygosity'] > het_upper_bound]
for sp in het_outliers:
    del species_data[sp]

vals = [species_data[sp]['heterozygosity'] for sp in species_data.keys()]
print(f'Removed {len(het_outliers)} heterozygosity outliers (>{het_upper_bound:.2f}%)')
print(f'New heterozygosity range: {min(vals):.2f}% - {max(vals):.2f}%')
```

**Apply separately for each characteristic**:
- Genome size outliers for genome size analysis
- Heterozygosity outliers for heterozygosity analysis
- Repeat content outliers for repeat content analysis
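Since the same IQR logic repeats per characteristic, it can be factored into a small helper (a sketch under the assumed dict-of-dicts layout; `remove_iqr_outliers` is a name introduced here, not from the original notebooks):

```python
import numpy as np

def remove_iqr_outliers(entity_data, key, factor=1.5):
    """Delete entities whose `key` value exceeds Q3 + factor*IQR; return removed ids."""
    values = [rec[key] for rec in entity_data.values()]
    q1, q3 = np.percentile(values, [25, 75])
    upper_bound = q3 + factor * (q3 - q1)
    removed = [eid for eid, rec in entity_data.items() if rec[key] > upper_bound]
    for eid in removed:
        del entity_data[eid]
    return removed

species_data = {f'sp{i}': {'heterozygosity': v}
                for i, v in enumerate([1.0, 1.2, 1.1, 0.9, 9.5])}
removed = remove_iqr_outliers(species_data, 'heterozygosity')
print(f'Removed {len(removed)} outliers')  # -> Removed 1 outliers
```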

When to Skip Outlier Removal


- Memory usage plots when investigating over-allocation patterns
- Comparison plots (allocated vs used) where outliers reveal problems
- User explicitly requests to see all data
- Data is already limited (< 10 points)

Document clearly in plot titles and code comments which outlier removal is applied.
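When removal is applied, the title itself can record it (a minimal sketch; the count and title text are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripted runs
import matplotlib.pyplot as plt

n_outliers = 12  # hypothetical count from an IQR filter
fig, ax = plt.subplots(figsize=(12, 8))
# Title documents exactly which filtering the plotted data received
ax.set_title(f'Genome Size vs CPU Hours\n({n_outliers} outliers removed, 1.5×IQR)')
```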
IQR-Based Outlier Removal for Visualization

Standard Method: 1.5×IQR (Interquartile Range)
Implementation:

```python
# Calculate IQR
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

# Define outlier boundaries (standard: 1.5×IQR)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter outliers
outlier_mask = (data >= lower_bound) & (data <= upper_bound)
data_filtered = data[outlier_mask]
n_outliers = (~outlier_mask).sum()

# IMPORTANT: Report outliers removed
print(f"Removed {n_outliers} outliers for visualization")
# Add to figure: f"({n_outliers} outliers removed)"
```

**Multi-dimensional Outlier Removal**:

```python
# For scatter plots with two dimensions (e.g., size ratio AND absolute size)
outlier_mask = (
    (ratio >= Q1_ratio - 1.5 * IQR_ratio) & (ratio <= Q3_ratio + 1.5 * IQR_ratio) &
    (size >= Q1_size - 1.5 * IQR_size) & (size <= Q3_size + 1.5 * IQR_size)
)
```

**Best Practice**: Always report number of outliers removed in figure statistics or caption.

**When to Use**: For visualization clarity when extreme values compress the main distribution. Not for removing "bad" data - use for display only.

Statistical Rigor


Required for Correlation Analyses

1. **Pearson correlation with p-values**:

   ```python
   from scipy import stats

   correlation, p_value = stats.pearsonr(x_values, y_values)
   sig_text = 'significant' if p_value < 0.05 else 'not significant'
   ```

2. **Report both metrics**:
   - Correlation coefficient (r) - strength and direction
   - P-value - statistical significance (α=0.05)
   - Sample size (n)

3. **Display on plots**:

   ```python
   ax.text(0.98, 0.02,
           f'r = {correlation:.3f}\np = {p_value:.2e}\n({sig_text})\nn = {len(data)} species',
           transform=ax.transAxes, ...)
   ```

Adding Mann-Whitney U Tests to Figures

When to Use: Comparing continuous metrics between two groups (e.g., Dual vs Pri/alt curation)
Standard Implementation:

```python
import numpy as np
from scipy import stats

# Calculate test
data_group1 = df[df['group'] == 'Group1']['metric']
data_group2 = df[df['group'] == 'Group2']['metric']

if len(data_group1) > 0 and len(data_group2) > 0:
    stat, pval = stats.mannwhitneyu(data_group1, data_group2, alternative='two-sided')
else:
    pval = np.nan

# Add to stats text
if not np.isnan(pval):
    stats_text += f"\nMann-Whitney p: {pval:.2e}"
```

**Display in Figures**: Include p-value in statistics box with format `Mann-Whitney p: 1.23e-04`

**Consistency**: Ensure all quantitative comparison figures include this test for statistical rigor.

Large-Scale Analysis Structure


Control Analyses: Checking for Confounding

When comparing methods (e.g., Method A vs Method B), always check if observed differences could be explained by characteristics of the samples rather than the methods themselves.
Critical control analysis:

```python
import pandas as pd
from scipy import stats

def check_confounding(df, method_col, characteristics):
    """
    Compare sample characteristics between methods to check for confounding.

    Args:
        df: DataFrame with samples
        method_col: Column indicating method ('Method_A', 'Method_B')
        characteristics: List of column names to compare

    Returns:
        DataFrame with statistical comparison
    """
    results = []

    for char in characteristics:
        # Get data for each method
        method_a = df[df[method_col] == 'Method_A'][char].dropna()
        method_b = df[df[method_col] == 'Method_B'][char].dropna()

        if len(method_a) < 5 or len(method_b) < 5:
            continue

        # Statistical test
        stat, pval = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')

        # Calculate effect size (% difference in medians)
        pooled_median = pd.concat([method_a, method_b]).median()
        effect_pct = (method_a.median() - method_b.median()) / pooled_median * 100

        results.append({
            'Characteristic': char,
            'Method_A_median': method_a.median(),
            'Method_A_n': len(method_a),
            'Method_B_median': method_b.median(),
            'Method_B_n': len(method_b),
            'p_value': pval,
            'effect_pct': effect_pct,
            'significant': pval < 0.05
        })

    return pd.DataFrame(results)
```

```python
# Example usage
characteristics = ['genome_size', 'gc_content', 'heterozygosity', 'repeat_content', 'sequencing_coverage']
confounding_check = check_confounding(df, 'curation_method', characteristics)
print(confounding_check)
```

**Interpretation guide**:
- **No significant differences**: Methods compared equivalent samples → valid comparison
- **Method A has "easier" samples** (smaller genomes, lower complexity): Quality differences may be due to sample properties, not method
- **Method A has "harder" samples** (larger genomes, higher complexity): Strengthens conclusion that Method A is better despite challenges
- **Limited data** (n<10): Cannot rule out confounding, note as limitation

**Present in notebook**:

```markdown
### Genome Characteristics Comparison

Control Analysis: Are quality differences due to method or sample properties?

[Table comparing characteristics]

Conclusion:
- If no differences → Valid method comparison
- If Method A works with harder samples → Strengthens conclusions
- If Method A works with easier samples → Potential confounding
```

**Why critical**: Reviewers will ask this question. Preemptive control analysis demonstrates scientific rigor and prevents major revisions.

Organizing 60+ Cell Notebooks

基因组特征对比

1. Section headers (markdown cells):
   - Main sections: "## CPU Runtime Analysis", "## Memory Analysis"
   - Subsections: "### Genome Size vs CPU Runtime"
2. Cell pairing pattern:
   - Markdown header + code cell for each analysis
   - Keeps related content together
   - Easier to navigate and debug
3. Consistent naming:
   - Figure files: `fig18_genome_size_vs_cpu_hours.png`
   - Variables: `species_data`, `genome_sizes_full`, `genome_sizes_viz`
   - Functions: `safe_float_convert()` defined consistently
4. Progressive enhancement:
   - Start with basic analyses
   - Add enriched data (Cell 7 pattern)
   - Build increasingly complex correlations
   - End with multivariate analyses (PCA)

Template Generation Pattern

管理60+单元格的Notebook

For creating multiple similar analysis cells:

```python
# Create template with placeholder variables
template = '''
if len(data_with_species) > 0:
    print('Analyzing {display} vs {metric}...\n')

    # Aggregate data per species
    species_data = {{}}

    for inv in data_with_species:
        {name} = safe_float_convert(inv.get('{name}'))
        if {name} is None:
            continue
        # ... analysis code
'''

# Generate multiple cells from characteristics list
characteristics = [
    {'name': 'genome_size', 'display': 'Genome Size', 'unit': 'Gb'},
    {'name': 'heterozygosity', 'display': 'Heterozygosity', 'unit': '%'},
    # ...
]
for char in characteristics:
    code = template.format(**char)
    # Write to notebook or temp file
```

Helper Function Pattern

Define once, reuse throughout:

```python
def safe_float_convert(value):
    """Convert string to float, handling comma separators"""
    if not value or not str(value).strip():
        return None
    try:
        return float(str(value).replace(',', ''))
    except (ValueError, TypeError):
        return None
```

Include in Cell 7 (enrichment) and reference: "# Helper function (same as Cell 7)"

Publication-Quality Figures


Standard settings:
- DPI: 300
- Figure size: (12, 8) for single plots, (16, 7) for side-by-side
- Grid: `alpha=0.3, linestyle='--'`
- Point size: proportional to sample count (`s=[50 + count*20 for count in counts]`)
- Colormap: 'viridis' for workflow counts

Publication-Ready Font Sizes


Problem: Default matplotlib fonts are designed for screen viewing, not print publication.
Solution: Use larger, bold fonts for print readability.
Recommended sizes (for standard 10-12 cm wide figures):

| Element | Default | Publication | Code |
| --- | --- | --- | --- |
| Title | 11-12pt | 18pt (bold) | `fontsize=18, fontweight='bold'` |
| Axis labels | 10-11pt | 16pt (bold) | `fontsize=16, fontweight='bold'` |
| Tick labels | 9-10pt | 14pt | `tick_params(labelsize=14)` |
| Legend | 8-10pt | 12pt | `legend(fontsize=12)` |
| Annotations | 8-10pt | 11-13pt | `fontsize=12` |
| Data points | 20-36 | 60-100 | `s=80` (scatter) |

Implementation example:

```python
fig, ax = plt.subplots(figsize=(10, 8))

# Plot data
ax.scatter(x, y, s=80, alpha=0.6)  # Larger points

# Titles and labels - BOLD
ax.set_title('Your Title Here', fontsize=18, fontweight='bold')
ax.set_xlabel('X Axis Label', fontsize=16, fontweight='bold')
ax.set_ylabel('Y Axis Label', fontsize=16, fontweight='bold')

# Tick labels
ax.tick_params(axis='both', which='major', labelsize=14)

# Legend
ax.legend(fontsize=12, loc='best')

# Stats box
stats_text = "Statistics:\nMean: 42.5"
ax.text(0.02, 0.98, stats_text, transform=ax.transAxes,
        fontsize=13, family='monospace',
        bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))

# Reference lines - thicker
ax.axhline(y=1.0, linewidth=2.5, linestyle='--', alpha=0.6)
```

**Quick check**: If you have to squint to read the figure on screen at 100% zoom, fonts are too small for print.

**Special cases**:
- Multi-panel figures: Increase 10-15% more
- Posters: Increase 50-100% more
- Presentations: Increase 30-50% more
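Rather than repeating `fontsize=` on every call, the same sizes can be set once per notebook via `rcParams` (a sketch mirroring the table above; scale the values for the special cases):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt

# Publication defaults, applied once at the top of the notebook
plt.rcParams.update({
    'axes.titlesize': 18,
    'axes.titleweight': 'bold',
    'axes.labelsize': 16,
    'axes.labelweight': 'bold',
    'xtick.labelsize': 14,
    'ytick.labelsize': 14,
    'legend.fontsize': 12,
    'figure.dpi': 300,
})

fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title('Inherits 18pt bold from rcParams')  # no fontsize= needed
```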

Accessibility: Colorblind-Safe Palettes


Problem: Standard color schemes (green vs blue, red vs green) are difficult or impossible to distinguish for people with color vision deficiencies, affecting ~8% of males and ~0.5% of females.
Solution: Use colorblind-safe palettes from validated sources.
IBM Color Blind Safe Palette (Recommended):

```python
# For comparing two groups/conditions
colors = {
    'Group_A': '#0173B2',  # Blue
    'Group_B': '#DE8F05'   # Orange
}
```

**Why this works**:
- ✅ Maximum contrast for all color vision types (deuteranopia, protanopia, tritanopia, achromatopsia)
- ✅ Professional appearance for scientific publications
- ✅ Clear distinction even in grayscale printing
- ✅ Cultural neutrality (no red/green traffic light associations)

**Other colorblind-safe combinations**:
- Blue + Orange (best overall)
- Blue + Red (good for most types)
- Blue + Yellow (good but lower contrast)

**Avoid**:
- ❌ Green + Red (most common color blindness)
- ❌ Green + Blue (confusing for many)
- ❌ Blue + Purple (too similar)

**Implementation in matplotlib**:
```python
import matplotlib.pyplot as plt

# Define colorblind-safe palette
CB_COLORS = {
    'blue': '#0173B2',
    'orange': '#DE8F05',
    'green': '#029E73',
    'red': '#D55E00',
    'purple': '#CC78BC',
    'brown': '#CA9161'
}

# Use in plots
plt.scatter(x, y, color=CB_COLORS['blue'], label='Treatment')
plt.scatter(x2, y2, color=CB_COLORS['orange'], label='Control')
```

**Testing your colors**:
- Use online simulators: https://www.color-blindness.com/coblis-color-blindness-simulator/
- Check in grayscale: Convert figure to grayscale to ensure distinguishability
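The grayscale check can also be approximated numerically by comparing the perceived brightness of the two hex colors (a rough sketch using standard luma weights, ignoring gamma correction; the helper name is introduced here):

```python
def luma(hex_color):
    """Approximate perceived brightness of a '#RRGGBB' color (0 = black, 1 = white)."""
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

blue, orange = '#0173B2', '#DE8F05'
gap = abs(luma(blue) - luma(orange))
print(f'Luminance gap: {gap:.2f}')  # a larger gap stays distinguishable in grayscale
```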

Handling Severe Data Imbalance in Comparisons


Problem: Comparing groups with very different sample sizes (e.g., 84 vs 10) can lead to misleading conclusions.
Solution: Add prominent warnings both visually and in documentation.
Visual warning on figure:

```python
import matplotlib.pyplot as plt

# After creating your plot
n_group_a = len(df[df['group'] == 'A'])
n_group_b = len(df[df['group'] == 'B'])
total_a = 200
total_b = 350

warning_text = f"⚠️ DATA LIMITATION\n"
warning_text += f"Data availability:\n"
warning_text += f"  Group A: {n_group_a}/{total_a} ({n_group_a/total_a*100:.1f}%)\n"
warning_text += f"  Group B: {n_group_b}/{total_b} ({n_group_b/total_b*100:.1f}%)\n"
warning_text += f"Severe imbalance limits\nstatistical comparability"

ax.text(0.98, 0.02, warning_text, transform=ax.transAxes,
        fontsize=11, verticalalignment='bottom', horizontalalignment='right',
        bbox=dict(boxstyle='round', facecolor='red', alpha=0.2,
                  edgecolor='red', linewidth=2),
        family='monospace', color='darkred', fontweight='bold')

# Update title to indicate limitation
ax.set_title('Your Title\n(SUPPLEMENTARY - Limited Data Availability)',
             fontsize=14, fontweight='bold')
```

**Text warning in notebook/paper**:

```markdown
**⚠️ CRITICAL DATA LIMITATION**: This figure suffers from severe data availability bias:
- Group A: 84/200 (42%)
- Group B: 10/350 (3%)

This **8-fold imbalance** severely limits statistical comparability. The 10 Group B
samples are unlikely to be representative of all 350.

**Interpretation**: Comparisons should be interpreted with extreme caution. This
figure is provided for completeness but should be considered **supplementary**.
```

Guidelines for sample size imbalance:
- < 2× imbalance: Generally acceptable, note in caption
- 2-5× imbalance: Add note about limitations
- > 5× imbalance: Add prominent warnings (visual + text)
- > 10× imbalance: Consider excluding figure or supplementary-only

Alternative: If possible, subset the larger group to match sample size:

```python
# Random subset to balance groups
if n_group_a > n_group_b * 2:
    group_a_subset = df[df['group'] == 'A'].sample(n=n_group_b * 2, random_state=42)
    # Use subset for balanced comparison
```

**Notebook/论文中的文字警告**:
```markdown
**⚠️ 关键数据限制**: 本图表存在严重的数据可用性偏差:
- A组: 84/200 (42%)
- B组: 10/350 (3%)

这种**8倍不平衡**严重限制了统计可比性。10个B组样本不太可能代表全部350个样本。

**解读**: 比较应极其谨慎。本图表仅为完整性提供,应视为**补充内容**。
样本量不平衡指南:
  • <2倍不平衡: 通常可接受,在说明中注明
  • 2-5倍不平衡: 添加关于局限性的说明
  • >5倍不平衡: 添加显著警告(视觉+文字)
  • >10倍不平衡: 考虑排除图表或仅作为补充内容
替代方案: 若可能,对较小组进行抽样以匹配样本量:
python
undefined
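The guideline tiers above can be encoded as a small helper for deciding how to present a figure. A sketch: the function name, the wording of the returned labels, and the zero-count guard are illustrative, not from the original text.

```python
def imbalance_guideline(n_small: int, n_large: int) -> str:
    """Map a sample-size imbalance to the recommended presentation tier."""
    if n_small == 0:
        return "no data in smaller group - exclude figure"
    ratio = n_large / n_small
    if ratio < 2:
        return "generally acceptable - note in caption"
    elif ratio <= 5:
        return "add note about limitations"
    elif ratio <= 10:
        return "add prominent warnings (visual + text)"
    return "consider excluding figure or supplementary-only"

print(imbalance_guideline(10, 84))  # → add prominent warnings (visual + text)
```

Calling this during figure generation makes the warning policy consistent across all panels instead of a per-figure judgment call.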

## Creating Analysis Notebooks for Scientific Publications

When creating Jupyter notebooks to accompany manuscript figures:

### Structure Pattern

  1. Title and metadata - Date, dataset info, sample sizes
  2. Overview - Context from paper abstract/intro
  3. Figure-by-figure analysis:
    • Code cell to display image
    • Detailed figure legend (publication-ready)
    • Comprehensive analysis paragraph explaining:
      • What the metric measures
      • Statistical results
      • Mechanistic explanation
      • Biological/technical implications
  4. Methods section - Complete reproducibility information
  5. Conclusions - Summary of findings
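The five-part structure above can also be scaffolded programmatically as raw `.ipynb` JSON (the same layout used in the Programmatic Notebook Manipulation section). A minimal sketch; the section titles and output filename are placeholders:

```python
import json

# Scaffold the five-part structure as markdown cells in raw .ipynb JSON
sections = [
    "# Analysis Title\n\nDate, dataset info, sample sizes",
    "## Overview\n\nContext from paper abstract/intro",
    "## Figure 1\n\nImage display, legend, and analysis paragraph",
    "## Methods\n\nComplete reproducibility information",
    "## Conclusions\n\nSummary of findings",
]

notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": text}
        for text in sections
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

with open("analysis_scaffold.ipynb", "w") as f:
    json.dump(notebook, f, indent=1)
```

Opening the scaffold in Jupyter gives empty, correctly ordered sections to fill in, so every analysis starts from the same skeleton.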

### Table of Contents

For analysis notebooks >10 cells, add a navigable table of contents at the top.

Benefits:
  • Quick navigation to specific analyses
  • Clear overview of notebook structure
  • Professional presentation
  • Easier for collaborators

Implementation (Markdown cell):

```markdown
# Analysis Name

## Table of Contents
1. [Data Loading](#data-loading)
2. [Data Quality Metrics](#data-quality-metrics)
3. [Figure 1: Completeness](#figure-1-completeness)
4. [Figure 2: Contiguity](#figure-2-contiguity)
5. [Figure 3: Scaffold Validation](#figure-3-scaffold-validation)
...
6. [Methods](#methods)
7. [References](#references)
```

**Section headers** (Markdown cells):

```markdown
## Data Loading
[Your code/analysis]

## Data Quality Metrics
[Your code/analysis]
```

**Auto-generation**: For large notebooks, consider generating the TOC programmatically:

```python
from IPython.display import Markdown, display

sections = ['Introduction', 'Data Loading', 'Analysis', ...]
toc = "## Table of Contents\n\n"
for i, section in enumerate(sections, 1):
    anchor = section.lower().replace(' ', '-')
    toc += f"{i}. [{section}](#{anchor})\n"

display(Markdown(toc))
```

### Methods Documentation

Always include a Methods section documenting:
  • Data sources with accession numbers
  • Key algorithms and formulas
  • Statistical approaches
  • Software versions
  • Special adjustments (e.g., sex chromosome correction)
  • Literature citations

Example:

```markdown
## Methods

### Karyotype Data

Karyotype data (diploid 2n and haploid n chromosome numbers) was manually curated from peer-reviewed literature for 97 species representing 17.8% of the VGP Phase 1 dataset (n = 545 assemblies).

### Sex Chromosome Adjustment

When both sex chromosomes are present in the main haplotype, the expected number of chromosome-level scaffolds is:

    expected_scaffolds = n + 1

For example:
- Asian elephant: 2n=56, n=28, has X+Y → expected 29 scaffolds
- White-throated sparrow: 2n=82, n=41, has Z+W → expected 42 scaffolds

This adjustment accounts for the biological reality that X and Y (or Z and W) are distinct chromosomes.
```
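The adjustment above is easy to encode as a helper when computing expected scaffold counts across many species. A sketch; the function and argument names are illustrative:

```python
def expected_scaffolds(n_haploid: int, both_sex_chromosomes: bool) -> int:
    """Expected chromosome-level scaffolds: n + 1 when both sex
    chromosomes (X+Y or Z+W) are present in the main haplotype."""
    return n_haploid + 1 if both_sex_chromosomes else n_haploid

# Asian elephant: 2n=56, n=28, X+Y present
print(expected_scaffolds(28, True))   # → 29
# White-throated sparrow: 2n=82, n=41, Z+W present
print(expected_scaffolds(41, True))   # → 42
```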

### Writing Style Matching

To match manuscript style:
  • Read the draft paper PDF to extract tone and terminology
  • Use the same technical vocabulary
  • Match paragraph structure (observation → mechanism → implication)
  • Include specific details (tool names, file formats, software versions)
  • Use first-person plural ("we") if the paper does
  • Maintain consistent bullet point/list formatting

### Example Code Pattern

```python
# Display figure
from IPython.display import Image, display
from pathlib import Path

FIG_DIR = Path('figures/analysis_name')
display(Image(filename=str(FIG_DIR / 'figure_01.png')))
```

### Figure Legend Format

Figure N. [Short title]. [Complete description of panels and what's shown]. [Statistical tests used]. [Sample sizes]. [Scale information]. [Color coding].

### Analysis Paragraph Structure

  1. What it measures - Define the metric/comparison
  2. Statistical result - Quantitative findings with p-values
  3. Mechanistic explanation - Why this result occurs
  4. Implications - What this means for conclusions

### Methods Section Must Include

  • Dataset source and filtering criteria
  • Metric definitions
  • Outlier handling approach
  • Statistical methods with justification
  • Software versions and tools
  • Reproducibility information
  • Known limitations

This approach creates notebooks that serve both as analysis documentation and as supplementary material for publications.

## Environment Setup

For CLI-based workflows (Claude Code, SSH sessions):

```bash
# Run in background with token authentication
/path/to/conda/envs/ENV_NAME/bin/jupyter lab --no-browser --port=8888
```

**Parameters**:
- `--no-browser`: Don't auto-open a browser (for remote sessions)
- `--port=8888`: Specify port (the default; change if occupied)
- Run in background: Use `run_in_background=true` in the Bash tool

**Access URL format**:

**To stop later**:
- Find the shell ID from the BashOutput tool
- Use KillShell with that ID

**Installation if missing**:

```bash
/path/to/conda/envs/ENV_NAME/bin/pip install jupyterlab
```

## Notebook Size Management

For notebooks > 256 KB:
  • Use `jq` to read specific cells: `cat notebook.ipynb | jq '.cells[10:20]'`
  • Count cells: `cat notebook.ipynb | jq '.cells | length'`
  • Check sections: `cat notebook.ipynb | jq '.cells[75:81] | .[].source[:2]'`
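If `jq` is not available, the same inspection can be done from Python's standard library, since an `.ipynb` file is plain JSON. A sketch; the `notebook_summary` name and default cell range are illustrative:

```python
import json

def notebook_summary(path, start=75, stop=81):
    """Mirror the jq queries above: total cell count plus the first
    two source lines of each cell in the given range."""
    with open(path) as f:
        nb = json.load(f)
    count = len(nb['cells'])                                   # jq '.cells | length'
    peek = [cell['source'][:2] for cell in nb['cells'][start:stop]]
    return count, peek
```

Usage: `count, peek = notebook_summary('notebook.ipynb')` gives a quick orientation in a notebook too large to open comfortably.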

## Data Enrichment Pattern

When linking external metadata with analysis data:

```python
# Cell 6: Load genome metadata
import csv

genome_data = []
with open('genome_metadata.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t')
    genome_data = list(reader)

genome_lookup = {}
for row in genome_data:
    species_id = row['species_id']
    if species_id not in genome_lookup:
        genome_lookup[species_id] = []
    genome_lookup[species_id].append(row)

# Cell 7: Enrich workflow data with genome characteristics
for inv in data:
    species_id = inv.get('species_id')

    if species_id and species_id in genome_lookup:
        genome_info = genome_lookup[species_id][0]

        # Add genome characteristics
        inv['genome_size'] = genome_info.get('Genome size', '')
        inv['heterozygosity'] = genome_info.get('Heterozygosity', '')
        # ... other characteristics
    else:
        # Set to None for missing data
        inv['genome_size'] = None
        inv['heterozygosity'] = None

# Create filtered dataset
data_with_species = [inv for inv in data if inv.get('species_id') and inv.get('genome_size')]
```

## Data Backup Strategy

### The Problem

Long-running data enrichment projects risk:
  • Losing days of work from accidental overwrites
  • Being unable to revert to previous data states
  • No documentation of what changed when
  • Running out of disk space from manual backups

### Solution: Automated Two-Tier Backup System

Architecture:
  1. Daily backups - Rolling 7-day window (auto-cleanup)
  2. Milestone backups - Permanent, compressed (gzip, ~80% reduction)
  3. CHANGELOG - Automatic documentation of all changes

Implementation:

```bash
# Daily backup (start of each work session)
./backup_table.sh

# Milestone backup (after major changes)
./backup_table.sh milestone "added genomescope data for 21 species"

# List all backups
./backup_table.sh list

# Restore from backup (with safety backup)
./backup_table.sh restore 2026-01-23
```

**Directory structure:**

```
backups/
├── daily/              # Rolling 7-day backups (~770KB each)
│   ├── backup_2026-01-17.csv
│   └── backup_2026-01-23.csv
├── milestones/         # Permanent compressed backups (~200KB each)
│   ├── milestone_2026-01-20_initial_enrichment.csv.gz
│   └── milestone_2026-01-23_recovered_accessions.csv.gz
├── CHANGELOG.md        # Auto-generated change log
└── README.md           # User documentation
```

**Storage efficiency:**
- Daily backups: ~5.4 MB (7 days × 770KB)
- Milestone backups: ~200KB each compressed (80% size reduction)
- Total: <10 MB for complete project history
- Old daily backups auto-delete after 7 days

**When to create milestones:**
- After adding new data sources (GenomeScope, karyotypes, etc.)
- Before major data transformations
- When completing analysis sections
- Before submitting/publishing

**Global installer available:**

```bash
# Install backup system in any repository
install-backup-system -f your_data_file.csv
```

**Key features:**
- Never overwrites without confirmation
- Creates a safety backup before restore
- Complete audit trail in CHANGELOG
- Color-coded terminal output
- Handles both CSV and TSV files

**Benefits for data analysis:**
- **Data provenance** - CHANGELOG documents every modification
- **Confidence to experiment** - Easy rollback encourages trying approaches
- **Professional workflow** - Matches publication standards
- **Collaboration-ready** - Team members can understand data history

## Debugging Data Availability

Before creating correlation plots, verify data overlap:

```python
# Check how many entities have both metrics
species_with_metric_a = set(inv.get('species_id') for inv in data if inv.get('metric_a'))
species_with_metric_b = set(inv.get('species_id') for inv in data if inv.get('metric_b'))

overlap = species_with_metric_a.intersection(species_with_metric_b)
print(f"Species with both metrics: {len(overlap)}")

if len(overlap) < 10:
    print("⚠️ Warning: Limited data for correlation analysis")
    print(f"  Metric A: {len(species_with_metric_a)} species")
    print(f"  Metric B: {len(species_with_metric_b)} species")
    print(f"  Overlap: {len(overlap)} species")
```

## Variable State Validation

When debugging notebook errors, add validation cells to check variable integrity:

```python
# Validation cell - place before error-prone sections
print('=== VARIABLE VALIDATION ===')
print(f'Type of data: {type(data)}')
print(f'Is data a list? {isinstance(data, list)}')

if isinstance(data, list):
    print(f'Length: {len(data)}')
    if len(data) > 0:
        print(f'First item type: {type(data[0])}')
        print(f'First item keys: {list(data[0].keys())[:10]}')
elif isinstance(data, dict):
    print('⚠️ WARNING: data is a dict, not a list!')
    print(f'Dict keys: {list(data.keys())[:10]}')
    print('This suggests variable shadowing occurred.')
```

**When to use**:
- After "Restart & Run All" produces errors
- When error messages suggest a wrong variable type
- Before cells that fail intermittently
- In notebooks with 50+ cells

**Best practice**: Include automatic validation in cells that depend on critical global variables.

## Programmatic Notebook Manipulation

When inserting cells into large notebooks:

```python
import json

# Read notebook
with open('notebook.ipynb', 'r') as f:
    notebook = json.load(f)

# Create new cell ('code' holds the new cell's source as one string)
new_cell = {
    "cell_type": "code",
    "execution_count": None,
    "metadata": {},
    "outputs": [],
    "source": [line + '\n' for line in code.split('\n')]
}

# Insert at position
insert_position = 50
notebook['cells'] = (notebook['cells'][:insert_position]
                     + [new_cell]
                     + notebook['cells'][insert_position:])

# Write back
with open('notebook.ipynb', 'w') as f:
    json.dump(notebook, f, indent=1)
```

## Synchronizing Figure Code and Notebook Documentation

Pattern: Code changes to figure generation → must update notebook text.

Common scenario: Updated figure filtering, outlier removal, or statistical tests.

Workflow:
  1. Update the figure generation Python script
  2. Regenerate figures
  3. CRITICAL: Update the Jupyter notebook markdown cells documenting the figure
  4. Use the NotebookEdit tool (NOT the Edit tool) for .ipynb files

Example:

```python
# After adding Mann-Whitney test to figure generation:
NotebookEdit(
    notebook_path="/path/to/notebook.ipynb",
    cell_id="cell-14",  # Found via grep or Read
    cell_type="markdown",
    new_source="Updated description mentioning Mann-Whitney test..."
)
```

**Finding Figure Cells**:

```bash
# Locate figure references
grep -n "figure_name.png" notebook.ipynb

# Or use Glob + Grep
grep -n "Figure 4" notebook.ipynb
```

**Why Critical**: Outdated documentation causes confusion. Notebook text saying "Limited data" when data is now complete, or not mentioning new statistical tests, misleads readers.

## Best Practices Summary

  1. Always check data availability before creating analyses
  2. Document outlier removal clearly in titles and comments
  3. Use consistent naming for variables and figures
  4. Include statistical testing for all correlations
  5. Separate visualization from statistics when filtering outliers
  6. Create templates for repetitive analyses
  7. Use helper functions consistently across cells
  8. Organize with markdown headers for navigation
  9. Test with small datasets before running full analyses
  10. Save intermediate results for expensive computations
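Practice 10 (saving intermediate results) can be as simple as a pickle-based cache around an expensive step. A sketch; the cache path and the lambda stand in for a real computation:

```python
import pickle
from pathlib import Path

def cached(path, compute):
    """Return a pickled intermediate result if cached;
    otherwise compute it, save it, and return it."""
    cache = Path(path)
    if cache.exists():
        with cache.open('rb') as f:
            return pickle.load(f)
    result = compute()
    cache.parent.mkdir(parents=True, exist_ok=True)
    with cache.open('wb') as f:
        pickle.dump(result, f)
    return result

# First call computes and saves; later calls (and later sessions) load the cache
correlations = cached('cache/correlations.pkl', lambda: {'r': 0.82})
```

Deleting the cache file forces a recompute, which pairs well with "Restart & Run All" checks.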

## Common Tasks

### Removing Panels from Multi-Panel Figures

Scenario: Convert a 2-panel figure to 1-panel after removing unavailable data.

Steps:
  1. Update the subplot layout:

     ```python
     # Before: 2 panels
     fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

     # After: 1 panel
     fig, ax = plt.subplots(1, 1, figsize=(10, 6))
     ```

  2. Remove panel code: Delete all code for the removed panel (ax2)
  3. Update the figure filename:

     ```python
     # Before
     plt.savefig('06_scaffold_l50_l90_comparison.png')

     # After
     plt.savefig('06_scaffold_l50_comparison.png')
     ```

  4. Update notebook references:
    • Image display: `display(Image(...'06_scaffold_l50_comparison.png'))`
    • Title: Remove references to removed data
    • Description: Add a note about why the panel is excluded
  5. Clean up old files:

     ```bash
     rm figures/*_l50_l90_*.png
     ```