# Jupyter Notebook Analysis Patterns
Expert knowledge for creating comprehensive, statistically rigorous Jupyter notebook analyses.
## When to Use This Skill
- Creating multi-cell Jupyter notebooks for data analysis
- Adding correlation analyses with statistical testing
- Implementing outlier removal strategies
- Building series of related visualizations (10+ figures)
- Analyzing large datasets with multiple characteristics
## Common Pitfalls
### Variable Shadowing in Loops
**Problem**: Using common variable names like `data` as loop variables overwrites global variables:

```python
# BAD - shadows the global 'data' variable
for i, (sp, data) in enumerate(species_by_gc_content[:10], 1):
    val = data['gc_content']
    print(f'{sp}: {val}')
```
After this loop, `data` is no longer your dataset list - it's the last species dict!
**Solution**: Use descriptive loop variable names:
```python
# GOOD - uses a specific name
for i, (sp, sp_data) in enumerate(species_by_gc_content[:10], 1):
    val = sp_data['gc_content']
    print(f'{sp}: {val}')
```

**Detection**: If you see errors like "Type: <class 'dict'>" when expecting a list, check for variable shadowing in recent cells.
**Prevention**:
- Never use generic names (`data`, `item`, `value`) as loop variables
- Use prefixed names (`sp_data`, `row_data`, `inv_data`)
- Add validation cells that check variable types
- Run "Restart & Run All" regularly to catch issues early
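The "validation cells" suggestion above can be as simple as an assert cell re-run after loop-heavy cells. A minimal sketch (the variable names and values here are hypothetical stand-ins for your notebook's globals):

```python
# Hypothetical stand-ins for the notebook's global dataset variables
data = [{'gc_content': 41.2}, {'gc_content': 38.7}]

# Validation cell: fails loudly if a later loop shadowed 'data'
assert isinstance(data, list), f"'data' was shadowed: now {type(data)}"
print(f"'data' OK: list of {len(data)} records")
```

Place one such cell after each block of loops; "Restart & Run All" will then stop at the first shadowing bug instead of failing mysteriously later.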
**Common shadowing patterns to avoid**:

```python
for data in dataset:             # shadows 'data'
for i, data in enumerate(items): # shadows 'data'
for key, data in d.items():      # shadows 'data'
```

### Verify Column Names Before Processing
**Problem**: Assuming column names without checking the actual DataFrame structure leads to immediate failures. Column names may use different capitalization, spacing, or naming conventions than expected.

**Example error**:

```python
# Assumed column name
df_filtered = df[df['scientific_name'] == target]  # KeyError!
# Actual column name was 'Scientific Name' (capitalized, with a space)
```
**Solution**: Always check actual columns first:
```python
import pandas as pd

df = pd.read_csv('data.csv')

# ALWAYS print columns before processing
print("Available columns:")
print(df.columns.tolist())

# Then write filtering code with the correct names
df_filtered = df[df['Scientific Name'] == target_species]  # Correct
```
**Best practice for data processing scripts**:

```python
import sys

# At the start of your script
def verify_required_columns(df, required_cols):
    """Verify the DataFrame has the required columns."""
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        print(f"ERROR: Missing columns: {missing}")
        print(f"Available columns: {df.columns.tolist()}")
        sys.exit(1)

# Use it
required = ['Scientific Name', 'tolid', 'accession']
verify_required_columns(df, required)
```
**Common column name variations to watch for:**
- `scientific_name` vs `Scientific Name` vs `ScientificName`
- `species_id` vs `species` vs `Species ID`
- `genome_size` vs `Genome size` vs `GenomeSize`
**Debugging tip**: Include a column listing in all data processing scripts:

```python
# Add at script start for easy debugging
if '--debug' in sys.argv or len(df.columns) < 10:
    print(f"Columns ({len(df.columns)}): {df.columns.tolist()}")
```
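One way to guard against the naming variations listed above is to normalize column names once at load time. A sketch under stated assumptions: `to_snake` is a hypothetical helper, not part of the original scripts, and you must update downstream code to use the normalized names:

```python
import re

def to_snake(name):
    """Normalize a column name: CamelCase, spaces, hyphens -> snake_case."""
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name.strip())  # GenomeSize -> Genome_Size
    name = re.sub(r'[\s\-]+', '_', name)                          # spaces/hyphens -> _
    return name.lower()

# pandas accepts a callable for renaming: df = df.rename(columns=to_snake)
for raw in ['Scientific Name', 'GenomeSize', 'Genome size', 'species_id']:
    print(f'{raw!r} -> {to_snake(raw)!r}')
```

This collapses `scientific_name` / `Scientific Name` / `ScientificName` to a single spelling, so only one name needs verifying.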
## Outlier Handling Best Practices

### Two-Stage Outlier Removal
For analyses correlating characteristics across aggregated entities (e.g., species-level summaries):
**Stage 1: Count-based outliers (IQR method)**
- Remove entities with abnormally high sample counts
- Prevents over-represented entities from skewing correlations
- Apply BEFORE other analyses

```python
import numpy as np

workflow_counts = [entity_data[eid]['workflow_count'] for eid in entity_data]
q1 = np.percentile(workflow_counts, 25)
q3 = np.percentile(workflow_counts, 75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr

outliers = [eid for eid in entity_data
            if entity_data[eid]['workflow_count'] > upper_bound]
for eid in outliers:
    del entity_data[eid]
```

**Stage 2: Value-based outliers (percentile)**
- Remove extreme values for visualization clarity
- Apply ONLY to visualization data, not statistics
- Typically the top 5% for highly skewed distributions

```python
values = [entity_data[eid]['metric'] for eid in entity_data]
threshold = np.percentile(values, 95)
viz_entities = [eid for eid in entity_data
                if entity_data[eid]['metric'] <= threshold]
# Use viz_entities for plotting
# Use the full entity_data for statistics
```
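The IQR fence used in Stage 1 can be packaged as a small reusable helper, which also makes it easy to unit-test. A sketch (`iqr_upper_bound` is a hypothetical name, not from the original notebooks):

```python
import numpy as np

def iqr_upper_bound(values, k=1.5):
    """Return the Q3 + k*IQR upper fence used for count-based outlier removal."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + k * (q3 - q1)

counts = [1, 2, 2, 3, 3, 3, 4, 50]  # one entity with an extreme workflow count
bound = iqr_upper_bound(counts)
print([c for c in counts if c > bound])  # → [50]
```

Defining it once (e.g., alongside other helpers) avoids copy-pasting the four-line IQR calculation into every characteristic-specific cell.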
### Characteristic-Specific Outlier Removal
When analyzing genome characteristics vs metrics, remove outliers for the characteristic being analyzed:
```python
# After removing workflow count outliers, also remove heterozygosity outliers
heterozygosity_values = [species_data[sp]['heterozygosity'] for sp in species_data]
het_q1 = np.percentile(heterozygosity_values, 25)
het_q3 = np.percentile(heterozygosity_values, 75)
het_iqr = het_q3 - het_q1
het_upper_bound = het_q3 + 1.5 * het_iqr

het_outliers = [sp for sp in species_data
                if species_data[sp]['heterozygosity'] > het_upper_bound]
for sp in het_outliers:
    del species_data[sp]

remaining = [species_data[sp]['heterozygosity'] for sp in species_data]
print(f'Removed {len(het_outliers)} heterozygosity outliers (>{het_upper_bound:.2f}%)')
print(f'New heterozygosity range: {min(remaining):.2f}% - {max(remaining):.2f}%')
```

**Apply separately for each characteristic**:
- Genome size outliers for genome size analysis
- Heterozygosity outliers for heterozygosity analysis
- Repeat content outliers for repeat content analysis

### When to Skip Outlier Removal
- Memory usage plots when investigating over-allocation patterns
- Comparison plots (allocated vs used) where outliers reveal problems
- User explicitly requests to see all data
- Data is already limited (< 10 points)
Document clearly in plot titles and code comments which outlier removal is applied.
### IQR-Based Outlier Removal for Visualization

**Standard method**: 1.5×IQR (interquartile range)

**Implementation**:

```python
# Calculate IQR
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

# Define outlier boundaries (standard: 1.5×IQR)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter outliers
outlier_mask = (data >= lower_bound) & (data <= upper_bound)
data_filtered = data[outlier_mask]
n_outliers = (~outlier_mask).sum()

# IMPORTANT: Report outliers removed
print(f"Removed {n_outliers} outliers for visualization")
# Add to figure: f"({n_outliers} outliers removed)"
```

**Multi-dimensional outlier removal**:

```python
# For scatter plots with two dimensions (e.g., size ratio AND absolute size)
outlier_mask = (
    (ratio >= Q1_ratio - 1.5 * IQR_ratio) &
    (ratio <= Q3_ratio + 1.5 * IQR_ratio) &
    (size >= Q1_size - 1.5 * IQR_size) &
    (size <= Q3_size + 1.5 * IQR_size)
)
```

**Best practice**: Always report the number of outliers removed in the figure statistics or caption.

**When to use**: For visualization clarity when extreme values compress the main distribution. Not for removing "bad" data; use for display only.
## Statistical Rigor

### Required for Correlation Analyses

**Pearson correlation with p-values**:

```python
from scipy import stats

correlation, p_value = stats.pearsonr(x_values, y_values)
sig_text = 'significant' if p_value < 0.05 else 'not significant'
```

**Report both metrics**:
- Correlation coefficient (r): strength and direction
- P-value: statistical significance (α = 0.05)
- Sample size (n)

**Display on plots**:

```python
ax.text(0.98, 0.02,
        f'r = {correlation:.3f}\np = {p_value:.2e}\n({sig_text})\nn = {len(data)} species',
        transform=ax.transAxes, ...)
```

### Adding Mann-Whitney U Tests to Figures

**When to use**: Comparing continuous metrics between two groups (e.g., Dual vs Pri/alt curation)

**Standard implementation**:
```python
from scipy import stats
import numpy as np

# Calculate test
data_group1 = df[df['group'] == 'Group1']['metric']
data_group2 = df[df['group'] == 'Group2']['metric']
if len(data_group1) > 0 and len(data_group2) > 0:
    stat, pval = stats.mannwhitneyu(data_group1, data_group2, alternative='two-sided')
else:
    pval = np.nan

# Add to stats text
if not np.isnan(pval):
    stats_text += f"\nMann-Whitney p: {pval:.2e}"
```

**Display in figures**: Include the p-value in the statistics box with the format `Mann-Whitney p: 1.23e-04`

**Consistency**: Ensure all quantitative comparison figures include this test for statistical rigor.
## Large-Scale Analysis Structure

### Control Analyses: Checking for Confounding
When comparing methods (e.g., Method A vs Method B), always check if observed differences could be explained by characteristics of the samples rather than the methods themselves.
**Critical control analysis**:

```python
import pandas as pd
from scipy import stats

def check_confounding(df, method_col, characteristics):
    """
    Compare sample characteristics between methods to check for confounding.

    Args:
        df: DataFrame with samples
        method_col: Column indicating method ('Method_A', 'Method_B')
        characteristics: List of column names to compare

    Returns:
        DataFrame with statistical comparison
    """
    results = []
    for char in characteristics:
        # Get data for each method
        method_a = df[df[method_col] == 'Method_A'][char].dropna()
        method_b = df[df[method_col] == 'Method_B'][char].dropna()
        if len(method_a) < 5 or len(method_b) < 5:
            continue

        # Statistical test
        stat, pval = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')

        # Calculate effect size (% difference in medians)
        pooled_median = pd.concat([method_a, method_b]).median()
        effect_pct = (method_a.median() - method_b.median()) / pooled_median * 100

        results.append({
            'Characteristic': char,
            'Method_A_median': method_a.median(),
            'Method_A_n': len(method_a),
            'Method_B_median': method_b.median(),
            'Method_B_n': len(method_b),
            'p_value': pval,
            'effect_pct': effect_pct,
            'significant': pval < 0.05
        })
    return pd.DataFrame(results)

# Example usage
characteristics = ['genome_size', 'gc_content', 'heterozygosity',
                   'repeat_content', 'sequencing_coverage']
confounding_check = check_confounding(df, 'curation_method', characteristics)
print(confounding_check)
```
**Interpretation guide**:
- **No significant differences**: Methods compared equivalent samples → valid comparison
- **Method A has "easier" samples** (smaller genomes, lower complexity): Quality differences may be due to sample properties, not method
- **Method A has "harder" samples** (larger genomes, higher complexity): Strengthens conclusion that Method A is better despite challenges
- **Limited data** (n<10): Cannot rule out confounding, note as limitation
**Present in notebook**:

```markdown
## Genome Characteristics Comparison

**Control Analysis**: Are quality differences due to method or sample properties?

[Table comparing characteristics]

**Conclusion**:
- If no differences → Valid method comparison
- If Method A works with harder samples → Strengthens conclusions
- If Method A works with easier samples → Potential confounding
```

**Why critical**: Reviewers will ask this question. Preemptive control analysis demonstrates scientific rigor and prevents major revisions.
### Organizing 60+ Cell Notebooks

- **Section headers** (markdown cells):
  - Main sections: "## CPU Runtime Analysis", "## Memory Analysis"
  - Subsections: "### Genome Size vs CPU Runtime"
- **Cell pairing pattern**:
  - Markdown header + code cell for each analysis
  - Keeps related content together
  - Easier to navigate and debug
- **Consistent naming**:
  - Figure files: `fig18_genome_size_vs_cpu_hours.png`
  - Variables: `species_data`, `genome_sizes_full`, `genome_sizes_viz`
  - Functions: defined consistently, e.g. `safe_float_convert()`
- **Progressive enhancement**:
  - Start with basic analyses
  - Add enriched data (Cell 7 pattern)
  - Build increasingly complex correlations
  - End with multivariate analyses (PCA)

### Template Generation Pattern
For creating multiple similar analysis cells:
```python
# Create a template with placeholder variables
template = '''
if len(data_with_species) > 0:
    print('Analyzing {display} vs {metric}...\\n')
    # Aggregate data per species
    species_data = {{}}
    for inv in data_with_species:
        {name} = safe_float_convert(inv.get('{name}'))
        if {name} is None:
            continue
        # ... analysis code
'''

# Generate multiple cells from the characteristics list
characteristics = [
    {'name': 'genome_size', 'display': 'Genome Size', 'unit': 'Gb'},
    {'name': 'heterozygosity', 'display': 'Heterozygosity', 'unit': '%'},
    # ...
]
for char in characteristics:
    code = template.format(metric='cpu_hours', **char)  # 'cpu_hours' is an example metric
    # Write to notebook or temp file
```
### Helper Function Pattern

Define once, reuse throughout:

```python
def safe_float_convert(value):
    """Convert a string to float, handling comma separators."""
    if not value or not str(value).strip():
        return None
    try:
        return float(str(value).replace(',', ''))
    except (ValueError, TypeError):
        return None
```

Include it in Cell 7 (enrichment) and reference it elsewhere: "# Helper function (same as Cell 7)"

## Publication-Quality Figures
**Standard settings**:
- DPI: 300
- Figure size: (12, 8) for single plots, (16, 7) for side-by-side
- Grid: `alpha=0.3, linestyle='--'`
- Point size: proportional to sample count (`s=[50 + count*20 for count in counts]`)
- Colormap: 'viridis' for workflow counts
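When these settings are reused across many figures, they can be applied once per notebook via `rcParams`. A sketch mapping the bullet values above onto matplotlib's configuration keys (save-time DPI is typically passed to `savefig` separately):

```python
import matplotlib.pyplot as plt

# Apply the standard settings once, near the top of the notebook
plt.rcParams.update({
    'figure.dpi': 300,
    'figure.figsize': (12, 8),   # use (16, 7) for side-by-side panels
    'grid.alpha': 0.3,
    'grid.linestyle': '--',
    'image.cmap': 'viridis',
})
print(plt.rcParams['figure.figsize'])
```

Individual figures can still override any setting locally (e.g., `plt.subplots(figsize=(16, 7))`).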
### Publication-Ready Font Sizes
**Problem**: Default matplotlib fonts are designed for screen viewing, not print publication.

**Solution**: Use larger, bold fonts for print readability.

**Recommended sizes** (for standard 10-12 cm wide figures):

| Element | Default | Publication | Code |
|---|---|---|---|
| Title | 11-12pt | 18pt (bold) | `fontsize=18, fontweight='bold'` |
| Axis labels | 10-11pt | 16pt (bold) | `fontsize=16, fontweight='bold'` |
| Tick labels | 9-10pt | 14pt | `labelsize=14` |
| Legend | 8-10pt | 12pt | `fontsize=12` |
| Annotations | 8-10pt | 11-13pt | `fontsize=11` to `fontsize=13` |
| Data points | 20-36 | 60-100 | `s=80` |
**Implementation example**:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 8))

# Plot data
ax.scatter(x, y, s=80, alpha=0.6)  # Larger points

# Titles and labels - BOLD
ax.set_title('Your Title Here', fontsize=18, fontweight='bold')
ax.set_xlabel('X Axis Label', fontsize=16, fontweight='bold')
ax.set_ylabel('Y Axis Label', fontsize=16, fontweight='bold')

# Tick labels
ax.tick_params(axis='both', which='major', labelsize=14)

# Legend
ax.legend(fontsize=12, loc='best')

# Stats box
stats_text = "Statistics:\nMean: 42.5"
ax.text(0.02, 0.98, stats_text, transform=ax.transAxes,
        fontsize=13, family='monospace',
        bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))

# Reference lines - thicker
ax.axhline(y=1.0, linewidth=2.5, linestyle='--', alpha=0.6)
```

**Quick check**: If you have to squint to read the figure on screen at 100% zoom, the fonts are too small for print.

**Special cases**:
- Multi-panel figures: increase 10-15% more
- Posters: increase 50-100% more
- Presentations: increase 30-50% more
### Accessibility: Colorblind-Safe Palettes

**Problem**: Standard color schemes (green vs blue, red vs green) are difficult or impossible to distinguish for people with color vision deficiencies, affecting ~8% of males and ~0.5% of females.

**Solution**: Use colorblind-safe palettes from validated sources.

**IBM Color Blind Safe palette (recommended)**:

```python
# For comparing two groups/conditions
colors = {
    'Group_A': '#0173B2',  # Blue
    'Group_B': '#DE8F05'   # Orange
}
```

**Why this works**:
- ✅ Maximum contrast for all color vision types (deuteranopia, protanopia, tritanopia, achromatopsia)
- ✅ Professional appearance for scientific publications
- ✅ Clear distinction even in grayscale printing
- ✅ Cultural neutrality (no red/green traffic light associations)

**Other colorblind-safe combinations**:
- Blue + Orange (best overall)
- Blue + Red (good for most types)
- Blue + Yellow (good but lower contrast)

**Avoid**:
- ❌ Green + Red (most common color blindness)
- ❌ Green + Blue (confusing for many)
- ❌ Blue + Purple (too similar)

**Implementation in matplotlib**:

```python
import matplotlib.pyplot as plt

# Define a colorblind-safe palette
CB_COLORS = {
    'blue': '#0173B2',
    'orange': '#DE8F05',
    'green': '#029E73',
    'red': '#D55E00',
    'purple': '#CC78BC',
    'brown': '#CA9161'
}

# Use in plots
plt.scatter(x, y, color=CB_COLORS['blue'], label='Treatment')
plt.scatter(x2, y2, color=CB_COLORS['orange'], label='Control')
```

**Testing your colors**:
- Use online simulators: https://www.color-blindness.com/coblis-color-blindness-simulator/
- Check in grayscale: convert the figure to grayscale to ensure distinguishability
### Handling Severe Data Imbalance in Comparisons

**Problem**: Comparing groups with very different sample sizes (e.g., 84 vs 10) can lead to misleading conclusions.

**Solution**: Add prominent warnings both visually and in documentation.

**Visual warning on figure**:

```python
# After creating your plot
n_group_a = len(df[df['group'] == 'A'])
n_group_b = len(df[df['group'] == 'B'])
total_a = 200
total_b = 350

warning_text = f"⚠️ DATA LIMITATION\n"
warning_text += f"Data availability:\n"
warning_text += f"  Group A: {n_group_a}/{total_a} ({n_group_a/total_a*100:.1f}%)\n"
warning_text += f"  Group B: {n_group_b}/{total_b} ({n_group_b/total_b*100:.1f}%)\n"
warning_text += f"Severe imbalance limits\nstatistical comparability"

ax.text(0.98, 0.02, warning_text, transform=ax.transAxes,
        fontsize=11, verticalalignment='bottom', horizontalalignment='right',
        bbox=dict(boxstyle='round', facecolor='red', alpha=0.2,
                  edgecolor='red', linewidth=2),
        family='monospace', color='darkred', fontweight='bold')

# Update the title to indicate the limitation
ax.set_title('Your Title\n(SUPPLEMENTARY - Limited Data Availability)',
             fontsize=14, fontweight='bold')
```

**Text warning in notebook/paper**:

```markdown
**⚠️ CRITICAL DATA LIMITATION**: This figure suffers from severe data availability bias:
- Group A: 84/200 (42%)
- Group B: 10/350 (3%)

This **8-fold imbalance** severely limits statistical comparability. The 10 Group B
samples are unlikely to be representative of all 350.

**Interpretation**: Comparisons should be interpreted with extreme caution. This
figure is provided for completeness but should be considered **supplementary**.
```

**Guidelines for sample size imbalance**:
- < 2× imbalance: generally acceptable, note in caption
- 2-5× imbalance: add a note about limitations
- > 5× imbalance: add prominent warnings (visual + text)
- > 10× imbalance: consider excluding the figure or making it supplementary-only

**Alternative**: If possible, subset the larger group to match the sample size:
```python
# Random subset to balance groups
if n_group_a > n_group_b * 2:
    group_a_subset = df[df['group'] == 'A'].sample(n=n_group_b * 2, random_state=42)
    # Use the subset for a balanced comparison
```

## Creating Analysis Notebooks for Scientific Publications

When creating Jupyter notebooks to accompany manuscript figures:

### Structure Pattern
- Title and metadata - Date, dataset info, sample sizes
- Overview - Context from paper abstract/intro
- Figure-by-figure analysis:
- Code cell to display image
- Detailed figure legend (publication-ready)
- Comprehensive analysis paragraph explaining:
- What the metric measures
- Statistical results
- Mechanistic explanation
- Biological/technical implications
- Methods section - Complete reproducibility information
- Conclusions - Summary of findings
### Table of Contents
For analysis notebooks >10 cells, add a navigable table of contents at the top:
Benefits:
- Quick navigation to specific analyses
- Clear overview of notebook structure
- Professional presentation
- Easier for collaborators
Implementation (Markdown cell):
```markdown
# Analysis Name

## Table of Contents
- [Data Loading](#data-loading)
- [Data Quality Metrics](#data-quality-metrics)
- [Figure 1: Completeness](#figure-1-completeness)
- [Figure 2: Contiguity](#figure-2-contiguity)
- [Figure 3: Scaffold Validation](#figure-3-scaffold-validation)
- ...
- [Methods](#methods)
- [References](#references)
```

**Section headers** (Markdown cells):
**Section Headers** (Markdown cells):
```markdown对于超过10个单元格的分析Notebook,在顶部添加可导航的目录:
好处:
- 快速导航到特定分析
- 清晰展示Notebook结构
- 专业呈现
- 方便协作者使用
实现(Markdown单元格):
markdown
undefinedData Loading
分析名称
—
目录
[Your code/analysis]
Data Quality Metrics
数据加载
[Your code/analysis]
**Auto-generation**: For large notebooks, consider generating TOC programmatically:
```python
from IPython.display import Markdown
sections = ['Introduction', 'Data Loading', 'Analysis', ...]
toc = "## Table of Contents\n\n"
for i, section in enumerate(sections, 1):
anchor = section.lower().replace(' ', '-')
toc += f"{i}. [{section}](#{anchor})\n"
display(Markdown(toc))[你的代码/分析]
### Methods Documentation

Always include a Methods section documenting:
- Data sources with accession numbers
- Key algorithms and formulas
- Statistical approaches
- Software versions
- Special adjustments (e.g., sex chromosome correction)
- Literature citations

Example:

```markdown
## Methods

### Karyotype Data

Karyotype data (diploid 2n and haploid n chromosome numbers) was manually curated from peer-reviewed literature for 97 species, representing 17.8% of the VGP Phase 1 dataset (n = 545 assemblies).

### Sex Chromosome Adjustment

When both sex chromosomes are present in the main haplotype, the expected number of chromosome-level scaffolds is:

    expected_scaffolds = n + 1

For example:
- Asian elephant: 2n=56, n=28, has X+Y → expected 29 scaffolds
- White-throated sparrow: 2n=82, n=41, has Z+W → expected 42 scaffolds

This adjustment accounts for the biological reality that X and Y (or Z and W) are distinct chromosomes.
```
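The adjustment above is simple enough to encode as a small helper so it is applied consistently across cells; `expected_scaffolds` is a hypothetical name for illustration, not a function from the original analysis:

```python
def expected_scaffolds(n_haploid, both_sex_chromosomes=False):
    """Expected number of chromosome-level scaffolds.

    When both sex chromosomes (X+Y or Z+W) are present in the main
    haplotype, one extra scaffold is expected beyond the haploid number n.
    """
    return n_haploid + 1 if both_sex_chromosomes else n_haploid

print(expected_scaffolds(28, both_sex_chromosomes=True))  # Asian elephant: 29
print(expected_scaffolds(41, both_sex_chromosomes=True))  # White-throated sparrow: 42
```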
### Writing Style Matching

To match manuscript style:
- Read the draft paper PDF to extract tone and terminology
- Use the same technical vocabulary
- Match paragraph structure (observation → mechanism → implication)
- Include specific details (tool names, file formats, software versions)
- Use first-person plural ("we") if the paper does
- Maintain consistent bullet point/list formatting
### Example Code Pattern

```python
# Display figure
from IPython.display import Image, display
from pathlib import Path

FIG_DIR = Path('figures/analysis_name')
display(Image(filename=str(FIG_DIR / 'figure_01.png')))
```

### Figure Legend Format

Figure N. [Short title]. [Complete description of panels and what is shown]. [Statistical tests used]. [Sample sizes]. [Scale information]. [Color coding].

### Analysis Paragraph Structure

- What it measures - Define the metric/comparison
- Statistical result - Quantitative findings with p-values
- Mechanistic explanation - Why this result occurs
- Implications - What this means for conclusions
### Methods Section Must Include

- Dataset source and filtering criteria
- Metric definitions
- Outlier handling approach
- Statistical methods with justification
- Software versions and tools
- Reproducibility information
- Known limitations

This approach creates notebooks that serve both as analysis documentation and as supplementary material for publications.
## Environment Setup

For CLI-based workflows (Claude Code, SSH sessions):

```bash
# Run in background with token authentication
/path/to/conda/envs/ENV_NAME/bin/jupyter lab --no-browser --port=8888
```

**Parameters**:
- `--no-browser`: Don't auto-open a browser (for remote sessions)
- `--port=8888`: Specify the port (the default; change it if occupied)
- Run in background: Use `run_in_background=true` in the Bash tool

**Access URL format**: `http://localhost:8888/lab?token=<token>`

**To stop later**:
- Find the shell ID from the BashOutput tool
- Use KillShell with that ID

**Installation if missing**:

```bash
/path/to/conda/envs/ENV_NAME/bin/pip install jupyterlab
```

## Notebook Size Management

For notebooks larger than 256 KB, use `jq` to inspect specific cells instead of reading the whole file:

```bash
# Read specific cells
cat notebook.ipynb | jq '.cells[10:20]'

# Count cells
cat notebook.ipynb | jq '.cells | length'

# Check sections
cat notebook.ipynb | jq '.cells[75:81] | .[].source[:2]'
```

## Data Enrichment Pattern

When linking external metadata with analysis data:
```python
# Cell 6: Load genome metadata
import csv

genome_data = []
with open('genome_metadata.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t')
    genome_data = list(reader)

# Build a lookup keyed by species_id
genome_lookup = {}
for row in genome_data:
    species_id = row['species_id']
    if species_id not in genome_lookup:
        genome_lookup[species_id] = []
    genome_lookup[species_id].append(row)
```

```python
# Cell 7: Enrich workflow data with genome characteristics
for inv in data:
    species_id = inv.get('species_id')
    if species_id and species_id in genome_lookup:
        genome_info = genome_lookup[species_id][0]
        # Add genome characteristics
        inv['genome_size'] = genome_info.get('Genome size', '')
        inv['heterozygosity'] = genome_info.get('Heterozygosity', '')
        # ... other characteristics
    else:
        # Set to None for missing data
        inv['genome_size'] = None
        inv['heterozygosity'] = None

# Create filtered dataset
data_with_species = [inv for inv in data if inv.get('species_id') and inv.get('genome_size')]
```

## Data Backup Strategy

### The Problem

Long-running data enrichment projects risk:
- Losing days of work to accidental overwrites
- Being unable to revert to previous data states
- Having no documentation of what changed when
- Running out of disk space from manual backups

### Solution: Automated Two-Tier Backup System
Architecture:
- Daily backups - Rolling 7-day window (auto-cleanup)
- Milestone backups - Permanent, compressed (gzip, ~80% size reduction)
- CHANGELOG - Automatic documentation of all changes

Implementation:

```bash
# Daily backup (start of each work session)
./backup_table.sh

# Milestone backup (after major changes)
./backup_table.sh milestone "added genomescope data for 21 species"

# List all backups
./backup_table.sh list

# Restore from backup (with safety backup)
./backup_table.sh restore 2026-01-23
```
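The `backup_table.sh` script itself is not reproduced here; the sketch below is an assumed Python rendering of the same two-tier idea (file names, layout, and function names are illustrative), useful when a shell script is not an option:

```python
import gzip
import shutil
import time
from pathlib import Path

DATA_FILE = Path("your_data_file.csv")   # assumed data file name
BACKUP_DIR = Path("backups")

def daily_backup(keep_days=7):
    """Copy the data file into backups/daily/ and prune copies older than keep_days."""
    daily = BACKUP_DIR / "daily"
    daily.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y-%m-%d")
    dest = daily / f"backup_{stamp}{DATA_FILE.suffix}"
    shutil.copy2(DATA_FILE, dest)
    cutoff = time.time() - keep_days * 86400
    for old in daily.glob("backup_*"):
        if old.stat().st_mtime < cutoff:
            old.unlink()
    return dest

def milestone_backup(note):
    """Write a permanent gzip-compressed milestone backup and append to the CHANGELOG."""
    milestones = BACKUP_DIR / "milestones"
    milestones.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y-%m-%d")
    slug = note.replace(" ", "_")[:40]
    dest = milestones / f"milestone_{stamp}_{slug}{DATA_FILE.suffix}.gz"
    with open(DATA_FILE, "rb") as src, gzip.open(dest, "wb") as out:
        shutil.copyfileobj(src, out)
    with open(BACKUP_DIR / "CHANGELOG.md", "a") as log:
        log.write(f"- {stamp}: {note} ({dest.name})\n")
    return dest
```

The milestone path doubles as documentation: the date and note are embedded in the filename, and the CHANGELOG entry links the two.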
**Directory structure:**

```
backups/
├── daily/                  # Rolling 7-day backups (~770KB each)
│   ├── backup_2026-01-17.csv
│   └── backup_2026-01-23.csv
├── milestones/             # Permanent compressed backups (~200KB each)
│   ├── milestone_2026-01-20_initial_enrichment.csv.gz
│   └── milestone_2026-01-23_recovered_accessions.csv.gz
├── CHANGELOG.md            # Auto-generated change log
└── README.md               # User documentation
```

**Storage efficiency:**
- Daily backups: ~5.4 MB (7 days × 770KB)
- Milestone backups: ~200KB each compressed (80% size reduction)
- Total: <10 MB for the complete project history
- Old daily backups auto-delete after 7 days

**When to create milestones:**
- After adding new data sources (GenomeScope, karyotypes, etc.)
- Before major data transformations
- When completing analysis sections
- Before submitting/publishing

**Global installer available:**

```bash
# Install the backup system in any repository
install-backup-system -f your_data_file.csv
```

**Key features:**
- Never overwrites without confirmation
- Creates a safety backup before restore
- Complete audit trail in the CHANGELOG
- Color-coded terminal output
- Handles both CSV and TSV files

**Benefits for data analysis:**
- **Data provenance** - The CHANGELOG documents every modification
- **Confidence to experiment** - Easy rollback encourages trying approaches
- **Professional workflow** - Matches publication standards
- **Collaboration-ready** - Team members can understand data history
## Debugging Data Availability

Before creating correlation plots, verify data overlap:
```python
# Check how many entities have both metrics
species_with_metric_a = set(inv.get('species_id') for inv in data
                            if inv.get('metric_a'))
species_with_metric_b = set(inv.get('species_id') for inv in data
                            if inv.get('metric_b'))

overlap = species_with_metric_a.intersection(species_with_metric_b)
print(f"Species with both metrics: {len(overlap)}")

if len(overlap) < 10:
    print("⚠️ Warning: Limited data for correlation analysis")
    print(f"  Metric A: {len(species_with_metric_a)} species")
    print(f"  Metric B: {len(species_with_metric_b)} species")
    print(f"  Overlap: {len(overlap)} species")
```

## Variable State Validation

When debugging notebook errors, add validation cells to check variable integrity:
```python
# Validation cell - place before error-prone sections
print('=== VARIABLE VALIDATION ===')
print(f'Type of data: {type(data)}')
print(f'Is data a list? {isinstance(data, list)}')

if isinstance(data, list):
    print(f'Length: {len(data)}')
    if len(data) > 0:
        print(f'First item type: {type(data[0])}')
        print(f'First item keys: {list(data[0].keys())[:10]}')
elif isinstance(data, dict):
    print('⚠️ WARNING: data is a dict, not a list!')
    print(f'Dict keys: {list(data.keys())[:10]}')
    print('This suggests variable shadowing occurred.')
```

**When to use**:
- After "Restart & Run All" produces errors
- When error messages suggest a wrong variable type
- Before cells that fail intermittently
- In notebooks with 50+ cells

**Best practice**: Include automatic validation in cells that depend on critical global variables.

## Programmatic Notebook Manipulation

When inserting cells into large notebooks:
```python
import json

# Read notebook
with open('notebook.ipynb', 'r') as f:
    notebook = json.load(f)

# Create new cell ('code' holds the new cell's source as a single string)
new_cell = {
    "cell_type": "code",
    "execution_count": None,
    "metadata": {},
    "outputs": [],
    "source": [line + '\n' for line in code.split('\n')]
}

# Insert at position
insert_position = 50
notebook['cells'] = (notebook['cells'][:insert_position] +
                     [new_cell] +
                     notebook['cells'][insert_position:])

# Write back
with open('notebook.ipynb', 'w') as f:
    json.dump(notebook, f, indent=1)
```
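Choosing `insert_position` usually means locating an existing cell first. A small search helper over the same JSON structure handles this; `find_cells` is a hypothetical name for illustration:

```python
def find_cells(notebook, needle):
    """Return indices of cells whose joined source contains `needle`."""
    hits = []
    for i, cell in enumerate(notebook.get("cells", [])):
        source = "".join(cell.get("source", []))
        if needle in source:
            hits.append(i)
    return hits

# Usage sketch: insert just after the cell that displays a figure
# insert_position = find_cells(notebook, 'figure_01.png')[0] + 1
```

Joining the `source` list before searching matters: a match can span the line boundaries the JSON format splits on.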
## Synchronizing Figure Code and Notebook Documentation

**Pattern**: Code changes to figure generation → must update the notebook text

**Common scenario**: Updated figure filtering, outlier removal, or statistical tests

**Workflow**:
- Update the figure-generation Python script
- Regenerate the figures
- **CRITICAL**: Update the Jupyter notebook markdown cells documenting the figure
- Use the `NotebookEdit` tool (NOT the `Edit` tool) for `.ipynb` files

Example:

```python
# After adding a Mann-Whitney test to figure generation:
NotebookEdit(
    notebook_path="/path/to/notebook.ipynb",
    cell_id="cell-14",  # Found via grep or Read
    cell_type="markdown",
    new_source="Updated description mentioning Mann-Whitney test..."
)
```

**Finding figure cells**:

```bash
# Locate figure references
grep -n "figure_name.png" notebook.ipynb

# Or use Glob + Grep
grep -n "Figure 4" notebook.ipynb
```

**Why critical**: Outdated documentation causes confusion. Notebook text saying "Limited data" when the data is now complete, or omitting new statistical tests, misleads readers.
## Best Practices Summary

- Always check data availability before creating analyses
- Document outlier removal clearly in titles and comments
- Use consistent naming for variables and figures
- Include statistical testing for all correlations
- Separate visualization from statistics when filtering outliers
- Create templates for repetitive analyses
- Use helper functions consistently across cells
- Organize with markdown headers for navigation
- Test with small datasets before running full analyses
- Save intermediate results for expensive computations

## Common Tasks
### Removing Panels from Multi-Panel Figures

**Scenario**: Convert a 2-panel figure to 1-panel after removing unavailable data.

**Steps**:

1. Update the subplot layout:

   ```python
   # Before: 2 panels
   fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

   # After: 1 panel
   fig, ax = plt.subplots(1, 1, figsize=(10, 6))
   ```

2. Remove panel code: delete all code for the removed panel (`ax2`)

3. Update the figure filename:

   ```python
   # Before
   plt.savefig('06_scaffold_l50_l90_comparison.png')

   # After
   plt.savefig('06_scaffold_l50_comparison.png')
   ```

4. Update notebook references:
   - Image display: `display(Image(...'06_scaffold_l50_comparison.png'))`
   - Title: remove references to the removed data
   - Description: add a note about why the panel is excluded

5. Clean up old files:

   ```bash
   rm figures/*_l50_l90_*.png
   ```
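Before running a cleanup `rm`, it can be worth previewing exactly which files the glob matches; a minimal sketch (`preview_stale` is a hypothetical helper, and the pattern reuses the example above):

```python
from pathlib import Path

def preview_stale(pattern, fig_dir="figures"):
    """List files the cleanup glob would delete, without deleting them."""
    root = Path(fig_dir)
    return sorted(root.glob(pattern)) if root.is_dir() else []

for path in preview_stale("*_l50_l90_*.png"):
    print(path)
# After reviewing the list, delete with path.unlink() per file
```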