project-sharing

Project Sharing and Output Preparation

Expert guidance for preparing project outputs for sharing with collaborators, reviewers, or repositories. Creates organized packages at different sharing levels while preserving your working directory.

When to Use This Skill

  • Sharing analysis results with collaborators
  • Preparing supplementary materials for publications
  • Creating reproducible research packages
  • Archiving completed projects
  • Handoff to other researchers
  • Submitting to data repositories

Core Principles

  1. Work on copies - Never modify the working directory
  2. Choose appropriate level - Match sharing depth to audience needs
  3. Document everything - Include clear guides and metadata
  4. Clean before sharing - Remove debug code, clear outputs, anonymize if needed
  5. Make it reproducible - Include dependencies and instructions
  6. ⚠️ CRITICAL: After creating a sharing folder, all future work happens in the main project directory, NOT in the sharing folder - sharing folders are read-only snapshots

Three Sharing Levels

Level 1: Summary Only

Purpose: Quick sharing for presentations, reports, or high-level review
What to include:
  • PDF export of final notebook(s)
  • Final data/results (CSV, Excel, figures) - optional
  • Brief README
Use when:
  • Sharing results with non-technical stakeholders
  • Presentations or talks
  • Quick review without reproduction needs
  • Space/time constraints
Structure:
shared-summary/
├── README.md                          # Brief overview
├── analysis-YYYY-MM-DD.pdf           # Notebook as PDF
└── results/
    ├── figures/
    │   ├── fig1-main-result.png
    │   └── fig2-comparison.png
    └── tables/
        └── summary-statistics.csv
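
A sketch of creating this Level 1 skeleton with Python's standard library; the folder names follow the tree above, and `make_summary_skeleton` is an illustrative helper, not part of any existing tooling:

```python
from pathlib import Path

def make_summary_skeleton(root="shared-summary"):
    """Create the Level 1 directory skeleton shown above."""
    base = Path(root)
    for sub in ("results/figures", "results/tables"):
        # parents=True also creates the base directory on first pass
        (base / sub).mkdir(parents=True, exist_ok=True)
    (base / "README.md").touch()  # fill in with a brief overview
    return base
```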

Level 2: Reproducible

Purpose: Enable others to reproduce your analysis from processed data
What to include:
  • Analysis notebooks (.ipynb) - cleaned
  • Scripts for figure generation
  • Processed/analysis-ready data
  • Requirements file (requirements.txt or environment.yml)
  • Detailed README with instructions
Use when:
  • Sharing with collaborating researchers
  • Peer review / manuscript supplementary materials
  • Teaching or tutorials
  • Standard collaboration needs
Structure:
shared-reproducible/
├── README.md                          # Setup and reproduction instructions
├── MANIFEST.md                        # File descriptions
├── environment.yml                    # Conda environment OR requirements.txt
├── notebooks/
│   ├── 01-data-processing.ipynb      # Cleaned, outputs cleared
│   ├── 02-analysis.ipynb
│   └── 03-visualization.ipynb
├── scripts/
│   ├── generate_figures.py           # Standalone scripts
│   └── utils.py
└── data/
    ├── processed/
    │   ├── cleaned_data.csv
    │   └── processed_results.tsv
    └── README.md                      # Data provenance
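
Before sending a Level 2 package, it can help to verify that the promised entries actually exist. A sketch; the `REQUIRED` list is illustrative and should be trimmed to what your package commits to:

```python
from pathlib import Path

# Entries the Level 2 structure above promises; adjust per project
REQUIRED = ["README.md", "MANIFEST.md", "environment.yml",
            "notebooks", "data/processed"]

def missing_items(root):
    """Return required entries absent from the sharing folder."""
    return [r for r in REQUIRED if not (Path(root) / r).exists()]
```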

Level 3: Full Traceability

Purpose: Complete transparency from raw data through all processing steps
What to include:
  • Starting/raw data
  • All processing scripts and notebooks
  • All intermediate files
  • Final results
  • Complete documentation
  • Full dependency specification
Use when:
  • Archiving for future reference
  • Regulatory compliance
  • High-stakes reproducibility (clinical, policy)
  • Data repository submission (Zenodo, Dryad, etc.)
  • Complete project handoff
Structure:
shared-complete/
├── README.md                          # Complete project guide
├── MANIFEST.md                        # Comprehensive file listing
├── environment.yml
├── data/
│   ├── raw/                          # Original, unmodified data
│   │   ├── sample_A_reads.fastq.gz
│   │   └── README.md                 # Data source, download date
│   ├── intermediate/                 # Processing steps
│   │   ├── 01-filtered/
│   │   ├── 02-aligned/
│   │   └── README.md
│   └── processed/                    # Final analysis-ready
│       └── final_dataset.csv
├── scripts/
│   ├── 01-download-data.sh
│   ├── 02-quality-control.py
│   ├── 03-filtering.py
│   ├── 04-analysis.py
│   └── utils/
├── notebooks/
│   ├── exploratory/                  # Early exploration
│   └── final/                        # Publication analyses
├── results/
│   ├── figures/
│   ├── tables/
│   └── supplementary/
└── documentation/
    ├── methods.md                    # Detailed methodology
    ├── changelog.md                  # Processing decisions
    └── data-dictionary.md            # Variable definitions
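
A file listing like the one that goes into MANIFEST.md can be generated rather than typed by hand; a sketch, where `tree_listing` is a hypothetical helper producing one relative path per line:

```python
from pathlib import Path

def tree_listing(root):
    """Flat, sorted listing of a package tree, e.g. to paste into MANIFEST.md."""
    root = Path(root)
    return [str(p.relative_to(root)) for p in sorted(root.rglob("*"))]
```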

Preparation Workflow

Step 1: Ask User for Sharing Level

Questions to determine level:
```
Which sharing level do you need?

1. Summary Only - PDF + final results (quick sharing)
2. Reproducible - Notebooks + scripts + data (standard sharing)
3. Full Traceability - Everything from raw data (archival/compliance)

Additional questions:
- Who is the audience? (colleagues, reviewers, public)
- Are there size constraints?
- Any sensitive data to handle?
- Timeline for sharing?
```

Step 2: Identify Files to Include

For each level, identify:
Level 1 - Summary:
  • Main analysis notebook(s)
  • Key figures (publication-quality)
  • Summary tables/statistics
  • Optional: Final processed dataset
Level 2 - Reproducible:
  • All analysis notebooks (not exploratory)
  • Figure generation scripts
  • Processed/cleaned data
  • Environment specification
  • Any utility functions/modules
Level 3 - Full:
  • Raw data (or links if too large)
  • All processing scripts
  • All notebooks (including exploratory)
  • All intermediate files
  • Complete documentation
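
The per-level lists above can be encoded as glob patterns for an automated copy step. A sketch; the patterns are illustrative and assume the directory layout from the structures shown earlier:

```python
from pathlib import Path

# Illustrative glob patterns per sharing level; adjust to your layout
LEVEL_PATTERNS = {
    "summary": ["results/figures/*.png", "results/tables/*.csv"],
    "reproducible": ["notebooks/*.ipynb", "scripts/*.py",
                     "data/processed/*", "environment.yml"],
    "full": ["data/**/*", "scripts/**/*", "notebooks/**/*",
             "results/**/*", "documentation/**/*"],
}

def files_for_level(project_root, level):
    """Resolve which files a given sharing level would include."""
    root = Path(project_root)
    matches = []
    for pattern in LEVEL_PATTERNS[level]:
        matches.extend(p for p in root.glob(pattern) if p.is_file())
    return sorted(matches)
```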

Step 3: Create Sharing Directory

```bash
# Create dated directory
SHARE_DIR="shared-$(date +%Y%m%d)-[level]"
mkdir -p "$SHARE_DIR"

# Create subdirectories based on level
# ... appropriate structure from above
```

Step 4: Copy and Clean Files

For notebooks (.ipynb):
```python
import nbformat
from nbconvert.preprocessors import ClearOutputPreprocessor

def clean_notebook(input_path, output_path):
    """Clean notebook: clear outputs, remove debug cells."""

    # Read notebook
    with open(input_path, 'r') as f:
        nb = nbformat.read(f, as_version=4)

    # Clear outputs
    clear_output = ClearOutputPreprocessor()
    nb, _ = clear_output.preprocess(nb, {})

    # Remove cells tagged as 'debug' or 'remove'
    nb.cells = [cell for cell in nb.cells
                if 'debug' not in cell.metadata.get('tags', [])
                and 'remove' not in cell.metadata.get('tags', [])]

    # Write cleaned notebook
    with open(output_path, 'w') as f:
        nbformat.write(nb, f)
```
For data files:
  • Copy as-is for small files
  • Consider compression for large files
  • Check for sensitive information
For scripts:
  • Remove debugging code
  • Add docstrings if missing
  • Ensure paths are relative
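
For the "ensure paths are relative" check, a rough heuristic that flags hard-coded home-directory paths in scripts; the regex is deliberately narrow, not exhaustive:

```python
import re

# Heuristic: quoted strings starting with /Users/... or /home/...
ABS_PATH_RE = re.compile(r"""['"](/(?:Users|home)/[^'"]+)['"]""")

def find_absolute_paths(source_text):
    """Return hard-coded absolute paths that will break on other machines."""
    return ABS_PATH_RE.findall(source_text)
```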

Step 5: Generate Documentation

README.md Template

````markdown
# Project: [Project Name]

**Date**: YYYY-MM-DD
**Author**: [Your Name]
**Sharing Level**: [Summary/Reproducible/Full]

## Overview

Brief description of the project and analysis.

## Contents

See MANIFEST.md for detailed file descriptions.

## Requirements

[For Reproducible/Full levels]
- Python 3.X
- See environment.yml for dependencies

## Setup

```bash
# Create environment
conda env create -f environment.yml
conda activate project-name
```

## Reproduction Steps

[For Reproducible/Full levels]
1. [Description of first step]
   ```bash
   jupyter notebook notebooks/01-analysis.ipynb
   ```
2. [Description of second step]

## Data Sources

[For Full level]
- Dataset A: [Source, download date, version]
- Dataset B: [Source, download date, version]

## Contact

[Your email or preferred contact]

## License

[If applicable - e.g., CC BY 4.0, MIT]
````

MANIFEST.md Template

````markdown
# File Manifest

Generated: YYYY-MM-DD

## Directory Structure

```
shared-YYYYMMDD/
├── README.md   - Project overview and setup
├── MANIFEST.md - This file
[... complete tree ...]
```

## File Descriptions

### Notebooks
- `notebooks/01-data-processing.ipynb` - Initial data loading and cleaning
- `notebooks/02-analysis.ipynb` - Main statistical analysis
- `notebooks/03-visualization.ipynb` - Figure generation for publication

### Data
- `data/processed/cleaned_data.csv` - Quality-controlled dataset (N=XXX samples)
  - Columns: [list key columns]
  - Missing values handled by [method]

### Scripts
- `scripts/generate_figures.py` - Automated figure generation
  - Usage: `python generate_figures.py --input data/processed/cleaned_data.csv`

### Results
- `results/figures/fig1-main.png` - Main result showing [description]
- `results/tables/summary_stats.csv` - Descriptive statistics

[Continue for all files...]
````

Step 6: Handle Sensitive Data

Check for sensitive information:
  • Personal identifiable information (PII)
  • Access credentials (API keys, passwords)
  • Proprietary data
  • Institutional data with sharing restrictions
  • Patient/subject identifiers
Strategies:
  1. Anonymize - Remove or hash identifiers
  2. Exclude - Don't include sensitive files
  3. Aggregate - Share summary statistics only
  4. Document restrictions - Note what's excluded and why
Example anonymization:
```python
import hashlib

def anonymize_ids(df, id_column='subject_id'):
    """Replace IDs with hashed values."""
    df[id_column] = df[id_column].apply(
        lambda x: hashlib.sha256(str(x).encode()).hexdigest()[:8]
    )
    return df
```
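
A similarly rough sweep for leftover credentials before packaging; the patterns here are illustrative, and a dedicated scanner will catch far more:

```python
import re

# Illustrative patterns; real secret scanners cover many more cases
SECRET_PATTERNS = [
    re.compile(r"(?i)(?:api[_-]?key|secret|password|token)\s*[=:]\s*['\"][^'\"]+['\"]"),
]

def scan_for_secrets(text):
    """Return suspicious credential-looking assignments found in text."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits
```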

Step 7: Package and Compress

For smaller packages (<100MB):

```bash
# Create zip archive
zip -r shared-YYYYMMDD.zip shared-YYYYMMDD/
```

For larger packages:

```bash
# Create tar.gz (better compression)
tar -czf shared-YYYYMMDD.tar.gz shared-YYYYMMDD/

# Or split into parts if very large
tar -czf - shared-YYYYMMDD/ | split -b 1G - shared-YYYYMMDD.tar.gz.part
```

Document package contents:
  • Total size
  • Number of files
  • Compression method
  • How to extract
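
The size and file count to document can be gathered automatically; a small sketch:

```python
from pathlib import Path

def package_stats(root):
    """Total size and file count for a sharing folder, for the README."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    return {"files": len(files),
            "bytes": sum(p.stat().st_size for p in files)}
```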

Step 8: Return to Working Directory

⚠️ IMPORTANT: After creating the sharing package, always work in the main project directory.
The sharing folder is a snapshot for distribution only. Any future development, analysis, or modifications should happen in your original working directory, not in the shared-*/ folder.
Claude should:
  • Change back to the main project directory: cd .. (if needed)
  • Confirm the working directory: pwd
  • Continue all work in the original project location
  • Treat sharing folders as read-only archives
Example:

```bash
# After creating the sharing package
cd /path/to/main/project   # Return to working directory
pwd                        # Verify location

# Continue work here, NOT in shared-YYYYMMDD/
```


---

Best Practices

Notebook Cleaning

Before sharing notebooks:
  1. Clear all outputs
     ```bash
     jupyter nbconvert --clear-output --inplace notebook.ipynb
     ```
  2. Remove debug cells
     • Tag cells for removal: Cell → Cell Tags → add "remove"
     • Filter during copy
  3. Add markdown explanations
     • Ensure each code cell has context
     • Add section headers
     • Document assumptions
  4. Check cell execution order
     • Run "Restart & Run All" to verify
     • Fix any out-of-order dependencies
  5. Remove absolute paths
     ```python
     # ❌ Bad
     data = pd.read_csv('/Users/yourname/project/data.csv')

     # ✅ Good
     data = pd.read_csv('../data/data.csv')
     # or
     from pathlib import Path
     data_dir = Path(__file__).parent / 'data'
     ```
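
Item 4 (cell execution order) can also be checked programmatically: after "Restart & Run All", code cells should carry strictly increasing execution counts. A sketch that works on the raw notebook JSON, with cells as plain dicts so no nbformat dependency is needed:

```python
def cells_in_order(cells):
    """True if executed code cells have strictly increasing execution counts."""
    counts = [c.get("execution_count") for c in cells
              if c.get("cell_type") == "code"
              and c.get("execution_count") is not None]
    return all(a < b for a, b in zip(counts, counts[1:]))
```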

File Organization

Naming conventions for shared files:
  • Use descriptive names: telomere_analysis_results.csv, not results.csv
  • Include dates for time-sensitive data: data_2024-01-15.csv
  • Version if applicable: analysis_v2.ipynb
  • No spaces: use - or _
Size considerations:
  • Document large files in README
  • Consider hosting large data separately (institutional storage, Zenodo)
  • Provide download links instead of including in package
  • Use .gitattributes for large file tracking if using Git
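
The no-spaces rule lends itself to a small validator; a sketch with an illustrative character whitelist:

```python
import re

# Letters, digits, and ., -, _ separators; no spaces
SHARE_NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._\-]*")

def is_share_friendly(filename):
    """True if the name has no spaces and uses only -, _, . separators."""
    return SHARE_NAME_RE.fullmatch(filename) is not None
```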

Documentation Requirements

Minimum documentation for each level:
Level 1 - Summary:
  • What the results show
  • Key findings
  • Date and author
Level 2 - Reproducible:
  • Setup instructions
  • How to run the analysis
  • Software dependencies
  • Expected runtime
  • Data source information
Level 3 - Full:
  • Complete methodology
  • All data sources with versions
  • Processing decisions and rationale
  • Known issues or limitations
  • Contact information

Dependency Management

Create requirements file:
For pip:

```bash
# From active environment
pip freeze > requirements.txt

# Or manually curated (better)
cat > requirements.txt << EOF
pandas>=1.5.0
numpy>=1.23.0
matplotlib>=3.6.0
scipy>=1.9.0
EOF
```

For conda:

```bash
# Export current environment
conda env export > environment.yml

# Or minimal (recommended)
conda env export --from-history > environment.yml
# Then edit to remove build-specific details
```


---

Common Scenarios

Scenario 1: Sharing with Lab Collaborators

Level: Reproducible
Include:
  • Cleaned analysis notebooks
  • Processed data
  • Figure generation scripts
  • environment.yml
  • README with reproduction steps
Don't include:
  • Exploratory notebooks
  • Failed analysis attempts
  • Debug outputs
  • Personal notes

Scenario 2: Manuscript Supplementary Material

Level: Reproducible or Full (depending on journal)
Include:
  • All notebooks used for figures in paper
  • Scripts for each figure panel
  • Processed data (or instructions to obtain)
  • Complete environment specification
  • Detailed methods document
Best practices:
  • Number notebooks to match paper sections
  • Export key figures in publication formats (PDF, high-res PNG)
  • Include data dictionary for all variables
  • Test reproduction on clean environment

Scenario 3: Project Archival

Level: Full Traceability
Include:
  • Complete data pipeline from raw to processed
  • All versions of analysis
  • Meeting notes or decision logs
  • External tool versions
  • System information
Organization tips:
  • Use dates in directory names
  • Keep chronological changelog
  • Document all external dependencies
  • Include contact info for questions

Scenario 4: Data Repository Submission (Zenodo, Figshare)

Level: Full Traceability
Additional considerations:
  • Add LICENSE file (CC BY 4.0, MIT, etc.)
  • Include CITATION.cff or CITATION.txt
  • Comprehensive metadata
  • README with DOI/reference instructions
  • Consider maximum file sizes
  • Review repository-specific guidelines


Quality Checklist

Before finalizing the sharing package:

File Quality

  • All notebooks run without errors
  • Notebook outputs cleared
  • No absolute paths in code
  • No hardcoded credentials or API keys
  • File sizes documented
  • Large files compressed or linked

Documentation

  • README explains setup and usage
  • MANIFEST describes all files
  • Data sources documented
  • Dependencies specified
  • Contact information included
  • License specified (if applicable)

Reproducibility

  • Requirements file tested in clean environment
  • All data accessible (included or linked)
  • Scripts run in documented order
  • Expected outputs match actual outputs
  • Processing time documented

Privacy & Sensitivity

  • No sensitive data included
  • Identifiers anonymized if needed
  • Institutional policies checked
  • Collaborator permissions obtained

Organization

  • Clear directory structure
  • Consistent naming conventions
  • Files logically grouped
  • No duplicate files
  • No unnecessary files (cache, .DS_Store, etc.)
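
The last item can be automated with a sweep for well-known junk names; the `JUNK_NAMES` set is illustrative, extend it for your platform and tools:

```python
from pathlib import Path

# Common junk names; extend for your platform/tools
JUNK_NAMES = {".DS_Store", "__pycache__", ".ipynb_checkpoints", "Thumbs.db"}

def find_junk(root):
    """Return junk files/dirs that should not ship in a sharing package."""
    return [p for p in Path(root).rglob("*") if p.name in JUNK_NAMES]
```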

Integration with Other Skills

Works well with:
  • folder-organization - Ensures source project is well-organized before sharing
  • jupyter-notebook-analysis - Creates notebooks that are share-ready
  • managing-environments - Documents dependencies properly
Before using this skill:
  1. Organize working directory (folder-organization)
  2. Finalize analysis (jupyter-notebook-analysis)
  3. Document environment (managing-environments)
After using this skill:
  1. Test package in clean environment
  2. Share via appropriate channel (email, repository, cloud storage)
  3. Keep archived copy for reference

Example Scripts

Create Sharing Package Script

```python
#!/usr/bin/env python3
"""Create sharing package for project."""

import shutil
from pathlib import Path
from datetime import date
import nbformat
from nbconvert.preprocessors import ClearOutputPreprocessor

def create_sharing_package(level='reproducible', output_dir=None):
    """
    Create sharing package.

    Args:
        level: 'summary', 'reproducible', or 'full'
        output_dir: Output directory name (auto-generated if None)
    """

    # Create output directory
    if output_dir is None:
        output_dir = f"shared-{date.today():%Y%m%d}-{level}"

    share_path = Path(output_dir)
    share_path.mkdir(exist_ok=True)

    print(f"Creating {level} sharing package in {share_path}")

    # Create structure based on level
    if level == 'summary':
        create_summary_package(share_path)
    elif level == 'reproducible':
        create_reproducible_package(share_path)
    elif level == 'full':
        create_full_package(share_path)

    print(f"✓ Package created: {share_path}")
    print(f"  Review and compress: tar -czf {share_path}.tar.gz {share_path}")

def clean_notebook(input_path, output_path):
    """Clean notebook outputs and debug cells."""
    with open(input_path) as f:
        nb = nbformat.read(f, as_version=4)

    # Clear outputs
    clear = ClearOutputPreprocessor()
    nb, _ = clear.preprocess(nb, {})

    # Remove debug cells
    nb.cells = [c for c in nb.cells
                if 'debug' not in c.metadata.get('tags', [])]

    with open(output_path, 'w') as f:
        nbformat.write(nb, f)

# ... implement level-specific functions ...

if __name__ == '__main__':
    import sys
    level = sys.argv[1] if len(sys.argv) > 1 else 'reproducible'
    create_sharing_package(level)
```

---

Summary

Key principles for project sharing:
  1. 🎯 Choose the right level - Match sharing depth to audience needs
  2. 📋 Copy, don't move - Preserve your working directory
  3. 🧹 Clean thoroughly - Remove debug code, clear outputs
  4. 📝 Document everything - README + MANIFEST minimum
  5. 🔒 Check sensitivity - Anonymize or exclude as needed
  6. Test before sharing - Run in clean environment
  7. 📦 Package properly - Compress and document contents
  8. ⚠️ Work in main directory - After creating a sharing package, ALL future work happens in the original project directory, NOT in the sharing folder
Remember: Good sharing practices benefit both collaborators and your future self!


⚠️ Critical Reminder for Claude

After creating any sharing package:
  1. Always return to the main project directory
  2. Never work in shared-*/ directories - these are read-only snapshots
  3. All future edits, analysis, and development happen in the original working directory
  4. Sharing folders are for distribution only, not active development
If the user asks to modify files, always check the current directory and ensure you're working in the main project location, not in a sharing package.