nextflow-development


nf-core Pipeline Deployment

Run nf-core bioinformatics pipelines on local or public sequencing data.
Target users: Bench scientists and researchers without specialized bioinformatics training who need to run large-scale omics analyses—differential expression, variant calling, or chromatin accessibility analysis.

Workflow Checklist

- [ ] Step 0: Acquire data (if from GEO/SRA)
- [ ] Step 1: Environment check (MUST pass)
- [ ] Step 2: Select pipeline (confirm with user)
- [ ] Step 3: Run test profile (MUST pass)
- [ ] Step 4: Create samplesheet
- [ ] Step 5: Configure & run (confirm genome with user)
- [ ] Step 6: Verify outputs

Step 0: Acquire Data (GEO/SRA Only)

Skip this step if user has local FASTQ files.
For public datasets, fetch from GEO/SRA first. See references/geo-sra-acquisition.md for the full workflow.
Quick start:

```bash
# 1. Get study info
python scripts/sra_geo_fetch.py info GSE110004

# 2. Download (interactive mode)
python scripts/sra_geo_fetch.py download GSE110004 -o ./fastq -i

# 3. Generate samplesheet
python scripts/sra_geo_fetch.py samplesheet GSE110004 --fastq-dir ./fastq -o samplesheet.csv
```

**DECISION POINT:** After fetching study info, confirm with user:
- Which sample subset to download (if multiple data types)
- Suggested genome and pipeline

Then continue to Step 1.

---

Step 1: Environment Check

Run this first. The pipeline will fail if the environment check does not pass.

```bash
python scripts/check_environment.py
```

All critical checks must pass. If any fail, provide fix instructions:

Docker issues

| Problem | Fix |
| --- | --- |
| Not installed | Install from https://docs.docker.com/get-docker/ |
| Permission denied | `sudo usermod -aG docker $USER`, then re-login |
| Daemon not running | `sudo systemctl start docker` |

Nextflow issues

| Problem | Fix |
| --- | --- |
| Not installed | `curl -s https://get.nextflow.io \| bash && mv nextflow ~/bin/` |
| Version < 23.04 | `nextflow self-update` |

Java issues

| Problem | Fix |
| --- | --- |
| Not installed / < 11 | `sudo apt install openjdk-11-jdk` |

Do not proceed until all checks pass. For HPC/Singularity, see references/troubleshooting.md.
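
As a rough illustration of the kind of checks involved, the sketch below looks for the three executables on PATH and compares version strings against the minimums listed in the tables above (Nextflow 23.04, Java 11). It is not the contents of scripts/check_environment.py, which are not shown here; the names `parse_version`, `meets_minimum`, and `missing_tools` are hypothetical.

```python
import re
import shutil

# Minimum versions taken from the tables above.
MIN_VERSIONS = {"nextflow": (23, 4), "java": (11,)}


def parse_version(text):
    """Return the first dotted number in `text` as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(version_text, minimum):
    """True if the version found in `version_text` is at least `minimum`."""
    found = parse_version(version_text)
    return found is not None and found >= minimum


def missing_tools(tools=("docker", "nextflow", "java")):
    """List the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Missing from PATH:", missing_tools() or "none")
    print(meets_minimum("nextflow version 23.10.1.5891", MIN_VERSIONS["nextflow"]))
```

Version strings would come from `nextflow -version` and `java -version`; how their output is captured is left out here so the sketch runs on a machine without either tool installed.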

Step 2: Select Pipeline

**DECISION POINT:** Confirm with user before proceeding.

| Data Type | Pipeline | Version | Goal |
| --- | --- | --- | --- |
| RNA-seq | `rnaseq` | 3.22.2 | Gene expression |
| WGS/WES | `sarek` | 3.7.1 | Variant calling |
| ATAC-seq | `atacseq` | 2.1.2 | Chromatin accessibility |

Auto-detect from data:

```bash
python scripts/detect_data_type.py /path/to/data
```

For pipeline-specific details:
  • references/pipelines/rnaseq.md
  • references/pipelines/sarek.md
  • references/pipelines/atacseq.md
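
The selection table can be expressed as a small lookup, which makes it easy to show the user a concrete suggestion to confirm. This is a minimal sketch; `PIPELINES` and `suggest_pipeline` are hypothetical names and are not part of the scripts referenced above.

```python
# Data type -> (pipeline, pinned version, goal), copied from the table above.
PIPELINES = {
    "RNA-seq": ("rnaseq", "3.22.2", "Gene expression"),
    "WGS/WES": ("sarek", "3.7.1", "Variant calling"),
    "ATAC-seq": ("atacseq", "2.1.2", "Chromatin accessibility"),
}


def suggest_pipeline(data_type):
    """Return a one-line suggestion to show the user; raise for unknown types."""
    try:
        pipeline, version, goal = PIPELINES[data_type]
    except KeyError:
        known = ", ".join(sorted(PIPELINES))
        raise ValueError(
            f"Unsupported data type {data_type!r}; expected one of: {known}"
        ) from None
    return f"{data_type}: nf-core/{pipeline} -r {version} ({goal})"


if __name__ == "__main__":
    print(suggest_pipeline("RNA-seq"))
```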

Step 3: Run Test Profile

This validates the environment with a small dataset. It MUST pass before running real data.

```bash
nextflow run nf-core/<pipeline> -r <version> -profile test,docker --outdir test_output
```

| Pipeline | Command |
| --- | --- |
| `rnaseq` | `nextflow run nf-core/rnaseq -r 3.22.2 -profile test,docker --outdir test_rnaseq` |
| `sarek` | `nextflow run nf-core/sarek -r 3.7.1 -profile test,docker --outdir test_sarek` |
| `atacseq` | `nextflow run nf-core/atacseq -r 2.1.2 -profile test,docker --outdir test_atacseq` |

Verify:

```bash
ls test_output/multiqc/multiqc_report.html
grep "Pipeline completed successfully" .nextflow.log
```

If the test fails, see references/troubleshooting.md.
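
The two verification commands can also be combined into a single check. The sketch below assumes the report path and log message exactly as given in the text above; `build_test_command` and `run_passed` are hypothetical helper names.

```python
from pathlib import Path

SUCCESS_MARKER = "Pipeline completed successfully"  # the string grepped for above


def build_test_command(pipeline, version, outdir):
    """Assemble the test-profile command from the table above as an argument list."""
    return ["nextflow", "run", f"nf-core/{pipeline}", "-r", version,
            "-profile", "test,docker", "--outdir", str(outdir)]


def run_passed(outdir, log_path=".nextflow.log"):
    """True if the MultiQC report exists and the log contains the success marker."""
    report = Path(outdir) / "multiqc" / "multiqc_report.html"
    log = Path(log_path)
    if not report.is_file() or not log.is_file():
        return False
    return SUCCESS_MARKER in log.read_text(errors="replace")


if __name__ == "__main__":
    print(" ".join(build_test_command("rnaseq", "3.22.2", "test_rnaseq")))
    print("passed:", run_passed("test_rnaseq"))
```

The same `run_passed` check applies to a real run by pointing `outdir` at the results directory.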

Step 4: Create Samplesheet

Generate automatically

```bash
python scripts/generate_samplesheet.py /path/to/data <pipeline> -o samplesheet.csv
```

The script:
  • Discovers FASTQ/BAM/CRAM files
  • Pairs R1/R2 reads
  • Infers sample metadata
  • Validates before writing

For sarek: the script prompts for tumor/normal status if it is not auto-detected.
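
To make the read-pairing step concrete, here is a minimal sketch of how R1/R2 files might be grouped by sample name and turned into rnaseq samplesheet rows. The file-naming convention (`<sample>_R1` / `<sample>_R2`, optional `_001` suffix) is an assumption, and `pair_reads` and `rnaseq_rows` are hypothetical names; the actual logic in scripts/generate_samplesheet.py is not shown in this document and may differ.

```python
import re
from pathlib import Path

# Assumed naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
# (also accepts .fq.gz and an optional _001 suffix).
READ_PATTERN = re.compile(
    r"^(?P<sample>.+)_R(?P<read>[12])(?:_001)?\.(?:fastq|fq)\.gz$"
)


def pair_reads(filenames):
    """Group FASTQ paths into {sample: {"1": r1_path, "2": r2_path}}."""
    pairs = {}
    for name in sorted(filenames):
        match = READ_PATTERN.match(Path(name).name)
        if match is None:
            continue  # not a recognised FASTQ name; skip it
        pairs.setdefault(match["sample"], {})[match["read"]] = name
    return pairs


def rnaseq_rows(pairs):
    """Build rnaseq samplesheet rows; fastq_2 stays empty for single-end samples."""
    rows = [["sample", "fastq_1", "fastq_2", "strandedness"]]
    for sample, reads in sorted(pairs.items()):
        if "1" not in reads:
            raise ValueError(f"Sample {sample!r} has an R2 file but no R1 file")
        rows.append([sample, reads["1"], reads.get("2", ""), "auto"])
    return rows
```

The rows can be written out with `csv.writer(handle).writerows(rows)`.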

Validate existing samplesheet

```bash
python scripts/generate_samplesheet.py --validate samplesheet.csv <pipeline>
```

Samplesheet formats

rnaseq:

```csv
sample,fastq_1,fastq_2,strandedness
SAMPLE1,/abs/path/R1.fq.gz,/abs/path/R2.fq.gz,auto
```

sarek:

```csv
patient,sample,lane,fastq_1,fastq_2,status
patient1,tumor,L001,/abs/path/tumor_R1.fq.gz,/abs/path/tumor_R2.fq.gz,1
patient1,normal,L001,/abs/path/normal_R1.fq.gz,/abs/path/normal_R2.fq.gz,0
```

atacseq:

```csv
sample,fastq_1,fastq_2,replicate
CONTROL,/abs/path/ctrl_R1.fq.gz,/abs/path/ctrl_R2.fq.gz,1
```
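
A basic structural check against these formats could look like the sketch below. It only compares the header and column counts and flags an empty `fastq_1`; whatever additional validation the `--validate` option performs is not documented here. `EXPECTED_COLUMNS` and `check_samplesheet` are hypothetical names.

```python
import csv
import io

# Column headers copied from the formats above.
EXPECTED_COLUMNS = {
    "rnaseq": ["sample", "fastq_1", "fastq_2", "strandedness"],
    "sarek": ["patient", "sample", "lane", "fastq_1", "fastq_2", "status"],
    "atacseq": ["sample", "fastq_1", "fastq_2", "replicate"],
}


def check_samplesheet(text, pipeline):
    """Return a list of problems in CSV `text`; an empty list means none found."""
    rows = list(csv.reader(io.StringIO(text)))
    expected = EXPECTED_COLUMNS[pipeline]
    if not rows or rows[0] != expected:
        return [f"header should be: {','.join(expected)}"]
    problems = []
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected):
            problems.append(
                f"line {number}: expected {len(expected)} columns, found {len(row)}"
            )
            continue
        record = dict(zip(expected, row))
        if not record["fastq_1"]:
            problems.append(f"line {number}: fastq_1 is empty")
    return problems
```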

Step 5: Configure & Run

5a. Check genome availability

```bash
python scripts/manage_genomes.py check <genome>

# If not installed:
python scripts/manage_genomes.py download <genome>
```

Common genomes: GRCh38 (human), GRCh37 (legacy), GRCm39 (mouse), R64-1-1 (yeast), BDGP6 (fly)

5b. Decision points

**DECISION POINT:** Confirm with user:
  1. Genome: which reference to use
  2. Pipeline-specific options:
    • rnaseq: aligner (star_salmon recommended, hisat2 for low memory)
    • sarek: tools (haplotypecaller for germline, mutect2 for somatic)
    • atacseq: read_length (50, 75, 100, or 150)

5c. Run pipeline

```bash
nextflow run nf-core/<pipeline> \
    -r <version> \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --genome <genome> \
    -resume
```

Key flags:
  • `-r`: Pin version
  • `-profile docker`: Use Docker (or `singularity` for HPC)
  • `--genome`: iGenomes key
  • `-resume`: Continue from checkpoint

Resource limits (if needed):

```bash
--max_cpus 8 --max_memory '32.GB' --max_time '24.h'
```
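
When the run is launched from a script, building the command as an argument list avoids shell-quoting problems and makes it easy to print for the user to confirm first. This is a minimal sketch using only the flags shown above; `build_run_command` is a hypothetical name, and the `--aligner star_salmon` option in the usage example is one of the pipeline-specific choices from 5b.

```python
import shlex


def build_run_command(pipeline, version, genome, samplesheet="samplesheet.csv",
                      outdir="results", profile="docker", extra_args=()):
    """Assemble the `nextflow run` invocation shown above as an argument list."""
    command = [
        "nextflow", "run", f"nf-core/{pipeline}",
        "-r", version,          # pin the pipeline version
        "-profile", profile,    # docker, or singularity on HPC
        "--input", samplesheet,
        "--outdir", outdir,
        "--genome", genome,     # iGenomes key
        "-resume",              # continue from the last checkpoint
    ]
    command.extend(extra_args)  # pipeline-specific options confirmed with the user
    return command


if __name__ == "__main__":
    cmd = build_run_command("rnaseq", "3.22.2", "GRCh38",
                            extra_args=["--aligner", "star_salmon"])
    # Print for review; launch with subprocess.run(cmd, check=True) once confirmed.
    print(shlex.join(cmd))
```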

Step 6: Verify Outputs

Check completion

```bash
ls results/multiqc/multiqc_report.html
grep "Pipeline completed successfully" .nextflow.log
```

Key outputs by pipeline

rnaseq:
  • `results/star_salmon/salmon.merged.gene_counts.tsv` - Gene counts
  • `results/star_salmon/salmon.merged.gene_tpm.tsv` - TPM values

sarek:
  • `results/variant_calling/*/` - VCF files
  • `results/preprocessing/recalibrated/` - BAM files

atacseq:
  • `results/macs2/narrowPeak/` - Peak calls
  • `results/bwa/mergedLibrary/bigwig/` - Coverage tracks
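
The list above can be checked mechanically after a run. The sketch below reports which of the listed locations are absent under the output directory; the paths are copied from this section as written, and `KEY_OUTPUTS` and `missing_outputs` are hypothetical names.

```python
import glob
import os

# Key output locations per pipeline, relative to --outdir, copied from the list above.
KEY_OUTPUTS = {
    "rnaseq": ["star_salmon/salmon.merged.gene_counts.tsv",
               "star_salmon/salmon.merged.gene_tpm.tsv"],
    "sarek": ["variant_calling/*/", "preprocessing/recalibrated/"],
    "atacseq": ["macs2/narrowPeak/", "bwa/mergedLibrary/bigwig/"],
}


def missing_outputs(pipeline, outdir="results"):
    """Return the key output patterns that match nothing under `outdir`."""
    missing = []
    for pattern in KEY_OUTPUTS[pipeline]:
        # A trailing slash makes glob match directories only.
        if not glob.glob(os.path.join(outdir, pattern)):
            missing.append(pattern)
    return missing


if __name__ == "__main__":
    print("Missing rnaseq outputs:", missing_outputs("rnaseq") or "none")
```

An empty list only shows that the expected files and directories exist; it says nothing about whether their contents are sensible.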

Quick Reference

For common exit codes and fixes, see references/troubleshooting.md.

Resume failed run

```bash
nextflow run nf-core/<pipeline> -resume
```

References

  • references/geo-sra-acquisition.md - Downloading public GEO/SRA data
  • references/troubleshooting.md - Common issues and fixes
  • references/installation.md - Environment setup
  • references/pipelines/rnaseq.md - RNA-seq pipeline details
  • references/pipelines/sarek.md - Variant calling details
  • references/pipelines/atacseq.md - ATAC-seq details

Disclaimer

This skill is provided as a prototype example demonstrating how to integrate nf-core bioinformatics pipelines into Claude Code for automated analysis workflows. The current implementation supports three pipelines (rnaseq, sarek, and atacseq), serving as a foundation that enables the community to expand support to the full set of nf-core pipelines.
It is intended for educational and research purposes and should not be considered production-ready without appropriate validation for your specific use case. Users are responsible for ensuring their computing environment meets pipeline requirements and for verifying analysis results.
Anthropic does not guarantee the accuracy of bioinformatics outputs, and users should follow standard practices for validating computational analyses. This integration is not officially endorsed by or affiliated with the nf-core community.

Attribution

When publishing results, cite the appropriate pipeline. Citations are available in each nf-core repository's CITATIONS.md file (e.g., https://github.com/nf-core/rnaseq/blob/3.22.2/CITATIONS.md).

Licenses
