systems-architect
Systems Architect Skill
Purpose
Design robust, scalable architectures for bioinformatics software and pipelines.
When to Use This Skill
Use this skill when you need to:
- Design software architecture for complex bioinformatics systems
- Choose appropriate data structures (pandas, anndata, HDF5, databases)
- Plan for scalability (memory, compute, storage)
- Define APIs and interfaces between components
- Design pipeline orchestration (Snakemake, Nextflow, custom)
- Make technology stack decisions
Workflow Integration
Pattern: Requirements → Architecture Design → Implementation Spec
Biologist Commentator validates requirements
↓
Systems Architect designs architecture
↓
Produces technical specification
↓
Software Developer implements from spec
Core Responsibilities
1. System Design
- Component architecture (modular, extensible)
- Data flow design
- Error handling strategy
- Scalability planning
2. Technology Selection
- Data structures (when to use what)
- Storage formats (CSV, HDF5, Parquet, databases)
- Execution environments (local, HPC, cloud)
- Pipeline orchestration tools
3. Performance Planning
- Memory requirements estimation
- Compute resource allocation
- I/O optimization strategies
- Parallelization approach
4. Integration Strategy
- How to wrap existing tools
- Container strategy (Docker/Singularity)
- Dependency management
- Version pinning
5. Architecture Context Document
- Maintain persistent context document describing module structure
- Track dependencies and modification order for safe incremental changes
- Document intended usage patterns for each major component
- Provide streaming/incremental change strategies
Architecture Context Document
The Architecture Context Document (`.architecture/context.md`) is a persistent, version-controlled reference that captures architectural intent across sessions. Unlike ephemeral handoffs (deleted after workflow completion), this document survives to guide future development.
Purpose: Provide all agents with a bird's-eye view of the codebase structure, preventing scope creep and ensuring dependency-respecting changes.
Lifecycle:
- Created: During Phase 3 (Architecture Design) of programming-pm workflow
- Updated: When architectural changes occur (new modules, dependency changes, interface modifications)
- Read: By senior-developer and junior-developer before starting implementation (pre-flight step)
Template and protocols: See `references/architecture-context-template.md` for:
- Four-section template (Module Interconnections, Usage Patterns, Modification Order, Streaming Strategies)
- Generation protocol (Phase 3, Bootstrap Mode for existing codebases, SIMPLE mode abbreviation)
- Maintenance protocol (when to update, staleness detection, drift handling)
- Merge conflict resolution
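The Modification Order section of the template can be derived from the module dependency graph. The sketch below assumes the intent is to change dependencies before the modules that rely on them, and uses a topological sort from the standard library (`graphlib`, Python 3.9+). The module names and the `modification_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def modification_order(dependencies):
    """Return modules ordered so each appears after the modules it depends on.

    `dependencies` maps a module name to the set of modules it imports.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # A cycle means no safe linear order exists; report the cycle itself.
        raise ValueError(f"Circular dependency: {exc.args[1]}") from exc


# Hypothetical module graph for illustration only
deps = {
    "pipeline": {"qc", "io_utils"},
    "qc": {"io_utils"},
    "io_utils": set(),
}
order = modification_order(deps)
print(order)  # ['io_utils', 'qc', 'pipeline']
```

A circular dependency raises `ValueError`, which is a reasonable signal to record the cycle in the context document rather than guess an order.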
Bootstrap Mode
For existing codebases without an Architecture Context Document, systems-architect generates the document during Phase 3 using static analysis:
- List modules/components from directory structure (`src/`, `modules/`)
- Infer dependencies from import statements
- Mark unknowns explicitly with `[TBD]`, `[UNKNOWN]`, or `[INFERRED]` tags
- Document incomplete areas as "Known Gaps" at document end
Bootstrap Mode prioritizes incomplete but honest documentation over fabricated completeness. Developers are instructed to treat the code as ground truth and report discrepancies.
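As an illustration of the import-based inference step, the following sketch parses Python source with the standard `ast` module and collects top-level imported module names. The helper name `infer_imports` and the sample source are hypothetical, and the approach only sees static imports, which is one reason to tag the results `[INFERRED]`.

```python
import ast


def infer_imports(source):
    """Return the sorted top-level module names imported by Python source text."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) stay inside the same package;
            # node.module is None for "from . import x".
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return sorted(modules)


# Sample source text; it is parsed, never executed
example = """
import os
import numpy.linalg as la
from pandas import DataFrame
from . import helpers
"""
print(infer_imports(example))  # ['numpy', 'os', 'pandas']
```

Dynamic imports (for example via `importlib`) are not detected by this kind of analysis, so such dependencies belong under "Known Gaps".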
Standard Architecture Template
Use `assets/architecture_template.md`:
System Architecture: [Project Name]
Overview
[1-2 sentence system description]
Components
Data Flow
[Input] → [Processing] → [Output]
Technology Stack
- Language: Python 3.11
- Key Libraries: pandas, numpy, scikit-learn
- Storage: HDF5 for matrices, SQLite for metadata
- Execution: Snakemake on HPC cluster
Scalability
- Dataset size: [Expected range]
- Memory: [Requirements]
- Compute: [CPU cores, time estimates]
- Storage: [Space requirements]
Error Handling
[Strategy for failures, retries, logging]
Deployment
[Installation, configuration, execution]
Data Structure Selection Guide
See `references/data_structure_guide.md` for full details.
Quick Reference:
| Use Case | Structure | When |
|---|---|---|
| Tabular data <1GB | pandas DataFrame | General analysis |
| Tabular data >1GB | Dask DataFrame | Out-of-core processing |
| Single-cell data | AnnData | scRNA-seq analysis |
| Large matrices | HDF5 | Persistent storage |
| Relational queries | SQLite/PostgreSQL | Complex joins |
| Genomic intervals | BED/GFF files | Standard interchange |
| Time series | pandas with DatetimeIndex | Temporal data |
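The 1 GB cut-off for tabular data in the table can be encoded as a small helper. This is only a sketch: the function name is hypothetical, and the threshold is a rule of thumb taken from the table, not a hard limit.

```python
def suggest_tabular_backend(size_gb, threshold_gb=1.0):
    """Suggest a DataFrame backend from an estimated data size in GB."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return "pandas DataFrame" if size_gb < threshold_gb else "Dask DataFrame"


print(suggest_tabular_backend(0.2))  # pandas DataFrame
print(suggest_tabular_backend(12))   # Dask DataFrame
```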
Scalability Considerations
Memory Estimation
RNA-seq count matrix: genes × samples × 8 bytes
20,000 genes × 1,000 samples × 8 = 160 MB (fits in RAM)
20,000 genes × 100,000 cells × 8 = 16 GB (need sparse or chunking)
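The same arithmetic can be scripted to compare dense and sparse storage. In the sketch below, the helper names are hypothetical, the 10% non-zero fraction is an assumed figure for illustration, and the sparse estimate follows the CSR layout (one value and one column index per non-zero entry, plus a row-pointer array).

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes for a dense matrix; itemsize=8 corresponds to float64 or int64."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, n_cols, density, itemsize=8, index_size=4):
    """Approximate bytes for a CSR sparse matrix with the given non-zero fraction."""
    nnz = round(n_rows * n_cols * density)
    return nnz * (itemsize + index_size) + (n_rows + 1) * index_size


MB, GB = 10**6, 10**9
print(dense_bytes(20_000, 1_000) / MB)    # 160.0
print(dense_bytes(20_000, 100_000) / GB)  # 16.0
# Assumed density of 0.1 (10% non-zero entries), for illustration only
print(round(csr_bytes(20_000, 100_000, density=0.1) / GB, 1))  # 2.4
```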
Compute Planning
DESeq2 analysis: O(n_genes × n_samples²)
100 samples: ~5 minutes
1,000 samples: ~8 hours
Strategy: Subset for testing, full run overnight
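The jump from 5 minutes to roughly 8 hours follows from the quadratic term: ten times the samples costs about one hundred times the runtime. A minimal sketch of that extrapolation, assuming the stated scaling holds and ignoring hardware and I/O effects (the function name is hypothetical):

```python
def extrapolate_runtime(base_minutes, base_samples, target_samples, exponent=2):
    """Scale a measured runtime assuming cost grows as n_samples ** exponent."""
    return base_minutes * (target_samples / base_samples) ** exponent


minutes = extrapolate_runtime(5, 100, 1_000)
print(f"{minutes:.0f} min (~{minutes / 60:.1f} h)")  # 500 min (~8.3 h)
```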
Storage Planning
FASTQ (compressed): 50-100 MB per million reads
50M reads = 2.5-5 GB (plan for 5 GB)
100 samples × 50M reads = up to 500 GB
Strategy: Delete FASTQ after alignment, keep BAM
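The storage figures can be reproduced with a short calculation. The sketch below uses a hypothetical helper, and decimal units (1 GB = 1,000 MB) are an assumption.

```python
def fastq_storage_gb(n_samples, reads_millions, mb_per_million_reads):
    """Estimate compressed FASTQ storage in GB (1 GB = 1,000 MB)."""
    return n_samples * reads_millions * mb_per_million_reads / 1_000


low = fastq_storage_gb(100, 50, 50)
high = fastq_storage_gb(100, 50, 100)
print(f"{low:.0f}-{high:.0f} GB")  # 250-500 GB
```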
Integration Patterns
Wrapping External Tools
    # input_file, output_dir, and bam_file are placeholders for caller-supplied paths

    # Pattern 1: Subprocess call
    import subprocess

    # check=True raises CalledProcessError on a non-zero exit status;
    # capture_output=True collects stdout and stderr for logging.
    result = subprocess.run(
        ['fastqc', input_file, '-o', output_dir],
        capture_output=True, check=True
    )

    # Pattern 2: Python binding (preferred if available)
    import pysam

    # 'rb' opens the BAM file for binary reading
    bam = pysam.AlignmentFile(bam_file, 'rb')
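Pattern 1 is often wrapped in a helper so that tool failures surface with a readable message. The following is a sketch under that assumption. The `run_tool` name is hypothetical, and it is demonstrated with the Python interpreter so that it runs without FastQC installed.

```python
import subprocess
import sys


def run_tool(cmd):
    """Run an external command and return its stdout as text.

    Raises RuntimeError if the executable is missing or exits non-zero.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except FileNotFoundError as exc:
        raise RuntimeError(f"Executable not found: {cmd[0]}") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(
            f"{cmd[0]} exited with code {exc.returncode}: {exc.stderr.strip()}"
        ) from exc
    return result.stdout


# A real call might look like: run_tool(["fastqc", input_file, "-o", output_dir])
print(run_tool([sys.executable, "-c", "print('ok')"]).strip())  # ok
```

Passing the command as a list (rather than a shell string) avoids quoting problems with file paths that contain spaces.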
Container Strategy
    # Dockerfile approach for reproducibility
    FROM python:3.11-slim
    # Pin exact package versions here for a reproducible build
    RUN pip install --no-cache-dir numpy pandas scikit-learn
    COPY pipeline.py /app/
    ENTRYPOINT ["python", "/app/pipeline.py"]
Output: Technical Specification
Deliverable to Software Developer includes:
- Architecture diagram (components + data flow)
- Component specifications (inputs, outputs, responsibilities)
- Technology stack (exact versions)
- Data structures (schemas, formats)
- Error handling (what to do when steps fail)
- Performance requirements (memory, time, storage)
- Testing strategy (unit, integration, validation)
- Architecture Context Document (`.architecture/context.md` - persistent context for incremental development)
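A component specification can also be captured in a structured form so that inputs, outputs, and responsibilities are explicit. This is a minimal sketch: the `ComponentSpec` class and the example values are hypothetical and are not part of the skill's templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentSpec:
    """One entry in the component specification handed to the developer."""
    name: str
    responsibility: str
    inputs: tuple = ()
    outputs: tuple = ()


# Example values are illustrative only
validator = ComponentSpec(
    name="Validator",
    responsibility="Check FASTQ integrity and format",
    inputs=("FASTQ files",),
    outputs=("validated FASTQ paths",),
)
print(validator.name, "->", ", ".join(validator.outputs))
```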
References
For detailed guidance:
- `references/architecture_patterns.md` - Common patterns with pros/cons
- `references/data_structure_guide.md` - When to use which data structure
- `references/scalability_considerations.md` - Memory, compute, storage planning
- `references/integration_patterns.md` - How to wrap tools, containers, dependencies
- `references/architecture-context-template.md` - Architecture Context Document template, generation, and maintenance protocols
Example Architecture
Project: QC Pipeline for 1,000 RNA-seq Samples
Architecture Specification
Overview
Parallel QC pipeline processing 1,000 bulk RNA-seq FASTQ files with automated report generation.
Components
- Validator: Check FASTQ integrity, format
- QC Runner: Execute FastQC in parallel
- Aggregator: Combine metrics with MultiQC
- Reporter: Generate summary statistics and plots
Data Flow
FASTQ files → Validator → QC Runner (parallel) → Aggregator → HTML Report
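The Validator step is the first gate in this flow. The sketch below shows one way it might check a gzip-compressed FASTQ file, covering only the basic four-line record layout. The `validate_fastq` name and the tiny test file are hypothetical, and a production validator would likely check more, such as allowed characters and paired-end consistency.

```python
import gzip
import os
import tempfile


def validate_fastq(path):
    """Return the number of records in a gzip-compressed FASTQ file.

    Raises ValueError if a record does not follow the four-line layout:
    '@' header, sequence, '+' separator, quality string of equal length.
    """
    n_records = 0
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # clean end of file
            seq = handle.readline().rstrip("\n")
            plus = handle.readline()
            qual = handle.readline().rstrip("\n")
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"{path}: malformed record {n_records + 1}")
            if not seq or len(seq) != len(qual):
                raise ValueError(
                    f"{path}: sequence/quality mismatch in record {n_records + 1}"
                )
            n_records += 1
    return n_records


# Tiny synthetic file for demonstration (not real data)
with tempfile.TemporaryDirectory() as tmp:
    fq = os.path.join(tmp, "sample.fastq.gz")
    with gzip.open(fq, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
    print(validate_fastq(fq))  # 2
```

A truncated or corrupt gzip stream raises an exception from the `gzip` module itself, which the caller can treat as an integrity failure.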
Technology Stack
- Execution: Snakemake (manages dependencies, parallelization)
- QC: FastQC 0.12.1
- Aggregation: MultiQC 1.14
- Custom code: Python 3.11, pandas, matplotlib
- Storage: FASTQ (gzip), QC metrics (JSON), report (HTML)
Scalability
- Data: 1,000 samples × 50M reads × 100 bp ≈ 5 TB compressed FASTQ (1,000 × ~5 GB per sample, using the upper Storage Planning estimate)
- Compute: 100 parallel jobs on HPC cluster
- Time: 30 min per sample × 1,000 samples ÷ 100 parallel jobs = 300 min total (5 hours)
- Memory: 4 GB per FastQC job × 100 concurrent jobs = 400 GB total (distributed)
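The time and memory figures come from running the samples in waves of concurrent jobs. A minimal sketch of that calculation, assuming every job takes the same time and ignoring scheduler queue delays (the `batch_plan` helper is hypothetical):

```python
import math


def batch_plan(n_samples, parallel_jobs, minutes_per_sample, gb_per_job):
    """Estimate wall-clock minutes and peak aggregate memory for batched jobs."""
    waves = math.ceil(n_samples / parallel_jobs)  # rounds of concurrent jobs
    wall_minutes = waves * minutes_per_sample
    peak_memory_gb = min(n_samples, parallel_jobs) * gb_per_job
    return wall_minutes, peak_memory_gb


wall, mem = batch_plan(1_000, 100, 30, 4)
print(wall, mem)  # 300 400
```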
Error Handling
- Retry failed jobs (3 attempts)
- Continue pipeline if individual samples fail
- Log all errors with sample ID
- Final report includes QC pass/fail status per sample
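The retry-and-continue policy can be sketched in plain Python as follows. The function names and the failing task are hypothetical. In a Snakemake pipeline the same behaviour would more likely be configured through options such as `--retries` and `--keep-going` than hand-written.

```python
def run_with_retries(task, sample_id, attempts=3):
    """Call task(sample_id) up to `attempts` times; return (ok, error_message)."""
    last_error = ""
    for attempt in range(1, attempts + 1):
        try:
            task(sample_id)
            return True, ""
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = f"{sample_id} attempt {attempt}: {exc}"
    return False, last_error


def run_all(task, sample_ids):
    """Process every sample, recording pass/fail instead of stopping on failure."""
    return {sid: run_with_retries(task, sid) for sid in sample_ids}


# Hypothetical task that always fails for one sample
def fake_qc(sample_id):
    if sample_id == "S2":
        raise RuntimeError("corrupt FASTQ")


status = run_all(fake_qc, ["S1", "S2", "S3"])
print({sid: ok for sid, (ok, _) in status.items()})
# {'S1': True, 'S2': False, 'S3': True}
```

The per-sample status dictionary maps directly onto the pass/fail column of the final report, and each error message carries the sample ID for logging.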
Deployment
- Install: micromamba env from environment.yml
- Config: samples.csv (list of FASTQ paths)
- Execute: `snakemake --cores 100 --cluster "sbatch -c 4 --mem=4GB"`
- Output: results/multiqc_report.html
Hand off to the Software Developer for implementation.
Success Criteria
Architecture is complete when:
- All components clearly defined
- Data flow unambiguous
- Technology choices justified
- Scalability analyzed (memory, compute, storage)
- Error handling planned
- Developer can implement without architecture questions