systems-architect

Systems Architect Skill

Purpose

Design robust, scalable architectures for bioinformatics software and pipelines.

When to Use This Skill

Use this skill when you need to:

Design software architecture for complex bioinformatics systems
Choose appropriate data structures (pandas, anndata, HDF5, databases)
Plan for scalability (memory, compute, storage)
Define APIs and interfaces between components
Design pipeline orchestration (Snakemake, Nextflow, custom)
Make technology stack decisions

Workflow Integration

Pattern: Requirements → Architecture Design → Implementation Spec

Biologist Commentator validates requirements
    ↓
Systems Architect designs architecture
    ↓
Produces technical specification
    ↓
Software Developer implements from spec

Core Responsibilities

1. System Design

Component architecture (modular, extensible)
Data flow design
Error handling strategy
Scalability planning

2. Technology Selection

Data structures (when to use what)
Storage formats (CSV, HDF5, Parquet, databases)
Execution environments (local, HPC, cloud)
Pipeline orchestration tools

3. Performance Planning

Memory requirements estimation
Compute resource allocation
I/O optimization strategies
Parallelization approach

4. Integration Strategy

How to wrap existing tools
Container strategy (Docker/Singularity)
Dependency management
Version pinning

5. Architecture Context Document

Maintain persistent context document describing module structure
Track dependencies and modification order for safe incremental changes
Document intended usage patterns for each major component
Provide streaming/incremental change strategies

Architecture Context Document

The Architecture Context Document (`.architecture/context.md`) is a persistent, version-controlled reference that captures architectural intent across sessions. Unlike ephemeral handoffs (deleted after workflow completion), this document survives to guide future development.

Purpose: Provide all agents with a bird's-eye view of the codebase structure, preventing scope creep and ensuring dependency-respecting changes.

Lifecycle:

Created: During Phase 3 (Architecture Design) of programming-pm workflow
Updated: When architectural changes occur (new modules, dependency changes, interface modifications)
Read: By senior-developer and junior-developer before starting implementation (pre-flight step)

Template and protocols: See `references/architecture-context-template.md` for:

Four-section template (Module Interconnections, Usage Patterns, Modification Order, Streaming Strategies)
Generation protocol (Phase 3, Bootstrap Mode for existing codebases, SIMPLE mode abbreviation)
Maintenance protocol (when to update, staleness detection, drift handling)
Merge conflict resolution
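
The Modification Order section of the template records which modules must change before others. As a minimal sketch, assuming the module interconnections are available as a mapping from each module to the modules it depends on, a topological sort yields one safe order. The module names below are hypothetical and not taken from any real codebase.

```python
from graphlib import TopologicalSorter

# Hypothetical module graph: each key depends on the modules in its set.
DEPENDENCIES = {
    "io": set(),
    "qc": {"io"},
    "normalize": {"io", "qc"},
    "report": {"qc", "normalize"},
}


def modification_order(dependencies):
    """Return modules ordered so every dependency precedes its dependents."""
    return list(TopologicalSorter(dependencies).static_order())


order = modification_order(DEPENDENCIES)
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` on circular dependencies, which is itself a useful signal that the documented structure needs attention.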

Bootstrap Mode

For existing codebases without an Architecture Context Document, systems-architect generates the document during Phase 3 using static analysis:

List modules/components from directory structure (`src/`, `modules/`)
Infer dependencies from import statements
Mark unknowns explicitly with `[TBD]`, `[UNKNOWN]`, or `[INFERRED]` tags
Document incomplete areas as "Known Gaps" at document end

Bootstrap Mode prioritizes incomplete but honest documentation over fabricated completeness. Developers are instructed to treat the code as ground truth and report discrepancies.
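
The import-inference step can be sketched with Python's standard `ast` module. This is an illustration under assumptions, not the skill's actual implementation: the helper name `infer_imports` is hypothetical, it only handles Python sources, and it skips relative imports because they point inside the same package. Anything it reports would be tagged `[INFERRED]` in the context document.

```python
import ast


def infer_imports(source: str) -> set[str]:
    """Return the top-level package names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" contributes "a"
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" contributes "a"; relative imports are skipped
            modules.add(node.module.split(".")[0])
    return modules


example = "import numpy as np\nfrom pandas.io import parsers\nfrom . import utils\n"
deps = infer_imports(example)
print(sorted(deps))
```

Static analysis of this kind misses dynamic imports and optional dependencies, which is why the results are marked as inferred and not stated as fact.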

Standard Architecture Template

Use `assets/architecture_template.md`:

System Architecture: [Project Name]

Overview

[1-2 sentence system description]

Components

Data Flow

[Input] → [Processing] → [Output]

Technology Stack

Language: Python 3.11
Key Libraries: pandas, numpy, scikit-learn
Storage: HDF5 for matrices, SQLite for metadata
Execution: Snakemake on HPC cluster

Scalability

Dataset size: [Expected range]
Memory: [Requirements]
Compute: [CPU cores, time estimates]
Storage: [Space requirements]

Error Handling

[Strategy for failures, retries, logging]

Deployment

[Installation, configuration, execution]

Data Structure Selection Guide

See `references/data_structure_guide.md` for full details.

Quick Reference:

| Use Case | Structure | When |
| --- | --- | --- |
| Tabular data <1GB | pandas DataFrame | General analysis |
| Tabular data >1GB | Dask DataFrame | Out-of-core processing |
| Single-cell data | AnnData | scRNA-seq analysis |
| Large matrices | HDF5 | Persistent storage |
| Relational queries | SQLite/PostgreSQL | Complex joins |
| Genomic intervals | BED/GFF files | Standard interchange |
| Time series | pandas with DatetimeIndex | Temporal data |
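
The first two rows of the quick reference amount to a size threshold. A minimal sketch of that rule follows; the function name is hypothetical, the 1 GB boundary is read as a decimal gigabyte, and data of exactly 1 GB (which the table leaves unspecified) is assigned to Dask as an assumption.

```python
ONE_GB = 1_000_000_000  # decimal gigabyte; an assumption about the table's unit


def choose_tabular_structure(size_bytes: int) -> str:
    """Pick an in-memory or out-of-core structure for tabular data by size."""
    if size_bytes < ONE_GB:
        return "pandas DataFrame"
    return "Dask DataFrame"


print(choose_tabular_structure(200_000_000))
print(choose_tabular_structure(5_000_000_000))
```

In practice the threshold depends on available RAM and on how much intermediate copying the analysis does, so treat the 1 GB figure as a starting point.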

Scalability Considerations

Memory Estimation

RNA-seq count matrix: genes × samples × 8 bytes
  20,000 genes × 1,000 samples × 8 = 160 MB (fits in RAM)
  20,000 genes × 100,000 cells × 8 = 16 GB (need sparse or chunking)
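
The arithmetic above can be wrapped in a small helper. This is a sketch under assumptions: the function names are hypothetical, the dense estimate assumes 8-byte values as in the text, and the sparse estimate assumes a CSR layout with 4-byte indices and a density that you supply (the 10% used below is illustrative, not a measured figure).

```python
def dense_matrix_bytes(n_genes: int, n_samples: int, bytes_per_value: int = 8) -> int:
    """Memory for a dense genes x samples matrix."""
    return n_genes * n_samples * bytes_per_value


def csr_matrix_bytes(n_rows: int, n_cols: int, density: float,
                     value_bytes: int = 8, index_bytes: int = 4) -> int:
    """Approximate memory for a CSR sparse matrix: values, column indices, row pointers."""
    nnz = round(n_rows * n_cols * density)
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


bulk = dense_matrix_bytes(20_000, 1_000)           # 160,000,000 bytes = 160 MB
single_cell = dense_matrix_bytes(20_000, 100_000)  # 16,000,000,000 bytes = 16 GB
sparse = csr_matrix_bytes(20_000, 100_000, density=0.10)  # assumed density

print(f"dense bulk: {bulk / 1e6:.0f} MB")
print(f"dense single-cell: {single_cell / 1e9:.0f} GB")
print(f"sparse single-cell: {sparse / 1e9:.1f} GB")
```

These figures cover only the matrix itself; intermediate copies made during normalization or filtering can multiply the peak requirement.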

Compute Planning

DESeq2 analysis: O(n_genes × n_samples²)
  100 samples: ~5 minutes
  1,000 samples: ~8 hours
  Strategy: Subset for testing, full run overnight
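
Given the quadratic scaling stated above, a runtime measured on a subset can be extrapolated to the full dataset. The helper below is a hypothetical sketch: it assumes the number of genes is held fixed and that the stated exponent holds across the whole range, which should be checked against a second timing before relying on it.

```python
def extrapolate_runtime_minutes(ref_minutes: float, ref_samples: int,
                                target_samples: int, exponent: float = 2.0) -> float:
    """Scale a measured runtime by (target / reference) ** exponent."""
    return ref_minutes * (target_samples / ref_samples) ** exponent


estimate = extrapolate_runtime_minutes(ref_minutes=5, ref_samples=100, target_samples=1_000)
print(f"{estimate:.0f} minutes (~{estimate / 60:.1f} hours)")
```

With the figures from the text, a 5-minute run on 100 samples extrapolates to 500 minutes for 1,000 samples, which matches the roughly 8 hours quoted.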

Storage Planning

FASTQ (compressed): 50-100 MB per million reads
  50M reads = 2.5-5 GB
  100 samples × 50M reads = 250-500 GB (plan for the upper bound)
  Strategy: Delete FASTQ after alignment, keep BAM
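
The storage estimate is a straight multiplication, sketched below with a hypothetical helper. It assumes decimal units (1 GB = 1,000 MB) and defaults to the upper end of the per-million-reads range so that the plan errs on the safe side.

```python
def fastq_storage_gb(reads_millions: float, n_samples: int = 1,
                     mb_per_million_reads: float = 100) -> float:
    """Estimated compressed FASTQ storage in GB."""
    return reads_millions * mb_per_million_reads * n_samples / 1000


per_sample = fastq_storage_gb(50)                                   # upper bound, one sample
cohort_high = fastq_storage_gb(50, n_samples=100)                   # upper bound, 100 samples
cohort_low = fastq_storage_gb(50, n_samples=100, mb_per_million_reads=50)

print(f"per sample: {per_sample:.0f} GB")
print(f"100 samples: {cohort_low:.0f}-{cohort_high:.0f} GB")
```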

Integration Patterns

Wrapping External Tools

```python
# Pattern 1: Subprocess call
import subprocess

result = subprocess.run(
    ['fastqc', input_file, '-o', output_dir],
    capture_output=True,
    check=True,
)

# Pattern 2: Python binding (preferred if available)
import pysam

bam = pysam.AlignmentFile(bam_file, 'rb')
```
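
When the subprocess pattern is used across many tools, it can help to route every call through one wrapper so that failures carry the tool's own error output. The sketch below is one possible shape and not a prescribed interface: `ToolError` and `run_tool` are hypothetical names, and the Python interpreter stands in for a real bioinformatics tool so the example runs anywhere.

```python
import subprocess
import sys


class ToolError(RuntimeError):
    """Raised when a wrapped external tool is missing or exits non-zero."""


def run_tool(args):
    """Run an external command and return its stdout as text."""
    try:
        completed = subprocess.run(args, capture_output=True, text=True, check=True)
    except FileNotFoundError as exc:
        raise ToolError(f"executable not found: {args[0]}") from exc
    except subprocess.CalledProcessError as exc:
        # Surface the tool's stderr so the pipeline log shows the real cause.
        raise ToolError(
            f"{args[0]} exited with {exc.returncode}: {exc.stderr.strip()}"
        ) from exc
    return completed.stdout


# Stand-in command; a real call might pass ['fastqc', input_file, '-o', output_dir].
output = run_tool([sys.executable, "-c", "print('ok')"])
print(output.strip())
```

Passing the command as a list, as in the original pattern, avoids shell quoting problems with file names that contain spaces.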

Container Strategy

```dockerfile
# Dockerfile approach for reproducibility
FROM python:3.11-slim
RUN pip install numpy pandas scikit-learn
COPY pipeline.py /app/
ENTRYPOINT ["python", "/app/pipeline.py"]
```

Output: Technical Specification

Deliverable to Software Developer includes:

Architecture diagram (components + data flow)
Component specifications (inputs, outputs, responsibilities)
Technology stack (exact versions)
Data structures (schemas, formats)
Error handling (what to do when steps fail)
Performance requirements (memory, time, storage)
Testing strategy (unit, integration, validation)
Architecture Context Document (`.architecture/context.md`): persistent context for incremental development
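
Component specifications can also be captured in a machine-checkable form alongside the written document. The sketch below is an illustration only: `ComponentSpec`, `unresolved_inputs`, and the two example components are hypothetical and not part of the skill's defined deliverable. It checks that every declared input is either produced by another component or supplied from outside the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentSpec:
    """Inputs, outputs, and responsibility of one pipeline component."""
    name: str
    responsibility: str
    inputs: tuple = ()
    outputs: tuple = ()


def unresolved_inputs(components, external_inputs=()):
    """Return inputs that no component produces and that are not supplied externally."""
    available = set(external_inputs)
    for component in components:
        available.update(component.outputs)
    return {
        artifact
        for component in components
        for artifact in component.inputs
        if artifact not in available
    }


# Hypothetical components, for illustration only.
spec = [
    ComponentSpec("read_qc", "Run per-sample read quality checks",
                  inputs=("fastq",), outputs=("qc_report",)),
    ComponentSpec("summary", "Aggregate QC reports across samples",
                  inputs=("qc_report", "sample_sheet"), outputs=("summary_table",)),
]
missing = unresolved_inputs(spec, external_inputs=("fastq",))
print(missing)
```

Here `sample_sheet` is flagged because nothing produces it and it was not declared as an external input, which is the kind of gap a developer would otherwise find only during implementation.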

References

For detailed guidance:

`references/architecture_patterns.md` - Common patterns with pros/cons
`references/data_structure_guide.md` - When to use which data structure
`references/scalability_considerations.md` - Memory, compute, storage planning
`references/integration_patterns.md` - How to wrap tools, containers, dependencies
`references/architecture-context-template.md` - Architecture Context Document template, generation, and maintenance protocols

Example Architecture

Project: QC Pipeline for 1,000 RNA-seq Samples

Architecture Specification