Architecture Design - ML Project Template

This skill defines the standard code architecture for machine learning projects based on the template structure. When modifying or extending code, follow these patterns to maintain consistency.

Overview

概述

The project follows a modular, extensible architecture with clear separation of concerns. Each module (data, model, trainer, analysis) is independently organized using factory and registry patterns for maximum flexibility.

Core Design Patterns

核心设计模式

Factory Pattern

Each module uses a factory to create instances dynamically:

Example from `data_module/dataset/__init__.py`:

```python
from typing import Dict

DATASET_FACTORY: Dict = {}

def DatasetFactory(data_name: str):
    dataset = DATASET_FACTORY.get(data_name, None)
    if dataset is None:
        print(f"{data_name} dataset is not implemented, using the simple dataset")
        dataset = DATASET_FACTORY.get('simple')
    return dataset
```

For detailed guidance, refer to `references/factory_pattern.md`.
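To show the fallback behavior concretely, here is a minimal, self-contained sketch that reproduces the factory and exercises it with hypothetical stand-in classes (`GraphDataset` is illustrative, not part of the template):

```python
from typing import Dict

DATASET_FACTORY: Dict = {}

def DatasetFactory(data_name: str):
    # Look up the requested dataset; fall back to 'simple' when it is missing.
    dataset = DATASET_FACTORY.get(data_name, None)
    if dataset is None:
        print(f"{data_name} dataset is not implemented, using the simple dataset")
        dataset = DATASET_FACTORY.get('simple')
    return dataset

# Hypothetical entries standing in for registered dataset classes.
class SimpleDataset: ...
class GraphDataset: ...

DATASET_FACTORY['simple'] = SimpleDataset
DATASET_FACTORY['graph'] = GraphDataset

assert DatasetFactory('graph') is GraphDataset     # exact match
assert DatasetFactory('unknown') is SimpleDataset  # fallback path
```

Returning the `'simple'` entry rather than raising keeps pipelines running with a sensible default when a config names an unregistered dataset.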

Registry Pattern

Components register themselves via decorators:

Example from `data_module/dataset/simple_dataset.py`:

```python
@register_dataset("simple")
class SimpleDataset(Dataset):
    def __init__(self, data):
        self.data = data
```

For detailed guidance, refer to `references/registry_pattern.md`.
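The `register_dataset` decorator itself is not shown above; a minimal sketch of how such a registration decorator is typically implemented (the names and the factory dict mirror the factory example, but the body is an assumption, not the template's code):

```python
from typing import Callable, Dict, Type

DATASET_FACTORY: Dict[str, Type] = {}

def register_dataset(name: str) -> Callable[[Type], Type]:
    """Return a class decorator that records the class under `name`."""
    def decorator(cls: Type) -> Type:
        DATASET_FACTORY[name] = cls  # registration is a side effect of import
        return cls                   # the class itself is returned unchanged
    return decorator

@register_dataset("simple")
class SimpleDataset:
    def __init__(self, data):
        self.data = data
```

Because the class is returned unchanged, the decorator adds the registry entry without altering how the class is defined or used.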

Auto-Import Pattern

Modules automatically discover and import submodules:

Example from `data_module/dataset/__init__.py`:

```python
models_dir = os.path.dirname(__file__)
import_modules(models_dir, "src.data_module.dataset")
```

For detailed guidance, refer to `references/auto_import.md`.
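`import_modules` is defined elsewhere in the template; a plausible implementation using the standard library's `pkgutil` (the exact helper may differ) looks like this, demonstrated here on the stdlib `json` package so the sketch is runnable:

```python
import importlib
import os
import pkgutil

def import_modules(models_dir: str, package: str) -> None:
    """Import every top-level module found in `models_dir` under `package`,
    triggering each module's registration decorators as a side effect."""
    for module_info in pkgutil.iter_modules([models_dir]):
        importlib.import_module(f"{package}.{module_info.name}")

# Demonstration on a real package: imports json.decoder, json.encoder, etc.
import json
import_modules(os.path.dirname(json.__file__), "json")
```

This is what makes the registry pattern work: simply dropping a new file into the directory is enough for its `@register_*` decorator to run at import time.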

Directory Structure


```
project/
├── run/
│   ├── pipeline/            # Main workflow scripts
│   │   ├── training/        # Training pipelines
│   │   ├── prepare_data/    # Data preparation pipelines
│   │   └── analysis/        # Analysis pipelines
│   └── conf/                # Hydra configuration files
│       ├── training/        # Training configs
│       ├── dataset/         # Dataset configs
│       ├── model/           # Model configs
│       ├── prepare_data/    # Data prep configs
│       └── analysis/        # Analysis configs
├── src/
│   ├── data_module/         # Data processing module
│   │   ├── dataset/         # Dataset implementations
│   │   ├── augmentation/    # Data augmentation
│   │   ├── collate_fn/      # Collate functions
│   │   ├── compute_metrics/ # Metrics computation
│   │   ├── prepare_data/    # Data preparation logic
│   │   ├── data_func/       # Data utility functions
│   │   └── utils.py         # Module-specific utilities
│   │
│   ├── model_module/        # Model implementations
│   │   ├── brain_decoder/   # Brain decoder models
│   │   └── model/           # Alternative model location
│   │
│   ├── trainer_module/      # Training logic
│   ├── analysis_module/     # Analysis and evaluation
│   ├── llm/                 # LLM-related code
│   └── utils/               # Shared utilities
├── data/
│   ├── raw/                 # Original, immutable data
│   ├── processed/           # Cleaned, transformed data
│   └── external/            # Third-party data
├── outputs/
│   ├── logs/                # Training and evaluation logs
│   ├── checkpoints/         # Model checkpoints
│   ├── tables/              # Result tables
│   └── figures/             # Plots and visualizations
├── pyproject.toml           # Project configuration
├── uv.lock                  # Dependency lock file
├── TODO.md                  # Task tracking
├── README.md                # Project documentation
└── .gitignore               # Git ignore rules
```
For the detailed directory structure with file descriptions, refer to `references/structure.md`.

Module Organization


Creating a New Dataset


When adding a new dataset:
  1. Create the file in `src/data_module/dataset/`
  2. Use the `@register_dataset("name")` decorator
  3. Inherit from `torch.utils.data.Dataset`
  4. Implement `__init__`, `__len__`, and `__getitem__`
```python
from typing import Dict

import torch
from torch.utils.data import Dataset

from src.data_module.dataset import register_dataset

@register_dataset("custom")
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i: int) -> Dict[str, torch.Tensor]:
        return self.data[i]
```

Creating a New Model


CRITICAL: Models use a config-driven pattern.

When adding a new model:
  1. Create the file in `src/model_module/model/` or the appropriate module subdirectory
  2. Use the `@register_model('ModelName')` decorator
  3. `__init__` accepts ONLY the `cfg` parameter - all hyperparameters come from the config
  4. `forward()` returns a dict: `{"loss": loss, "labels": labels, "logits": logits}`
  5. Handle training vs. inference modes using `self.training`
```python
import torch.nn as nn

from src.model_module.brain_decoder import register_model

@register_model('MyModel')
class MyModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.task = cfg.dataset.task

        # ALL parameters come from cfg
        self.hidden_dim = cfg.model.hidden_dim
        self.output_dim = cfg.dataset.target_size[cfg.dataset.task]

    def forward(self, x, labels=None, **kwargs):
        loss = logits = None
        if self.training:
            # Training logic: compute logits and loss
            pass
        else:
            # Inference logic: compute logits; loss may stay None
            pass

        return {"loss": loss, "labels": labels, "logits": logits}
```

Adding Data Augmentation


When adding augmentation:
  1. Create the file in `src/data_module/augmentation/`
  2. Implement the transformation function
  3. Register it with the factory if needed
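As a concrete illustration of these steps, here is a hedged sketch of a noise-injection augmentation. The `register_augmentation` decorator and the plain-list signature are assumptions mirroring the dataset registry pattern, not code from the template:

```python
import random
from typing import Callable, Dict, List

AUGMENTATION_FACTORY: Dict[str, Callable] = {}

def register_augmentation(name: str):
    # Hypothetical registry mirroring the dataset pattern.
    def decorator(fn: Callable) -> Callable:
        AUGMENTATION_FACTORY[name] = fn
        return fn
    return decorator

@register_augmentation("gaussian_noise")
def gaussian_noise(signal: List[float], std: float = 0.1) -> List[float]:
    """Add zero-mean Gaussian noise to each sample of the signal."""
    return [x + random.gauss(0.0, std) for x in signal]

# Look up the augmentation through the factory and apply it.
augmented = AUGMENTATION_FACTORY["gaussian_noise"]([1.0, 2.0, 3.0], std=0.01)
```

In a real module the transformation would typically operate on tensors, but the registration and lookup flow is the same.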

Code Style Guidelines


For comprehensive style guidelines, refer to `references/code_style.md`.

Key principles:
  • Always use type hints for function signatures
  • Follow the import order: standard library → third-party → local
  • Module `__init__.py` files contain the factory/registry logic
  • Model classes must be config-driven
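A short sketch illustrating the import-order and type-hint conventions above (the commented-out module names are illustrative placeholders, not template imports):

```python
# 1. Standard library
from typing import List

# 2. Third-party
# import torch

# 3. Local
# from src.utils import seed_everything  (hypothetical local import)

def split_names(raw: str, sep: str = ",") -> List[str]:
    """Every signature carries type hints, per the guidelines."""
    return [part.strip() for part in raw.split(sep) if part.strip()]
```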

Configuration Management


The project uses Hydra for configuration management:
  • Config files in `run/conf/` are organized by module
  • Each stage (training, analysis) has its own config structure
  • Use YAML files for all configuration
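For orientation, here is a hypothetical top-level training config showing how Hydra's defaults list composes the per-module config groups; the file name and field values are illustrative, not taken from the template:

```yaml
# run/conf/training/default.yaml (hypothetical example)
defaults:
  - dataset: simple      # resolved from run/conf/dataset/simple.yaml
  - model: my_model      # resolved from run/conf/model/my_model.yaml
  - _self_

trainer:
  max_epochs: 50
  lr: 3e-4
```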

When Working on This Project


Before Modifying Code


  1. Read the relevant module's factory/registry pattern
  2. Check existing implementations for consistency
  3. Follow the established directory structure
  4. Use registration decorators for new components

Adding New Features


  1. Determine which module the feature belongs to
  2. Check if similar functionality exists
  3. Follow factory/registry pattern if creating new component types
  4. Add configuration files if needed
  5. Update documentation

Code Review Checklist


  • Uses factory/registry pattern appropriately
  • Follows module directory structure
  • Has proper type annotations
  • Imports are correctly ordered
  • Registration decorator is used
  • Configuration files are added if needed

Additional Resources


Reference Files


For detailed information, consult:
  • `references/structure.md` - detailed directory structure with file descriptions
  • `references/factory_pattern.md` - in-depth explanation of the factory pattern
  • `references/registry_pattern.md` - in-depth explanation of the registry pattern
  • `references/auto_import.md` - in-depth explanation of the auto-import pattern
  • `references/code_style.md` - comprehensive code style guidelines

Example Files


Working examples in `examples/`:
  • `examples/custom_dataset.py` - custom dataset implementation
  • `examples/custom_model.py` - custom model implementation
  • `examples/augmentation_example.py` - data augmentation example
  • `examples/config_example.yaml` - configuration file example
  • `examples/pipeline_example.sh` - pipeline script example