architecture-review

Architecture Evaluation Framework

Current Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| OS | Talos Linux | Immutable, API-driven Kubernetes OS |
| GitOps | Flux + ResourceSets | Declarative cluster state reconciliation |
| CNI/Network | Cilium | eBPF networking, network policies, Hubble observability |
| Storage | Longhorn | Distributed block storage with S3 backup |
| Object Storage | Garage | S3-compatible distributed object storage |
| Database | CNPG (CloudNativePG) | PostgreSQL operator with HA and backups |
| Cache/KV | Dragonfly | Redis-compatible in-memory store |
| Monitoring | kube-prometheus-stack | Prometheus + Grafana + Alertmanager |
| Logging | Alloy → Loki | Log collection pipeline |
| Certificates | cert-manager | Automated TLS certificate management |
| Secrets | ESO + AWS SSM | External Secrets Operator with Parameter Store |
| Upgrades | Tuppr | Declarative Talos/Kubernetes/Cilium upgrades |
| Infrastructure | Terragrunt + OpenTofu | Infrastructure as Code for bare-metal provisioning |
| CI/CD | GitHub Actions + OCI | Artifact-based promotion pipeline |

Evaluation Criteria

When evaluating any proposed technology addition or architecture change, assess against these criteria:

1. Principle Alignment

Score the proposal against each core principle (Strong/Weak/Neutral):
  • Enterprise at Home: Does it reflect production-grade patterns?
  • Everything as Code: Can it be fully represented in git?
  • Automation is Key: Does it reduce or increase manual toil?
  • Learning First: Does it teach valuable enterprise skills?
  • DRY and Code Reuse: Does it leverage existing patterns or create duplication?
  • Continuous Improvement: Does it make the system more maintainable?

2. Stack Fit

  • Does this overlap with existing tools? (e.g., adding Redis when Dragonfly exists)
  • Does it integrate with the GitOps workflow? (Must be Flux-deployable)
  • Does it work on bare-metal? (No cloud-only services)
  • Does it support the multi-cluster model? (dev → integration → live)

3. Operational Cost

  • How is it monitored? (Must integrate with kube-prometheus-stack)
  • How is it backed up? (Must have a recovery story)
  • How does it handle upgrades? (Must be declarative, ideally via Renovate)
  • What's the failure blast radius? (Isolated > cluster-wide)

4. Complexity Budget

  • Is the complexity justified by the learning value?
  • Could a simpler existing tool solve the same problem?
  • What's the maintenance burden over 12 months?

5. Alternative Analysis

  • What existing stack components could solve this? (Always check first)
  • What are the top 2-3 alternatives in the ecosystem?
  • What do other production homelabs use? (kubesearch research)

6. Failure Modes

  • What happens when this component is unavailable?
  • How does it interact with network policies? (Default deny)
  • What's the recovery procedure? (Must be documented in a runbook)
  • Can it self-heal? (Strong preference for self-healing)
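As a baseline for the default-deny question above, a minimal sketch of a namespace-wide deny-all policy using the standard Kubernetes NetworkPolicy API (which Cilium enforces); the namespace name is illustrative:

```yaml
# Default deny for every pod in the namespace: an empty podSelector
# matches all pods, and listing both policyTypes with no allow rules
# blocks all ingress and egress until explicit policies are added.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: example-app   # illustrative namespace
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```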

Common Design Patterns

New Application

  1. HelmRelease via ResourceSet (flux-gitops pattern)
  2. Namespace with network-policy profile label
  3. ExternalSecret for credentials
  4. ServiceMonitor + PrometheusRule for observability
  5. GarageBucketClaim if S3 storage needed
  6. CNPG Cluster if database needed
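The first two steps above can be sketched as a pair of manifests; the namespace, chart, and network-policy label key are illustrative assumptions, and in this repo the HelmRelease would typically be templated through a ResourceSet rather than committed directly:

```yaml
# Namespace carrying an (assumed) network-policy profile label.
apiVersion: v1
kind: Namespace
metadata:
  name: example-app
  labels:
    network-policy/profile: web   # hypothetical label key
---
# Flux HelmRelease reconciling the app chart declaratively.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: example-app
  namespace: example-app
spec:
  interval: 30m
  chart:
    spec:
      chart: example-app
      version: "1.2.3"
      sourceRef:
        kind: HelmRepository
        name: example-charts
        namespace: flux-system
```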

New Infrastructure Component

  1. OpenTofu module in `infrastructure/modules/`
  2. Unit in appropriate stack under `infrastructure/units/`
  3. Test coverage in `.tftest.hcl` files
  4. Version pinned in `versions.env` if applicable
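A minimal sketch of the module/unit split, assuming a conventional Terragrunt layout where a unit references a module by relative path; the stack, module, and input names are illustrative:

```hcl
# infrastructure/units/dev/example/terragrunt.hcl -- illustrative unit
# instantiating a reusable module from infrastructure/modules/example/.
terraform {
  source = "../../../modules/example"
}

# Inherit common settings from a parent terragrunt.hcl, if one exists.
include "root" {
  path = find_in_parent_folders()
}

# Per-stack inputs; tool/version pinning itself lives in versions.env.
inputs = {
  node_count = 3
}
```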

New Secret

  1. Store in AWS SSM Parameter Store
  2. Reference via ExternalSecret CR
  3. Never commit to git, not even encrypted
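The pattern above can be sketched as an ExternalSecret that materializes an SSM parameter as a Kubernetes Secret; the store name, namespace, and parameter path are illustrative:

```yaml
# ExternalSecret pulling a credential from AWS SSM Parameter Store
# into a Kubernetes Secret, so nothing secret ever lands in git.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-app-credentials
  namespace: example-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-ssm            # assumed store name
  target:
    name: example-app-credentials
  data:
    - secretKey: password
      remoteRef:
        key: /example-app/password   # illustrative SSM parameter path
```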

New Storage

  1. Longhorn PVC for block storage (default)
  2. GarageBucketClaim for object storage (S3-compatible)
  3. Never use hostPath or emptyDir for persistent data
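The default block-storage path can be sketched as a plain PVC against the Longhorn storage class; size and namespace are illustrative (object storage would instead go through a GarageBucketClaim, a custom resource in this stack):

```yaml
# PersistentVolumeClaim backed by Longhorn distributed block storage,
# replacing any temptation to reach for hostPath or emptyDir.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-app-data
  namespace: example-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi
```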

New Database

  1. CNPG Cluster CR for PostgreSQL
  2. Automated backups to Garage S3
  3. Connection pooling via PgBouncer (CNPG-managed)
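The first two steps above can be sketched as a CNPG Cluster with HA replicas and Barman backups to the S3-compatible Garage endpoint; the endpoint URL, bucket path, and credential secret names are illustrative:

```yaml
# CNPG Cluster: 3 instances for HA, Longhorn-backed storage, and
# continuous backups shipped to Garage via the S3 protocol.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-app-db
  namespace: example-app
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClass: longhorn
  backup:
    barmanObjectStore:
      destinationPath: s3://example-backups/example-app-db
      endpointURL: http://garage.storage.svc:3900   # illustrative Garage endpoint
      s3Credentials:
        accessKeyId:
          name: example-app-db-s3
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: example-app-db-s3
          key: SECRET_ACCESS_KEY
```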

New Network Exposure

  1. HTTPRoute for HTTP/HTTPS traffic (Gateway API)
  2. Appropriate network-policy profile label
  3. cert-manager Certificate for TLS
  4. Internal gateway for internal-only services
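The routing step can be sketched as a Gateway API HTTPRoute attached to an internal gateway; the gateway name/namespace, hostname, and backend are illustrative, and TLS would come from a cert-manager Certificate referenced by the Gateway listener:

```yaml
# HTTPRoute binding an internal-only hostname to the app Service
# through an (assumed) shared internal Gateway.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-app
  namespace: example-app
spec:
  parentRefs:
    - name: internal            # assumed internal Gateway name
      namespace: gateway-system # assumed Gateway namespace
  hostnames:
    - example-app.internal.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: example-app
          port: 80
```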

Anti-Patterns to Challenge

| Anti-Pattern | Why It's Wrong | Correct Approach |
| --- | --- | --- |
| "Just run a container" without monitoring | Invisible failures, no alerting | ServiceMonitor + PrometheusRule required |
| Adding a new tool when existing ones suffice | Stack bloat, maintenance burden | Evaluate existing stack first |
| Skipping observability "for now" | Technical debt that never gets paid | Monitoring is day-1, not day-2 |
| Manual operational steps | Drift, inconsistency, bus factor | Everything declarative via GitOps |
| Cloud-only services | Vendor lock-in, can't run on bare-metal | Self-hosted alternatives preferred |
| Single-instance without HA story | Single point of failure | At minimum, document recovery procedure |
| Storing state outside git | Shadow configuration, drift | Git is the source of truth |
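The "ServiceMonitor + PrometheusRule required" fix in the first row can be sketched as follows; the label selector, port name, and alert expression are illustrative:

```yaml
# ServiceMonitor so kube-prometheus-stack scrapes the app's metrics.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: example-app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
    - port: metrics      # assumed metrics port name on the Service
      interval: 30s
---
# PrometheusRule giving the workload a day-1 alert instead of
# invisible failures.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-app
  namespace: example-app
spec:
  groups:
    - name: example-app
      rules:
        - alert: ExampleAppDown
          expr: up{job="example-app"} == 0
          for: 5m
          labels:
            severity: critical
```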