architecture-review

Architecture Evaluation Framework

Current Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| OS | Talos Linux | Immutable, API-driven Kubernetes OS |
| GitOps | Flux + ResourceSets | Declarative cluster state reconciliation |
| CNI/Network | Cilium | eBPF networking, network policies, Hubble observability |
| Storage | Longhorn | Distributed block storage with S3 backup |
| Object Storage | Garage | S3-compatible distributed object storage |
| Database | CNPG (CloudNativePG) | PostgreSQL operator with HA and backups |
| Cache/KV | Dragonfly | Redis-compatible in-memory store |
| Monitoring | kube-prometheus-stack | Prometheus + Grafana + Alertmanager |
| Logging | Alloy → Loki | Log collection pipeline |
| Certificates | cert-manager | Automated TLS certificate management |
| Secrets | ESO + AWS SSM | External Secrets Operator with Parameter Store |
| Upgrades | Tuppr | Declarative Talos/Kubernetes/Cilium upgrades |
| Infrastructure | Terragrunt + OpenTofu | Infrastructure as Code for bare-metal provisioning |
| CI/CD | GitHub Actions + OCI | Artifact-based promotion pipeline |

Evaluation Criteria

When evaluating any proposed technology addition or architecture change, assess against these criteria:

1. Principle Alignment

Score the proposal against each core principle (Strong/Weak/Neutral):
  • Enterprise at Home: Does it reflect production-grade patterns?
  • Everything as Code: Can it be fully represented in git?
  • Automation is Key: Does it reduce or increase manual toil?
  • Learning First: Does it teach valuable enterprise skills?
  • DRY and Code Reuse: Does it leverage existing patterns or create duplication?
  • Continuous Improvement: Does it make the system more maintainable?

2. Stack Fit

  • Does this overlap with existing tools? (e.g., adding Redis when Dragonfly exists)
  • Does it integrate with the GitOps workflow? (Must be Flux-deployable)
  • Does it work on bare-metal? (No cloud-only services)
  • Does it support the multi-cluster model? (dev → integration → live)

3. Operational Cost

  • How is it monitored? (Must integrate with kube-prometheus-stack)
  • How is it backed up? (Must have a recovery story)
  • How does it handle upgrades? (Must be declarative, ideally via Renovate)
  • What's the failure blast radius? (Isolated > cluster-wide)

4. Complexity Budget

  • Is the complexity justified by the learning value?
  • Could a simpler existing tool solve the same problem?
  • What's the maintenance burden over 12 months?

5. Alternative Analysis

  • What existing stack components could solve this? (Always check first)
  • What are the top 2-3 alternatives in the ecosystem?
  • What do other production homelabs use? (kubesearch research)

6. Failure Modes

  • What happens when this component is unavailable?
  • How does it interact with network policies? (Default deny)
  • What's the recovery procedure? (Must be documented in a runbook)
  • Can it self-heal? (Strong preference for self-healing)
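As a baseline for the default-deny question above, a minimal sketch of a namespace-wide deny-all policy using the standard Kubernetes NetworkPolicy API (which Cilium enforces); the namespace name is illustrative:

```yaml
# Default deny for every pod in the namespace: an empty podSelector
# matches all pods, and listing both policyTypes with no allow rules
# blocks all ingress and egress until explicit policies are added.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: example-app   # illustrative namespace
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```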

Common Design Patterns

New Application

  1. HelmRelease via ResourceSet (flux-gitops pattern)
  2. Namespace with network-policy profile label
  3. ExternalSecret for credentials
  4. ServiceMonitor + PrometheusRule for observability
  5. GarageBucketClaim if S3 storage needed
  6. CNPG Cluster if database needed
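The first two steps above can be sketched as a pair of manifests; the namespace, chart, and network-policy label key are illustrative assumptions, and in this repo the HelmRelease would typically be templated through a ResourceSet rather than committed directly:

```yaml
# Namespace carrying an (assumed) network-policy profile label.
apiVersion: v1
kind: Namespace
metadata:
  name: example-app
  labels:
    network-policy/profile: web   # hypothetical label key
---
# Flux HelmRelease reconciling the app chart declaratively.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: example-app
  namespace: example-app
spec:
  interval: 30m
  chart:
    spec:
      chart: example-app
      version: "1.2.3"
      sourceRef:
        kind: HelmRepository
        name: example-charts
        namespace: flux-system
```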

New Infrastructure Component

  1. OpenTofu module in `infrastructure/modules/`
  2. Unit in appropriate stack under `infrastructure/units/`
  3. Test coverage in `.tftest.hcl` files
  4. Version pinned in `versions.env` if applicable
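A minimal sketch of the module/unit split, assuming a conventional Terragrunt layout where a unit references a module by relative path; the stack, module, and input names are illustrative:

```hcl
# infrastructure/units/dev/example/terragrunt.hcl -- illustrative unit
# instantiating a reusable module from infrastructure/modules/example/.
terraform {
  source = "../../../modules/example"
}

# Inherit common settings from a parent terragrunt.hcl, if one exists.
include "root" {
  path = find_in_parent_folders()
}

# Per-stack inputs; tool/version pinning itself lives in versions.env.
inputs = {
  node_count = 3
}
```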

New Secret

  1. Store in AWS SSM Parameter Store
  2. Reference via ExternalSecret CR
  3. Never commit to git, not even encrypted
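The pattern above can be sketched as an ExternalSecret that materializes an SSM parameter as a Kubernetes Secret; the store name, namespace, and parameter path are illustrative:

```yaml
# ExternalSecret pulling a credential from AWS SSM Parameter Store
# into a Kubernetes Secret, so nothing secret ever lands in git.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-app-credentials
  namespace: example-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-ssm            # assumed store name
  target:
    name: example-app-credentials
  data:
    - secretKey: password
      remoteRef:
        key: /example-app/password   # illustrative SSM parameter path
```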

New Storage

  1. Longhorn PVC for block storage (default)
  2. GarageBucketClaim for object storage (S3-compatible)
  3. Never use hostPath or emptyDir for persistent data
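The default block-storage path can be sketched as a plain PVC against the Longhorn storage class; size and namespace are illustrative (object storage would instead go through a GarageBucketClaim, a custom resource in this stack):

```yaml
# PersistentVolumeClaim backed by Longhorn distributed block storage,
# replacing any temptation to reach for hostPath or emptyDir.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-app-data
  namespace: example-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi
```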

New Database

  1. CNPG Cluster CR for PostgreSQL
  2. Automated backups to Garage S3
  3. Connection pooling via PgBouncer (CNPG-managed)
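The first two steps above can be sketched as a CNPG Cluster with HA replicas and Barman backups to the S3-compatible Garage endpoint; the endpoint URL, bucket path, and credential secret names are illustrative:

```yaml
# CNPG Cluster: 3 instances for HA, Longhorn-backed storage, and
# continuous backups shipped to Garage via the S3 protocol.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-app-db
  namespace: example-app
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClass: longhorn
  backup:
    barmanObjectStore:
      destinationPath: s3://example-backups/example-app-db
      endpointURL: http://garage.storage.svc:3900   # illustrative Garage endpoint
      s3Credentials:
        accessKeyId:
          name: example-app-db-s3
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: example-app-db-s3
          key: SECRET_ACCESS_KEY
```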

New Network Exposure

  1. HTTPRoute for HTTP/HTTPS traffic (Gateway API)
  2. Appropriate network-policy profile label
  3. cert-manager Certificate for TLS
  4. Internal gateway for internal-only services
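The routing step can be sketched as a Gateway API HTTPRoute attached to an internal gateway; the gateway name/namespace, hostname, and backend are illustrative, and TLS would come from a cert-manager Certificate referenced by the Gateway listener:

```yaml
# HTTPRoute binding an internal-only hostname to the app Service
# through an (assumed) shared internal Gateway.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-app
  namespace: example-app
spec:
  parentRefs:
    - name: internal            # assumed internal Gateway name
      namespace: gateway-system # assumed Gateway namespace
  hostnames:
    - example-app.internal.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: example-app
          port: 80
```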

Anti-Patterns to Challenge

| Anti-Pattern | Why It's Wrong | Correct Approach |
| --- | --- | --- |
| "Just run a container" without monitoring | Invisible failures, no alerting | ServiceMonitor + PrometheusRule required |
| Adding a new tool when existing ones suffice | Stack bloat, maintenance burden | Evaluate existing stack first |
| Skipping observability "for now" | Technical debt that never gets paid | Monitoring is day-1, not day-2 |
| Manual operational steps | Drift, inconsistency, bus factor | Everything declarative via GitOps |
| Cloud-only services | Vendor lock-in, can't run on bare-metal | Self-hosted alternatives preferred |
| Single-instance without HA story | Single point of failure | At minimum, document recovery procedure |
| Storing state outside git | Shadow configuration, drift | Git is the source of truth |
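The "ServiceMonitor + PrometheusRule required" fix in the first row can be sketched as follows; the label selector, port name, and alert expression are illustrative:

```yaml
# ServiceMonitor so kube-prometheus-stack scrapes the app's metrics.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: example-app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
    - port: metrics      # assumed metrics port name on the Service
      interval: 30s
---
# PrometheusRule giving the workload a day-1 alert instead of
# invisible failures.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-app
  namespace: example-app
spec:
  groups:
    - name: example-app
      rules:
        - alert: ExampleAppDown
          expr: up{job="example-app"} == 0
          for: 5m
          labels:
            severity: critical
```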