Architecture Evaluation Framework
Current Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| OS | Talos Linux | Immutable, API-driven Kubernetes OS |
| GitOps | Flux + ResourceSets | Declarative cluster state reconciliation |
| CNI/Network | Cilium | eBPF networking, network policies, Hubble observability |
| Storage | Longhorn | Distributed block storage with S3 backup |
| Object Storage | Garage | S3-compatible distributed object storage |
| Database | CNPG (CloudNativePG) | PostgreSQL operator with HA and backups |
| Cache/KV | Dragonfly | Redis-compatible in-memory store |
| Monitoring | kube-prometheus-stack | Prometheus + Grafana + Alertmanager |
| Logging | Alloy → Loki | Log collection pipeline |
| Certificates | cert-manager | Automated TLS certificate management |
| Secrets | ESO + AWS SSM | External Secrets Operator with Parameter Store |
| Upgrades | Tuppr | Declarative Talos/Kubernetes/Cilium upgrades |
| Infrastructure | Terragrunt + OpenTofu | Infrastructure as Code for bare-metal provisioning |
| CI/CD | GitHub Actions + OCI | Artifact-based promotion pipeline |
Evaluation Criteria
When evaluating any proposed technology addition or architecture change, assess against these criteria:
1. Principle Alignment
Score the proposal against each core principle (Strong/Weak/Neutral):
- Enterprise at Home: Does it reflect production-grade patterns?
- Everything as Code: Can it be fully represented in git?
- Automation is Key: Does it reduce or increase manual toil?
- Learning First: Does it teach valuable enterprise skills?
- DRY and Code Reuse: Does it leverage existing patterns or create duplication?
- Continuous Improvement: Does it make the system more maintainable?
2. Stack Fit
- Does this overlap with existing tools? (e.g., adding Redis when Dragonfly exists)
- Does it integrate with the GitOps workflow? (Must be Flux-deployable)
- Does it work on bare-metal? (No cloud-only services)
- Does it support the multi-cluster model? (dev → integration → live)
3. Operational Cost
- How is it monitored? (Must integrate with kube-prometheus-stack)
- How is it backed up? (Must have a recovery story)
- How does it handle upgrades? (Must be declarative, ideally via Renovate)
- What's the failure blast radius? (Isolated > cluster-wide)
4. Complexity Budget
- Is the complexity justified by the learning value?
- Could a simpler existing tool solve the same problem?
- What's the maintenance burden over 12 months?
5. Alternative Analysis
- What existing stack components could solve this? (Always check first)
- What are the top 2-3 alternatives in the ecosystem?
- What do other production homelabs use? (kubesearch research)
6. Failure Modes
- What happens when this component is unavailable?
- How does it interact with network policies? (Default deny)
- What's the recovery procedure? (Must be documented in a runbook)
- Can it self-heal? (Strong preference for self-healing)
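Under a default-deny posture, any new component needs explicit policy before it can talk to anything. A minimal sketch of the baseline this implies — a standard Kubernetes NetworkPolicy that denies all ingress and egress in a namespace (Cilium enforces these; the `myapp` namespace name is a placeholder, and a real setup may use CiliumNetworkPolicy or profile labels instead):

```yaml
# Baseline default-deny for a hypothetical "myapp" namespace.
# Selects every pod (empty podSelector) and declares both policy
# types with no allow rules, so all traffic is dropped until
# explicit allow policies are layered on top.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: myapp
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```

Evaluating a component's failure modes then includes asking which allow rules it needs (DNS, upstream APIs, metrics scraping) beyond this baseline.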
Common Design Patterns
New Application
- HelmRelease via ResourceSet (flux-gitops pattern)
- Namespace with network-policy profile label
- ExternalSecret for credentials
- ServiceMonitor + PrometheusRule for observability
- GarageBucketClaim if S3 storage needed
- CNPG Cluster if database needed
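The first two bullets above can be sketched as a Flux HelmRelease plus a ServiceMonitor. This is a hedged illustration, not the repo's actual manifests: the `myapp` names, chart source, and `metrics` port are placeholders, and in the flux-gitops pattern the HelmRelease would typically be templated through a ResourceSet rather than committed verbatim:

```yaml
# Hypothetical application deployed via Flux.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: myapp
  namespace: myapp
spec:
  interval: 30m
  chart:
    spec:
      chart: myapp
      version: "1.2.3"
      sourceRef:
        kind: HelmRepository
        name: myapp-charts
        namespace: flux-system
---
# Day-1 observability: scrape the app's metrics endpoint.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: myapp
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  endpoints:
    - port: metrics
      interval: 30s
```

A PrometheusRule with at least an availability alert would accompany the ServiceMonitor, per the "monitoring is day-1" rule below.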
New Infrastructure Component
- OpenTofu module in `infrastructure/modules/`
- Unit in the appropriate stack under `infrastructure/units/`
- Test coverage in `.tftest.hcl` files
- Version pinned in `versions.env` if applicable
New Secret
- Store in AWS SSM Parameter Store
- Reference via ExternalSecret CR
- Never commit to git, not even encrypted
- 存储在AWS SSM Parameter Store中
- 通过ExternalSecret CR引用
- 绝对不要提交到git,即使加密也不允许
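The pattern above keeps only a pointer in git. A minimal sketch of the ExternalSecret side, assuming a cluster-wide secret store named `aws-ssm` backed by Parameter Store (store name, namespace, and parameter path are all hypothetical):

```yaml
# Materializes a Kubernetes Secret from AWS SSM Parameter Store.
# Only this reference lives in git; the secret value never does.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-credentials
  namespace: myapp
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-ssm            # hypothetical store name
  target:
    name: myapp-credentials  # resulting Secret in-cluster
  data:
    - secretKey: password
      remoteRef:
        key: /myapp/password # hypothetical SSM parameter path
```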
New Storage
- Longhorn PVC for block storage (default)
- GarageBucketClaim for object storage (S3-compatible)
- Never use hostPath or emptyDir for persistent data
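The block-storage default looks like an ordinary PVC bound to the Longhorn StorageClass — a sketch with placeholder names and size; the actual StorageClass name in the cluster may differ:

```yaml
# Longhorn-backed persistent volume claim (hypothetical app and size).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-data
  namespace: myapp
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn   # assumed StorageClass name
  resources:
    requests:
      storage: 10Gi
```

Because Longhorn replicates the volume across nodes and backs it up to S3, this survives node loss — which hostPath and emptyDir do not.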
New Database
- CNPG Cluster CR for PostgreSQL
- Automated backups to Garage S3
- Connection pooling via PgBouncer (CNPG-managed)
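A hedged sketch of what the first two bullets amount to as a CNPG Cluster with Barman backups pointed at Garage — the cluster name, in-cluster Garage endpoint, bucket, and credentials Secret are all assumptions, not the repo's actual values:

```yaml
# 3-instance HA PostgreSQL cluster with S3 backups to Garage.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: myapp-db
  namespace: myapp
spec:
  instances: 3
  storage:
    size: 10Gi
  backup:
    retentionPolicy: "14d"
    barmanObjectStore:
      destinationPath: s3://myapp-db-backups/        # hypothetical bucket
      endpointURL: http://garage.garage.svc:3900     # assumed Garage service endpoint
      s3Credentials:
        accessKeyId:
          name: myapp-db-s3          # Secret delivered via ExternalSecret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: myapp-db-s3
          key: SECRET_ACCESS_KEY
```

Connection pooling would be layered on with a CNPG `Pooler` resource, which runs PgBouncer in front of the cluster without separate manual management.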
New Network Exposure
- HTTPRoute for HTTP/HTTPS traffic (Gateway API)
- Appropriate network-policy profile label
- cert-manager Certificate for TLS
- Internal gateway for internal-only services
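Routing and TLS from the list above can be sketched with a Gateway API HTTPRoute and a cert-manager Certificate. The gateway name, hostname, issuer, and backend port are illustrative placeholders:

```yaml
# Route HTTP traffic for a hypothetical internal-only service.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp
  namespace: myapp
spec:
  parentRefs:
    - name: internal           # assumed internal Gateway
      namespace: kube-system   # assumed Gateway namespace
  hostnames:
    - myapp.example.com
  rules:
    - backendRefs:
        - name: myapp
          port: 8080
---
# TLS certificate issued by cert-manager for the same hostname.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-tls
  namespace: myapp
spec:
  secretName: myapp-tls
  dnsNames:
    - myapp.example.com
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt          # hypothetical issuer name
```

In practice the TLS Secret is referenced on the Gateway listener, so per-route Certificates may be unnecessary if the gateway terminates TLS with a wildcard cert.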
Anti-Patterns to Challenge
| Anti-Pattern | Why It's Wrong | Correct Approach |
|---|---|---|
| "Just run a container" without monitoring | Invisible failures, no alerting | ServiceMonitor + PrometheusRule required |
| Adding a new tool when existing ones suffice | Stack bloat, maintenance burden | Evaluate existing stack first |
| Skipping observability "for now" | Technical debt that never gets paid | Monitoring is day-1, not day-2 |
| Manual operational steps | Drift, inconsistency, bus factor | Everything declarative via GitOps |
| Cloud-only services | Vendor lock-in, can't run on bare-metal | Self-hosted alternatives preferred |
| Single-instance without HA story | Single point of failure | At minimum, document recovery procedure |
| Storing state outside git | Shadow configuration, drift | Git is the source of truth |