openchoreo-platform-engineer

OpenChoreo Platform Engineer Guide

Help with OpenChoreo platform-level work. Keep this file generic and pull specifics from the reference docs or the live cluster only when needed.

Scope and pairing

Use this skill for PE-owned work:

Cluster-side setup, upgrades, and troubleshooting
`kubectl`, Helm, CRD, controller, or agent investigation
Platform resources such as DataPlane, BuildPlane, ObservabilityPlane, Environment, DeploymentPipeline, Project, ComponentType, Trait, and Workflow
Shared platform capabilities such as gateways, secret stores, registries, identity, RBAC, and observability

Activate `openchoreo-developer` at the same time when the task also includes any of these:

Deploying or debugging an application
Editing app-facing Component, Workload, ReleaseBinding, or `workload.yaml`
Using `occ` to inspect or operate a developer workload

If both skills are available and the task touches both app behavior and platform behavior, use both immediately. Do not wait to fail on one side before loading the other.

Working style

Prefer progressive discovery over memorized specifics:

Identify the exact plane, namespace, resource, or failure domain.
Inspect live state first with `kubectl`, `occ`, Helm, and current resource YAML.
Read only the reference file that matches the task.
Make the smallest change that can prove or fix the issue.
Verify the result from the live cluster before moving on.

Treat the live cluster and current repo as the source of truth. If a remembered field name, example, or behavior conflicts with current output, trust the current output and then confirm in the relevant reference file or repository source.

Avoid loading all references up front. Pull them in only when the task requires that area.

Reference routing

Read only what the task needs:

`references/operations.md` for namespace provisioning, topology, multi-cluster connectivity, and upgrades
`references/templates-and-workflows.md` for ComponentTypes, Traits, Workflows, CEL, and template rules
`references/integrations.md` for secret stores, registries, identity, RBAC, webhooks, and API management
`references/observability.md` for logs, metrics, traces, alerts, and notification channels
`references/troubleshooting.md` for failure isolation, health checks, log locations, and common failure patterns
`references/cli-and-resources.md` for PE-relevant `occ` commands and platform resource schemas
`references/mcp-reference.md` for MCP tool usage: mapping platform workflows to MCP tools, initial platform setup order, platform resource schemas via MCP, and MCP-specific gotchas. Read this when operating through an MCP-connected AI agent instead of the CLI.
`references/gitops.md` for GitOps repository layout and release flow
`references/community-modules.md` for pluggable gateways and observability backends
`references/advanced-setup.md` for certificates, private Git, custom build flows, and identity-provider swaps
`references/repo-and-context7.md` when the docs are not enough and you need controller logic, CRD definitions, or Helm chart details

Discovery-first workflow

1. Classify the task

Decide whether the work is:

Pure platform work
App work that needs PE help
A mixed task that needs both OpenChoreo skills

For mixed tasks, keep the app-facing thread and the platform-facing thread connected. Many deployment failures are caused by an interaction between Component config and platform config.

2. Inspect the current state before planning

Start with the smallest useful inspection:

Resource YAML for the object already involved
`status.conditions`
Relevant controller, gateway, or agent logs
Current Helm release values when the issue might be installation- or upgrade-related

Do not assume a field exists because it appeared in an older example. Inspect the current CR, schema, or docs before patching. This matters especially for overrides, plane registration, workflow configuration, and trait parameters.
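
The inspection list above could be sketched as a pair of small shell helpers. This is a minimal sketch, not an OpenChoreo-provided tool: `inspect_resource` and `show_release_values` are hypothetical function names, and the kind, name, namespace, and release arguments are placeholders to be filled in from the live cluster. It uses only the standard `kubectl get` and `helm get values` commands.

```shell
# Hypothetical helpers; every argument is a placeholder supplied by the caller.

# Print the current YAML and a one-line-per-condition summary for one object.
inspect_resource() {
  local kind="$1" name="$2" namespace="$3"

  # Full current YAML for the object already involved.
  kubectl get "$kind" "$name" -n "$namespace" -o yaml

  # status.conditions as tab-separated type, status, reason, message.
  kubectl get "$kind" "$name" -n "$namespace" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}

# Print the user-supplied values of a Helm release, for install or upgrade issues.
show_release_values() {
  local release="$1" namespace="$2"
  helm get values "$release" -n "$namespace"
}
```

A call would look like `inspect_resource <kind> <name> <namespace>`, with all three values taken from the object that is actually failing rather than from an older example.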

3. Route to the right source of detail

After the first inspection, load the matching reference file. If the reference still leaves ambiguity:

Inspect the repository or generated CRDs
Use Context7 for current OpenChoreo docs
Check the live object shape on the cluster

Keep the investigation targeted. Avoid a full-cluster inventory unless the failure is clearly systemic or the affected resource is still unknown.
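
Checking the live object shape, as suggested above, might look like the following sketch. `show_object_shape` is a hypothetical helper name, and the kind, field path, and API group are placeholders that should be read from the installed CRDs. It assumes the CRD publishes a schema that `kubectl explain` can read.

```shell
# Hypothetical helper; kind, field path, and API group are caller-supplied placeholders.
show_object_shape() {
  local kind="$1" field_path="$2" api_group="$3"

  # Schema documentation for one field, as published by the installed CRD.
  kubectl explain "${kind}.${field_path}"

  # Resource kinds served in the group; the NAMESPACED column shows each one's scope.
  kubectl api-resources --api-group="$api_group"
}
```

Both commands are read-only and target a single kind or group, which keeps the investigation narrow.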

4. Change one layer at a time

Platform tasks often span multiple layers:

Helm install values
Control plane namespace resources
Remote plane resources
Gateway or secret backend configuration
App-visible outcomes such as available types, workflows, or routes

Change the layer that is actually responsible, then re-check the dependent layers. Do not "fix" an application symptom by guessing at platform internals.
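
One way to keep a change confined to a single layer is to preview one manifest before applying it. The sketch below assumes the change is expressed as a single manifest file. `change_one_layer` is a hypothetical helper name and the `apply` mode argument is an illustrative convention, not an OpenChoreo or `kubectl` feature.

```shell
# Hypothetical helper: preview one manifest, and apply it only when asked to.
change_one_layer() {
  local manifest="$1" mode="${2:-preview}" rc=0

  # kubectl diff exits 0 for no differences, 1 for differences, >1 on error.
  kubectl diff -f "$manifest" || rc=$?

  if [ "$rc" -gt 1 ]; then
    echo "diff failed for $manifest" >&2
    return "$rc"
  fi
  if [ "$rc" -eq 0 ]; then
    echo "no changes in $manifest"
    return 0
  fi
  if [ "$mode" = "apply" ]; then
    kubectl apply -f "$manifest"
  fi
}
```

Running it first without the second argument shows the diff only. Running it again with `apply` makes the change, after which the dependent layers can be re-checked.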

5. Verify with live evidence

Verification should come from the platform, not assumption:

Resource conditions changed as expected
Controller or agent logs show the new state
Helm release and pod rollout are healthy
The downstream app-facing symptom is gone

If the platform change succeeded but the app still fails, hand off to or continue with `openchoreo-developer`.
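
The Helm, rollout, and log checks from this step could be gathered into one helper, sketched below. `verify_platform_change` is a hypothetical name, and the release, namespace, and deployment arguments are placeholders. It assumes the controller or agent runs as a Deployment. The 120-second timeout and 10-minute log window are arbitrary illustrative values.

```shell
# Hypothetical helper; release, namespace, and deployment names are placeholders.
verify_platform_change() {
  local release="$1" namespace="$2" deployment="$3"

  # Helm release state after the change.
  helm status "$release" -n "$namespace" || return 1

  # Block until the rollout finishes or the timeout expires.
  kubectl rollout status "deployment/${deployment}" -n "$namespace" --timeout=120s || return 1

  # Recent logs, to confirm the controller or agent reports the new state.
  kubectl logs "deployment/${deployment}" -n "$namespace" --since=10m --tail=100
}
```

The helper stops at the first failing check, so a non-zero exit points at the layer that still needs attention.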

Stable guardrails

Keep these in mind because they are durable and high-value:

Platform work usually requires `kubectl` and often Helm; developer work usually centers on `occ`
Upgrade order matters; do not move a remote plane ahead of the control plane
Scope matters; cluster-scoped and namespace-scoped resources are not interchangeable
`status.conditions`, live resource YAML, and current controller logs are better truth sources than memory
When a task needs exact controller behavior or CRD fields, inspect the repo or Context7 instead of guessing
Prefer reversible, inspectable changes over broad edits across many planes or namespaces
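
The upgrade-order guardrail can be turned into a pre-flight check, sketched here under loud assumptions. `plane_not_ahead` is a hypothetical helper, the version strings are illustrative and must be read from the live Helm releases, and the comparison relies on a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS). Pre-release suffixes are not handled.

```shell
# Hypothetical check: succeeds only when the plane version is not newer than
# the control plane version. Assumes plain version strings such as v0.3.2.
plane_not_ahead() {
  local control_version="$1" plane_version="$2" highest

  # Version-sort both strings; the highest must be the control plane's.
  highest="$(printf '%s\n%s\n' "$control_version" "$plane_version" | sort -V | tail -n 1)"
  [ "$highest" = "$control_version" ]
}
```

For example, `plane_not_ahead v0.10.0 v0.9.0` succeeds, while `plane_not_ahead v0.3.2 v0.4.0` fails because the plane would move ahead of the control plane.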

Anti-patterns

Loading every reference file before identifying the actual problem
Repeating stale examples without checking the current cluster or resource schema
Performing wide cluster sweeps before checking the affected object and logs
Treating app-level deployment symptoms as purely platform issues without checking the app resource chain
Making several platform changes at once and losing the causal signal
