openchoreo-platform-engineer
OpenChoreo Platform Engineer Guide
Help with OpenChoreo platform-level work. Keep this file generic and pull specifics from the reference docs or the live cluster only when needed.
Scope and pairing
Use this skill for PE-owned work:
- Cluster-side setup, upgrades, and troubleshooting
- `kubectl`, Helm, CRD, controller, or agent investigation
- Platform resources such as DataPlane, BuildPlane, ObservabilityPlane, Environment, DeploymentPipeline, Project, ComponentType, Trait, and Workflow
- Shared platform capabilities such as gateways, secret stores, registries, identity, RBAC, and observability
Activate `openchoreo-developer` at the same time when the task also includes any of these:
- Deploying or debugging an application
- Editing app-facing Component, Workload, ReleaseBinding, or `workload.yaml`
- Using `occ` to inspect or operate a developer workload
If both skills are available and the task touches both app behavior and platform behavior, use both immediately. Do not wait to fail on one side before loading the other.
Working style
Prefer progressive discovery over memorized specifics:
- Identify the exact plane, namespace, resource, or failure domain.
- Inspect live state first with `kubectl`, `occ`, Helm, and current resource YAML.
- Read only the reference file that matches the task.
- Make the smallest change that can prove or fix the issue.
- Verify the result from the live cluster before moving on.
Treat the live cluster and current repo as the source of truth. If a remembered field name, example, or behavior conflicts with current output, trust the current output and then confirm in the relevant reference file or repository source.
Avoid loading all references up front. Pull them in only when the task requires that area.
Reference routing
Read only what the task needs:
- `references/operations.md` for namespace provisioning, topology, multi-cluster connectivity, and upgrades
- `references/templates-and-workflows.md` for ComponentTypes, Traits, Workflows, CEL, and template rules
- `references/integrations.md` for secret stores, registries, identity, RBAC, webhooks, and API management
- `references/observability.md` for logs, metrics, traces, alerts, and notification channels
- `references/troubleshooting.md` for failure isolation, health checks, log locations, and common failure patterns
- `references/cli-and-resources.md` for PE-relevant `occ` commands and platform resource schemas
- `references/mcp-reference.md` for MCP tool usage: mapping platform workflows to MCP tools, initial platform setup order, platform resource schemas via MCP, and MCP-specific gotchas. Read this when operating through an MCP-connected AI agent instead of the CLI
- `references/gitops.md` for GitOps repository layout and release flow
- `references/community-modules.md` for pluggable gateways and observability backends
- `references/advanced-setup.md` for certificates, private Git, custom build flows, and identity-provider swaps
- `references/repo-and-context7.md` when the docs are not enough and you need controller logic, CRD definitions, or Helm chart details
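
The routing list above behaves like a lookup table. The Python sketch below is one hypothetical way to encode it. Only the file paths come from the list; the keywords and the `route_reference` helper are illustrative assumptions and are not part of OpenChoreo.

```python
import re
from typing import Dict, List

# File paths come from the routing list; the keywords are illustrative assumptions.
REFERENCE_ROUTES: Dict[str, List[str]] = {
    "references/operations.md": ["namespace", "topology", "multi-cluster", "upgrade"],
    "references/templates-and-workflows.md": ["componenttype", "trait", "workflow", "cel", "template"],
    "references/integrations.md": ["secret store", "registry", "identity", "rbac", "webhook", "api management"],
    "references/observability.md": ["log", "metric", "trace", "alert", "notification"],
    "references/troubleshooting.md": ["failure", "health check"],
    "references/cli-and-resources.md": ["occ", "schema"],
    "references/mcp-reference.md": ["mcp"],
    "references/gitops.md": ["gitops"],
    "references/community-modules.md": ["gateway", "observability backend"],
    "references/advanced-setup.md": ["certificate", "private git", "custom build", "identity provider"],
    "references/repo-and-context7.md": ["controller logic", "crd", "helm chart"],
}


def route_reference(task: str) -> List[str]:
    """Return the reference files whose keywords start a word in the task text."""
    text = task.lower()
    return [
        path
        for path, keywords in REFERENCE_ROUTES.items()
        if any(re.search(r"\b" + re.escape(keyword), text) for keyword in keywords)
    ]


print(route_reference("Plan the control plane upgrade"))
```

A task that matches nothing returns an empty list, which is a cue to inspect the live state further before loading any reference.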
Discovery-first workflow
1. Classify the task
Decide whether the work is:
- Pure platform work
- App work that needs PE help
- A mixed task that needs both OpenChoreo skills
For mixed tasks, keep the app-facing thread and the platform-facing thread connected. Many deployment failures are caused by an interaction between Component config and platform config.
2. Inspect the current state before planning
Start with the smallest useful inspection:
- Resource YAML for the object already involved
- `status.conditions`
- Relevant controller, gateway, or agent logs
- Current Helm release values when the issue might be installation- or upgrade-related
Do not assume a field exists because it appeared in an older example. Inspect the current CR, schema, or docs before patching. This matters especially for overrides, plane registration, workflow configuration, and trait parameters.
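
As a rough illustration of the `status.conditions` check, the Python sketch below lists the conditions on a resource that are not `True`. It assumes the resource was exported as JSON (for example with `kubectl get <kind> <name> -n <namespace> -o json`) and follows the common Kubernetes condition shape (`type`, `status`, `reason`, `message`). The sample object, including its kind, reason, and message, is made up for the example and may not match real OpenChoreo output.

```python
import json


def conditions_not_true(resource: dict) -> list:
    """Return (type, reason, message) for every condition whose status is not "True"."""
    conditions = resource.get("status", {}).get("conditions", [])
    return [
        (c.get("type"), c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("status") != "True"
    ]


# Hypothetical sample; real values must come from the live cluster.
sample = json.loads("""
{
  "kind": "Environment",
  "metadata": {"name": "dev"},
  "status": {
    "conditions": [
      {"type": "Ready", "status": "False", "reason": "DataPlaneNotFound",
       "message": "referenced data plane is missing"},
      {"type": "Reconciled", "status": "True", "reason": "Succeeded", "message": ""}
    ]
  }
}
""")

for cond_type, reason, message in conditions_not_true(sample):
    print(f"{cond_type}: {reason} ({message})")
```

Conditions with negative polarity (where `False` is the healthy value) need the opposite reading, so check what each condition type means on the current CRD before acting on this output.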
3. Route to the right source of detail
After the first inspection, load the matching reference file. If the reference still leaves ambiguity:
- Inspect the repository or generated CRDs
- Use Context7 for current OpenChoreo docs
- Check the live object shape on the cluster
Keep the investigation targeted. Avoid a full-cluster inventory unless the failure is clearly systemic or the affected resource is still unknown.
4. Change one layer at a time
Platform tasks often span multiple layers:
- Helm install values
- control plane namespace resources
- remote plane resources
- gateway or secret backend configuration
- app-visible outcomes such as available types, workflows, or routes
Change the layer that is actually responsible, then re-check the dependent layers. Do not "fix" an application symptom by guessing at platform internals.
5. Verify with live evidence
Verification should come from the platform, not assumption:
- Resource conditions changed as expected
- Controller or agent logs show the new state
- Helm release and pod rollout are healthy
- The downstream app-facing symptom is gone
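
One way to confirm that resource conditions changed as expected is to capture them before and after the change and compare the two snapshots. The Python sketch below does this for two condition lists in the common Kubernetes shape. The `changed_conditions` helper and the sample data are hypothetical.

```python
def changed_conditions(before: list, after: list) -> dict:
    """Map each condition type to (old_status, new_status) where the status differs."""
    old = {c["type"]: c["status"] for c in before}
    new = {c["type"]: c["status"] for c in after}
    return {
        cond_type: (old.get(cond_type), new.get(cond_type))
        for cond_type in sorted(set(old) | set(new))
        if old.get(cond_type) != new.get(cond_type)
    }


# Hypothetical snapshots taken before and after a platform change.
before = [{"type": "Ready", "status": "False"}, {"type": "Reconciled", "status": "True"}]
after = [{"type": "Ready", "status": "True"}, {"type": "Reconciled", "status": "True"}]

print(changed_conditions(before, after))  # {'Ready': ('False', 'True')}
```

An empty result after a change means the conditions did not move, which is worth investigating before assuming the fix took effect.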
If the platform change succeeded but the app still fails, hand off to or continue with `openchoreo-developer`.
Stable guardrails
Keep these in mind because they are durable and high-value:
- Platform work usually requires `kubectl` and often Helm; developer work usually centers on `occ`
- Upgrade order matters; do not move a remote plane ahead of the control plane
- Scope matters; cluster-scoped and namespace-scoped resources are not interchangeable
- `status.conditions`, live resource YAML, and current controller logs are better truth sources than memory
- When a task needs exact controller behavior or CRD fields, inspect the repo or Context7 instead of guessing
- Prefer reversible, inspectable changes over broad edits across many planes or namespaces
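
The upgrade-order guardrail can be expressed as a simple pre-flight check. The Python sketch below assumes plane versions are plain `MAJOR.MINOR.PATCH` strings, optionally prefixed with `v`. That format is an assumption, not something this guide specifies, and how the versions are obtained is left out. Both helper names are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Parse 'v1.2.3' or '1.2.3' into (1, 2, 3). Pre-release suffixes are not handled."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def remote_plane_ahead(control_plane: str, remote_plane: str) -> bool:
    """Return True if the remote plane version is newer than the control plane version."""
    return parse_version(remote_plane) > parse_version(control_plane)


print(remote_plane_ahead("0.4.0", "0.5.0"))      # True: upgrade the control plane first
print(remote_plane_ahead("v0.10.0", "v0.9.2"))   # False
```

Comparing integer tuples instead of raw strings avoids ordering `0.10.0` before `0.9.2`.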
Anti-patterns
- Loading every reference file before identifying the actual problem
- Repeating stale examples without checking the current cluster or resource schema
- Performing wide cluster sweeps before checking the affected object and logs
- Treating app-level deployment symptoms as purely platform issues without checking the app resource chain
- Making several platform changes at once and losing the causal signal