mirror-doctor

Mirror Pipeline Doctor

Diagnose and fix existing Mirror pipeline problems by running CLI commands, identifying root causes, and executing fixes.

Boundaries

  • Diagnose and fix EXISTING Mirror pipeline problems.
  • Do not build new pipelines: use /mirror for config reference or /turbo-builder for new Turbo pipelines.
  • Do not serve as a command reference: use /mirror for CLI syntax and flag lookups.
  • Do not handle Turbo pipelines: use /turbo-doctor for goldsky turbo problems.
  • Do not create secrets: use /secrets for credential management. However, DO check whether secrets exist as part of diagnosis.

Diagnostic Workflow

Follow these steps in order. Each step builds on the previous one.

Step 1: Verify Authentication

Run
goldsky project list 2>&1
to confirm the user is logged in.
  • If logged in: Note the project name and continue.
  • If not logged in: Direct the user to /auth-setup. Do not proceed until auth works.

Step 2: Identify the Pipeline

Run
goldsky pipeline list --include-runtime-details 2>&1
to list all Mirror pipelines with their status.
If the user already named a pipeline, confirm it exists in the list. If not, show the list and ask which pipeline they want to diagnose.
Note both the desired status (ACTIVE, INACTIVE, PAUSED) and the runtime status (STARTING, RUNNING, FAILING, TERMINATED) — Mirror pipelines have both, and the combination tells the story.

Step 3: Triage by Status

The desired + runtime status combination determines the diagnostic path:

| Desired | Runtime | Meaning | Action |
| --- | --- | --- | --- |
| ACTIVE | RUNNING | Healthy: pipeline is processing data | Ask user what symptom they're seeing. Proceed to Step 4. |
| ACTIVE | STARTING | Pipeline is initializing | Ask how long. If >10 min, proceed to Step 4. |
| ACTIVE | FAILING | Pipeline is encountering errors but hasn't terminated yet | Proceed to Step 4 immediately; this is time-sensitive. |
| ACTIVE | TERMINATED | Most common failure. Pipeline wanted to run but crashed. | Proceed to Step 4. |
| PAUSED | TERMINATED | User paused the pipeline (snapshot was taken). | Ask if they want to resume: goldsky pipeline start <name> --from-snapshot last |
| INACTIVE | TERMINATED | User stopped the pipeline (no snapshot). | Ask if they want to start: goldsky pipeline start <name> |

ACTIVE + TERMINATED is the most common case. The pipeline's desired status is ACTIVE (it should be running) but the runtime has terminated due to an error. Focus the diagnosis here.
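The triage table can also be read as a simple lookup. The sketch below is illustrative only: the function name triage and the short action strings are invented for this example, and returning an error for combinations the table does not list is an assumption, not documented behavior.

```bash
# triage DESIRED RUNTIME
# Prints the next action for a desired + runtime status pair.
# Hypothetical helper; the action strings paraphrase the triage table.
triage() {
  local desired="$1" runtime="$2"
  case "${desired}+${runtime}" in
    ACTIVE+RUNNING)      echo "ask what symptom the user sees, then Step 4" ;;
    ACTIVE+STARTING)     echo "ask how long; if >10 min, Step 4" ;;
    ACTIVE+FAILING)      echo "Step 4 immediately" ;;
    ACTIVE+TERMINATED)   echo "Step 4" ;;
    PAUSED+TERMINATED)   echo "offer: goldsky pipeline start <name> --from-snapshot last" ;;
    INACTIVE+TERMINATED) echo "offer: goldsky pipeline start <name>" ;;
    # Assumption: anything not in the table is reported as an error.
    *) echo "not in triage table: ${desired}+${runtime}" >&2; return 1 ;;
  esac
}
```

For example, `triage ACTIVE TERMINATED` prints `Step 4`.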

Step 4: Gather Diagnostic Data

Run these commands to understand what went wrong:

Get error details and runtime metrics:
goldsky pipeline monitor <name> 2>&1

Check for in-flight requests blocking operations:
goldsky pipeline monitor <name> --update-request 2>&1

Get the pipeline definition to check for misconfiguration:
goldsky pipeline get <name> --definition 2>&1

Get pipeline info including version:
goldsky pipeline info <name> 2>&1

Check available snapshots:
goldsky pipeline snapshots list <name> 2>&1

Run these in sequence and analyze the output before proceeding. The monitor output is the most important — it shows error messages, records received/written metrics, and runtime status transitions.
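As a convenience, the five commands can be generated for a specific pipeline name. This is a minimal sketch: diagnostic_commands is a hypothetical helper that only prints the commands listed in this step and does not run them, so each output can still be reviewed one at a time.

```bash
# diagnostic_commands NAME
# Prints the Step 4 commands, in order, with the pipeline name filled in.
# Hypothetical helper: it prints only and never invokes the CLI.
diagnostic_commands() {
  local name="$1"
  if [ -z "$name" ]; then
    echo "usage: diagnostic_commands <pipeline-name>" >&2
    return 1
  fi
  printf 'goldsky pipeline monitor %s 2>&1\n' "$name"
  printf 'goldsky pipeline monitor %s --update-request 2>&1\n' "$name"
  printf 'goldsky pipeline get %s --definition 2>&1\n' "$name"
  printf 'goldsky pipeline info %s 2>&1\n' "$name"
  printf 'goldsky pipeline snapshots list %s 2>&1\n' "$name"
}
```

Printing rather than executing keeps the sketch safe to paste, and it makes no assumption about whether a given command exits on its own.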

Step 5: Match Error Patterns

Based on the diagnostic data, match against these known patterns:

Bad or Missing Secret

Symptoms: Pipeline terminates shortly after starting. Monitor shows credential or authentication errors.
Verify: Run
goldsky secret list 2>&1
and cross-reference with the secret_name values in the pipeline definition from Step 4.
Fix:
  1. If the secret doesn't exist, direct the user to /secrets to create it.
  2. If the secret exists but credentials are wrong, create a new secret (secrets are immutable, so you create a replacement with the same name).
  3. Restart:
    goldsky pipeline restart <name> --from-snapshot last
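The cross-reference in the Verify step amounts to a set difference: names referenced by the pipeline minus names that exist. The sketch below assumes the two name lists have already been extracted by hand or by other means; missing_secrets is a hypothetical helper, and parsing the CLI output is deliberately left out because its exact format is not specified here.

```bash
# missing_secrets REFERENCED EXISTING
# Both arguments are newline-separated lists of secret names.
# Prints each referenced name that does not appear in EXISTING.
# Hypothetical helper; how the lists are obtained is up to the caller.
missing_secrets() {
  local referenced="$1" existing="$2" name
  printf '%s\n' "$referenced" | sort -u | while IFS= read -r name; do
    [ -n "$name" ] || continue
    # -F: fixed string, -x: match the whole line, -q: no output
    printf '%s\n' "$existing" | grep -Fxq -- "$name" || printf '%s\n' "$name"
  done
}
```

An empty result means every referenced secret exists, which points the diagnosis toward wrong credentials rather than a missing secret.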

Sink Unreachable

Symptoms: Connection timeout, connection refused, or network errors in the monitor output. Pipeline may cycle between FAILING and TERMINATED.
Common causes:
  • Firewall not allowing inbound from AWS us-west-2 (Mirror pipelines write from this region)
  • Database is down or restarted
  • Connection pool exhausted
  • Wrong port or host in the secret
Fix:
  1. Verify the sink is reachable from us-west-2.
  2. Check that the secret has the correct host, port, and credentials.
  3. Once connectivity is restored, restart:
    goldsky pipeline restart <name> --from-snapshot last

Resource Exhaustion

Symptoms: Pipeline runs for a while then terminates. Monitor may show high record counts or slow processing. Common during large backfills or pipelines with many sources/JOINs.
Fix:
  1. Resize:
    goldsky pipeline resize <name> <size>
    Sizes are s, m, l, xl, xxl.
  2. Start small and go up. s handles most workloads (up to 300K records/sec, ~8 subgraph sources). Use l
    or larger for big chain backfills or heavy JOINs.
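"Start small and go up" can be sketched as a walk along the size list. The helper below is hypothetical, and stepping up exactly one size at a time is an assumption; jumping straight to l for a large backfill is equally consistent with the guidance above.

```bash
# next_size CURRENT
# Prints the next larger size in the order s, m, l, xl, xxl.
# Hypothetical helper; one-step increments are an assumption.
next_size() {
  case "$1" in
    s)   echo m ;;
    m)   echo l ;;
    l)   echo xl ;;
    xl)  echo xxl ;;
    xxl) echo "already at the largest size" >&2; return 1 ;;
    *)   echo "unknown size: $1" >&2; return 1 ;;
  esac
}
```

For example, `next_size s` prints `m`, which could then be passed to goldsky pipeline resize <name> <size>.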

In-Flight Request Blocking

Symptoms: User tries to update, delete, or restart the pipeline but gets "Cannot process request, found existing request in-flight."
Diagnose:
goldsky pipeline monitor <name> --update-request
— this shows what operation is in progress (usually a snapshot).
Fix:
  1. If the in-flight operation is a snapshot that's making progress, wait for it.
  2. If it's stuck or unwanted:
    goldsky pipeline cancel-update <name>
  3. Then retry the original operation.

Stuck Snapshot

Symptoms: Pipeline can't be paused, updated, or restarted because a snapshot creation is taking too long or failing. The --update-request monitor shows snapshot progress stuck at a percentage.
Fix:
  1. Cancel the stuck snapshot:
    goldsky pipeline cancel-update <name>
  2. Restart without waiting for a new snapshot:
    goldsky pipeline restart <name> --from-snapshot last
  3. If there's no usable snapshot:
    goldsky pipeline restart <name> --from-snapshot none
    (starts from scratch — warn the user this reprocesses data)

Transform SQL Error

Symptoms: Pipeline terminates with SQL-related error messages. Causes include syntax errors, references to non-existent columns, or type mismatches.
Diagnose: Check the pipeline definition (goldsky pipeline get <name> --definition) and look at the transforms section.
Fix:
  1. Identify the SQL error from the monitor output.
  2. Fix the SQL in the pipeline YAML file.
  3. Validate:
    goldsky pipeline validate <file.yaml>
  4. Reapply:
    goldsky pipeline apply <file.yaml> --status ACTIVE --from-snapshot last
Use /mirror for SQL transform syntax reference if needed.

Pipeline in Restart Loop

Symptoms: Pipeline repeatedly cycles through STARTING → FAILING → TERMINATED. Monitor shows the same error recurring.
This is usually a symptom of another root cause — bad secret, sink unreachable, or resource issues. The pipeline keeps trying to start but hits the same wall.
Fix:
  1. Identify the underlying error from the monitor (it's usually one of the patterns above).
  2. Fix the root cause first.
  3. Then restart:
    goldsky pipeline restart <name> --from-snapshot last

Sink Downtime Cascade

Symptoms: Pipeline was running fine, then the sink (database) went down temporarily. Pipeline auto-retried, then restarted its writers, then eventually terminated.
This is expected behavior — Mirror handles transient sink errors automatically (retry batch → restart writers → fail after prolonged issues).
Fix:
  1. Confirm the sink is back up and healthy.
  2. Restart from the last snapshot:
    goldsky pipeline restart <name> --from-snapshot last
  3. The pipeline will resume from where it left off, not reprocess everything.
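The pattern matching in this step can be roughly approximated with keyword checks on an error message. This is a sketch under loud assumptions: apart from the in-flight message quoted earlier, the keywords below are guesses at what monitor output might contain, and classify_error and its labels are hypothetical. Treat the result as a hint for which pattern to read first, not as a diagnosis.

```bash
# classify_error MESSAGE
# Prints a rough pattern label for an error message.
# Hypothetical helper; only the in-flight phrase comes from this document,
# every other keyword is an assumption about typical error wording.
classify_error() {
  local msg
  msg="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$msg" in
    *"found existing request in-flight"*)             echo "in-flight-request" ;;
    *"connection refused"*|*"timeout"*|*"timed out"*) echo "sink-unreachable" ;;
    *credential*|*authentication*|*secret*)           echo "bad-or-missing-secret" ;;
    *sql*|*column*|*"type mismatch"*)                 echo "transform-sql-error" ;;
    *)                                                echo "unknown" ;;
  esac
}
```

Patterns such as a restart loop or a sink downtime cascade depend on status history rather than a single message, so they are intentionally not covered here.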

Step 6: Present Diagnosis

After identifying the issue, present findings clearly:

Diagnosis

Pipeline: <name>
Status: <desired> + <runtime>
Issue: <one-line summary>
Root cause: <What's wrong and why>
Evidence:
  • <Error message or observation from monitor>
  • <Relevant detail from pipeline definition>
Recommended fix:
  1. <Step 1>
  2. <Step 2>
Prevention: <How to avoid this in the future, if applicable>

Step 7: Execute Fix

Offer to run the fix commands directly. Always confirm with the user before executing:
  • Restart:
    goldsky pipeline restart <name> --from-snapshot last
  • Resize:
    goldsky pipeline resize <name> <size>
  • Cancel blocked operation:
    goldsky pipeline cancel-update <name>
  • Restart from scratch:
    goldsky pipeline restart <name> --from-snapshot none
    (warn: reprocesses data)
  • Reapply config:
    goldsky pipeline apply <file.yaml> --status ACTIVE --from-snapshot last
  • Delete and recreate:
    goldsky pipeline delete <name> -f
    then
    goldsky pipeline apply <file.yaml> --status ACTIVE
    (last resort)
After executing, verify recovery by running
goldsky pipeline monitor <name>
and watching for STARTING → RUNNING transition.
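The "confirm before executing" rule can be sketched as two small helpers: one that builds the fix command as text and one that asks for a yes or no. Both fix_command and confirm are hypothetical names, the short fix labels are invented for this example, and the delete-and-recreate path is left out on purpose because it is a last resort.

```bash
# fix_command FIX TARGET [SIZE]
# Prints the CLI command for a fix. TARGET is the pipeline name, or the
# YAML file path when FIX is "reapply". Hypothetical helper; prints only.
fix_command() {
  local fix="$1" target="$2" size="$3"
  case "$fix" in
    restart)      echo "goldsky pipeline restart $target --from-snapshot last" ;;
    resize)       echo "goldsky pipeline resize $target $size" ;;
    cancel)       echo "goldsky pipeline cancel-update $target" ;;
    from-scratch) echo "goldsky pipeline restart $target --from-snapshot none" ;;
    reapply)      echo "goldsky pipeline apply $target --status ACTIVE --from-snapshot last" ;;
    *)            echo "unknown fix: $fix" >&2; return 1 ;;
  esac
}

# confirm PROMPT
# Succeeds only when the user answers y or Y on standard input.
confirm() {
  local reply
  printf '%s [y/N] ' "$1" >&2
  read -r reply
  [ "$reply" = "y" ] || [ "$reply" = "Y" ]
}
```

A typical flow would print the command with fix_command, show it to the user through confirm, and run it only after a yes.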

Important Rules

  • Always gather data before diagnosing. Never guess at the problem.
  • Check both desired AND runtime status; the combination matters.
  • Confirm with the user before running any destructive commands (delete, restart from scratch).
  • --from-snapshot last preserves progress. --from-snapshot none starts over. Default to last unless there's a reason not to.
  • Transient errors are auto-retried for up to 6 hours. Non-transient errors terminate the pipeline immediately. If the pipeline terminated quickly after starting, it's likely a config issue (bad secret, wrong SQL), not a transient network blip.
  • If the problem is beyond CLI diagnosis, suggest contacting support@goldsky.com with the pipeline name, error messages, and project ID.

When Bash is Not Available

If you don't have the Bash tool, output the diagnostic commands for the user to run, but structure them clearly:
  1. Give one command at a time.
  2. Explain what to look for in the output.
  3. Based on their description of the output, proceed with the diagnosis.
This is the fallback path — always prefer running commands directly when Bash is available.

Related

  • /mirror: Pipeline YAML configuration, CLI flag reference, sink setup
  • /secrets: Create and manage sink credentials
  • /auth-setup: CLI installation and authentication
  • /turbo-doctor: Diagnose Turbo pipeline problems (not Mirror)