debug-production-renders

Debug Production Renders

Telecine renders flow through a queue-based pipeline backed by Valkey (Redis-compatible). Each render is a workflow that progresses through multiple queues, each served by a dedicated Cloud Run worker service. Debugging a render means checking its state at three layers: the Postgres database, the Valkey queue state, and the Cloud Run worker logs.

Quick Start

All debug scripts run inside Docker. From the monorepo root:

Full render status: DB row, fragment breakdown, error detail
telecine/scripts/debug-render <render-id>

Add live Valkey state (queued/claimed/completed/failed jobs)
telecine/scripts/debug-render <render-id> --redis

Add docker compose log grep (local dev only)
telecine/scripts/debug-render <render-id> --logs

All three
telecine/scripts/debug-render <render-id> --redis --logs

Render Pipeline Flow

A render progresses through queues in this order:
  1. process-html-initializer -- Preprocesses raw HTML from the API submission.
  2. process-html-finalizer -- Sets workflow data, then enqueues the render-initializer job. This is the bridge from the HTML pipeline into the render pipeline.
  3. render-initializer -- Checks out render source, extracts render info (dimensions, duration, fps) via Electron RPC, creates an assets metadata bundle, then fans out N fragment jobs. For still images (png/jpeg/webp), renders the image directly and skips fragments.
  4. render-fragment (N parallel) -- Each job renders one time slice of the video via Electron RPC, writes the fragment file to storage, and reports progress.
  5. render-finalizer -- Auto-triggered when all fragment jobs complete. Merges fragment files into the final output. Marks the render as complete.
The pipeline is triggered by a Hasura event on `INSERT` into `video2.renders`. The workflow system automatically routes the finalizer queue job when all child jobs complete or when any job fails with exhausted retries.
Queue names, worker service names, and resource allocations are defined in source files -- see "Key Source Files" below.
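The stage order above can be written down as data, which helps when deciding which worker to inspect next. This is a minimal sketch, not code from the telecine repository: `stageAfter` and `expectsFragments` are hypothetical helper names, and only the queue names and the still-image formats come from the list above.

```typescript
// Queue names in pipeline order, as listed above.
const PIPELINE_STAGES = [
  "process-html-initializer",
  "process-html-finalizer",
  "render-initializer",
  "render-fragment",
  "render-finalizer",
] as const;

type Stage = (typeof PIPELINE_STAGES)[number];

// Hypothetical helper: the stage that follows `current`, or null for the last stage.
function stageAfter(current: Stage): Stage | null {
  const next = PIPELINE_STAGES[PIPELINE_STAGES.indexOf(current) + 1];
  return next === undefined ? null : next;
}

// Still images are rendered directly by render-initializer, so no fragment jobs exist for them.
const STILL_IMAGE_FORMATS = ["png", "jpeg", "webp"];

// Hypothetical helper: should a render with this output format have fragment jobs?
function expectsFragments(outputFormat: string): boolean {
  return STILL_IMAGE_FORMATS.indexOf(outputFormat) === -1;
}
```

If a render shows activity in `process-html-finalizer` but nothing later, `stageAfter("process-html-finalizer")` points at `render-initializer` as the next worker to check.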

Debug Scripts

| Script | Purpose |
| --- | --- |
| `telecine/scripts/debug-render <id> [--redis] [--logs]` | Primary debug tool: DB state, fragments, errors, optional Valkey + logs |
| `telecine/scripts/inspect-render.ts` | Lower-level Valkey inspection: workflow data, claimed jobs with ages, all workflow keys |
| `telecine/scripts/check-queue.ts` | Queue-level Valkey state: queued/claimed/failed counts, org membership |
| `telecine/scripts/restart-render.ts` | Restart a failed render: resets DB status, re-enqueues initializer job |
| `telecine/scripts/create-render.ts` | Create a test render from an existing render's org context |
| `telecine/scripts/render-logs [-f] <id>` | Grep docker compose logs for initializer/fragment/finalizer services |
| `worktree smoke [branch]` (runs `telecine/scripts/smoke-test.ts`) | End-to-end smoke tests via the public API |
| `telecine/scripts/smoke-test-waveform.ts` | Waveform-specific smoke test: inserts renders directly into DB, exercises all ef-waveform modes concurrently |
| `telecine/scripts/console` | Node REPL with project imports (db, valkey, queues available) |

Run `.ts` scripts via:
telecine/scripts/run tsx scripts/<script>.ts <args>

Querying Cloud Run Logs

There are no dedicated log-querying scripts. Use `gcloud` directly. Workers log structured JSON via pino, so use `jsonPayload` filters:

All render workers for a specific render ID
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name=~"telecine-worker-render" AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100 --format=json

Specific worker stage
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="telecine-worker-render-initializer" AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100

Errors only across all workers
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name=~"telecine-worker" AND severity>=ERROR AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=50

Cloud Run service names follow the pattern `telecine-worker-{queue-name}`. See `telecine/deploy/resources/queues/configs.ts` for the full list of queue names.
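The three filters above differ only in the service match and the optional severity clause, so they can be assembled from parts. The following is a sketch under that assumption; `LogQuery` and `buildLogFilter` are hypothetical names, not an existing telecine helper. The result is the filter string you would pass as the quoted argument to `gcloud logging read`.

```typescript
interface LogQuery {
  renderId: string;
  // Exact Cloud Run service name, e.g. "telecine-worker-render-initializer".
  serviceName?: string;
  // Regex used with =~ when serviceName is omitted; defaults to all telecine workers.
  servicePattern?: string;
  // Minimum severity, e.g. "ERROR"; omitted means no severity clause.
  minSeverity?: "INFO" | "WARNING" | "ERROR";
}

// Hypothetical helper: joins the clauses in the same order as the commands above.
function buildLogFilter(query: LogQuery): string {
  const clauses: string[] = ['resource.type="cloud_run_revision"'];
  if (query.serviceName) {
    clauses.push(`resource.labels.service_name="${query.serviceName}"`);
  } else {
    const pattern = query.servicePattern ?? "telecine-worker";
    clauses.push(`resource.labels.service_name=~"${pattern}"`);
  }
  if (query.minSeverity) {
    clauses.push(`severity>=${query.minSeverity}`);
  }
  clauses.push(`jsonPayload.renderId="${query.renderId}"`);
  return clauses.join(" AND ");
}
```

For example, `buildLogFilter({ renderId: "<RENDER-ID>", minSeverity: "ERROR" })` reproduces the "errors only across all workers" filter.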

Valkey Key Schema

Queues and workflows use predictable Valkey key patterns. Understanding these lets you query state directly when scripts don't cover your case.
Queue-level keys (per queue name):
  • `queues:{queueName}:queued` -- zset of job keys waiting to be claimed
  • `queues:{queueName}:claimed` -- zset of job keys being processed (score = claim timestamp)
  • `queues:{queueName}:completed` -- zset of completed job keys
  • `queues:{queueName}:failed` -- zset of failed job keys
  • `queues:{queueName}:jobs:{jobId}` -- serialized job data (SuperJSON)
  • `queues:{queueName}:orgs` -- zset of org keys with active jobs
Workflow-level keys (per render ID):
  • `workflows:{renderId}:queued` -- zset of workflow-level queued jobs
  • `workflows:{renderId}:claimed` -- zset of claimed jobs
  • `workflows:{renderId}:completed` -- zset of completed jobs
  • `workflows:{renderId}:failed` -- zset of failed jobs
  • `workflows:{renderId}:data` -- SuperJSON workflow payload (render config)
  • `workflows:{renderId}:status` -- workflow status string
Progress tracking:
  • `render:{renderId}` -- Redis stream for fragment completion progress
Stalled jobs are detected by checking claimed jobs whose score (claim timestamp) is older than 10 seconds.
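The key patterns and the stall rule can be sketched as small pure helpers. This is an illustration only: the function names are hypothetical, and the sketch assumes the claim-timestamp score is in epoch milliseconds, which the text above does not state. The real key construction and stall detection live in `Queue.ts`, `Workflow.ts`, and `Job.ts`.

```typescript
type JobSet = "queued" | "claimed" | "completed" | "failed";

// Builders mirroring the key patterns listed above.
function queueSetKey(queueName: string, set: JobSet | "orgs"): string {
  return `queues:${queueName}:${set}`;
}

function queueJobKey(queueName: string, jobId: string): string {
  return `queues:${queueName}:jobs:${jobId}`;
}

function workflowKey(renderId: string, part: JobSet | "data" | "status"): string {
  return `workflows:${renderId}:${part}`;
}

// 10-second threshold from the text; the millisecond unit is an assumption.
const STALL_THRESHOLD_MS = 10 * 1000;

// One member of a claimed zset together with its score.
interface ClaimedJob {
  jobKey: string;
  claimedAtMs: number;
}

// Hypothetical helper: claimed jobs whose claim timestamp is older than the threshold.
function findStalledJobs(claimed: ClaimedJob[], nowMs: number): ClaimedJob[] {
  return claimed.filter((job) => nowMs - job.claimedAtMs > STALL_THRESHOLD_MS);
}
```

Feeding `findStalledJobs` the members and scores read from `queueSetKey("render-fragment", "claimed")` gives the same kind of answer as the claimed-job ages reported by `inspect-render.ts`.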

Database Tables

Render state is persisted in Postgres via Kysely (not Prisma). The `db` client is imported from `@/sql-client.server`.
  • `video2.renders` -- Main render record: status, html, org_id, dimensions, fps, duration_ms, failure_detail, timestamps
  • `video2.render_fragments` -- Per-segment fragment records: render_id, segment_id, attempt_number, timestamps, last_error
  • `video2.process_html` -- HTML processing records
  • `video2.files` -- File records (used by process-isobmff and ingest-image)
Render status values: `created` -> `queued` -> `rendering` -> `complete` | `failed`
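When scripting against `video2.renders`, it can help to pin the status values down as a type. The sketch below is illustrative only: `isRenderStatus` and `isInFlight` are hypothetical names, and the sketch assumes the five values above are the complete set.

```typescript
// Status values in the order given above.
const RENDER_STATUSES = ["created", "queued", "rendering", "complete", "failed"] as const;

type RenderStatus = (typeof RENDER_STATUSES)[number];

// Hypothetical guard for a status string read from the database.
function isRenderStatus(value: string): value is RenderStatus {
  return (RENDER_STATUSES as readonly string[]).indexOf(value) !== -1;
}

// "complete" and "failed" end the progression; any other status means the
// render should still have work somewhere in the pipeline.
function isInFlight(status: RenderStatus): boolean {
  return status !== "complete" && status !== "failed";
}
```

A render that is `isInFlight` but shows no queued or claimed jobs in Valkey is a candidate for the "stuck" case described in the debugging workflow.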

Debugging Workflow

  1. Start with `debug-render` -- get the DB status, error detail, and fragment breakdown.
  2. If stuck in "rendering" -- add `--redis` to see if jobs are queued, claimed (possibly stalled), or silently failed in Valkey.
  3. If Valkey state is unclear -- use `inspect-render.ts` for detailed claimed job ages and `check-queue.ts` for queue-level counts.
  4. If you need worker logs -- query Cloud Run logs with `gcloud` using the render ID (see commands above). In local dev, use the `--logs` flag or the `render-logs` script.
  5. To retry -- use `restart-render.ts` to reset DB state and re-enqueue the initializer.
  6. For production DB access -- use `telecine/scripts/debug-prod-web --use-prod-db --shell` to get a container with production database connectivity.

Diagnosing Electron RPC Failures

Fragment render failures surface as RPC timeouts. The stack trace tells you how far rendering got:
  • `RPC.ts:182` -- initial 5 s timer fired, no keepalives received at all. Electron never started rendering. Causes: scheduler opened more connections than the single local container can handle (see "Scheduler over-scaling in local dev" below), Electron failed to load the page, or the render context couldn't be created.
  • `RPC.ts:153` -- keepalive-reset timer fired mid-render. At least one frame rendered before the hang. Causes: a race condition aborted an in-flight fetch (the `AbortError` case), a frame took too long, or Electron crashed mid-segment.
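The two cases can be told apart mechanically from a stack trace string. This is a sketch only: `classifyRpcTimeout` is a hypothetical name, and the line numbers are the ones quoted above, which will drift whenever `RPC.ts` changes, so treat a match as a hint rather than proof.

```typescript
type RpcTimeoutKind = "never-started" | "hung-mid-render" | "unknown";

// Hypothetical helper: map a stack trace to the failure mode described above.
// The (?!\d) lookahead stops "RPC.ts:182" from matching a longer number such as "RPC.ts:1820".
function classifyRpcTimeout(stack: string): RpcTimeoutKind {
  if (/RPC\.ts:182(?!\d)/.test(stack)) {
    return "never-started"; // initial 5 s timer fired, no keepalives received
  }
  if (/RPC\.ts:153(?!\d)/.test(stack)) {
    return "hung-mid-render"; // keepalive-reset timer fired after at least one frame
  }
  return "unknown";
}
```

A `never-started` result points toward scheduler over-scaling or page-load problems. A `hung-mid-render` result points toward the `AbortError` race, a slow frame, or an Electron crash.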

AbortError / FrameController race

If Electron logs show `[EF_FRAMEGEN.beginFrame] error: [object DOMException]` / `AbortError: The user aborted a request`, a fetch started during `seekForRender` was cancelled by an autonomous re-render firing concurrently. This is a timing-dependent race: `EFTemporal.updated()` or `EFTimegroup.updated()` fires when media loads and calls `FrameController.abort()`, killing the in-flight GCS fetch.
Fix: set `data-no-playback-controller` on the timegroup before `seekForRender` to suppress autonomous re-renders -- the same attribute used on render clones. Check `EF_FRAMEGEN.ts` `initialize()` and `EFTemporal.ts` / `EFTimegroup.ts` `updated()`.
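The ordering in the fix, attribute first and seek second, can be sketched as a small wrapper. Several things here are assumptions: `seekWithRerendersSuppressed` is a hypothetical name, the seek is passed in as a callback because the real `seekForRender` signature is not shown here, and the attribute is set with an empty value on the assumption that only its presence is checked.

```typescript
// Minimal shape of a DOM element needed for this sketch; a real timegroup element satisfies it.
interface AttributeTarget {
  setAttribute(name: string, value: string): void;
}

const NO_PLAYBACK_CONTROLLER_ATTR = "data-no-playback-controller";

// Hypothetical wrapper: mark the timegroup before seeking so that no autonomous
// re-render can abort fetches started by the seek.
function seekWithRerendersSuppressed<T>(timegroup: AttributeTarget, seek: () => T): T {
  timegroup.setAttribute(NO_PLAYBACK_CONTROLLER_ATTR, "");
  return seek();
}
```

The point of the wrapper is only the order of operations. Where the attribute should actually be set is a decision for `EF_FRAMEGEN.ts` `initialize()`.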

Scheduler over-scaling in local dev

In production, `MAX_WORKER_COUNT` controls how many Cloud Run instances the scheduler spins up. Locally there is one container per queue. If the scheduler opens more WebSocket connections than the single container expects (e.g. 30 connections for 30 queued jobs), every concurrent `renderFragment` RPC call beyond the worker's `WORKER_CONCURRENCY` quota times out at `RPC.ts:182` before Electron starts processing it.
The two dials and how they interact:

| Dial | Production | Local dev |
| --- | --- | --- |
| `MAX_WORKER_COUNT` | scales container count | must match `scale:` in docker-compose (usually 1) |
| `WORKER_CONCURRENCY` | jobs per container | effective parallelism when `MAX_WORKER_COUNT=1` |

Both are read from `telecine/.env` (via `env_file` in worker containers and `${VAR:-default}` substitution in `scheduler-go/docker-compose.yaml`). Changing them requires `telecine/scripts/docker-compose up -d` (not just `restart`) to force recreation with the new env.
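The interaction of the two dials comes down to simple arithmetic. The sketch below is a simplified model, not the scheduler's actual logic: it assumes capacity is container count times `WORKER_CONCURRENCY`, and the helper names and the concurrency value of 4 in the usage note are made up for illustration.

```typescript
interface WorkerScaling {
  containerCount: number; // actual running containers (usually 1 in local dev)
  workerConcurrency: number; // WORKER_CONCURRENCY: jobs each container processes at once
}

// Assumed model: total RPC calls that can be in progress at the same time.
function renderCapacity(scaling: WorkerScaling): number {
  return scaling.containerCount * scaling.workerConcurrency;
}

// Calls beyond capacity are the ones expected to time out before Electron starts on them.
function overflowingCalls(openConnections: number, scaling: WorkerScaling): number {
  return Math.max(0, openConnections - renderCapacity(scaling));
}
```

With 30 connections against one local container and a hypothetical `WORKER_CONCURRENCY` of 4, `overflowingCalls(30, { containerCount: 1, workerConcurrency: 4 })` is 26, so most fragment jobs would hit the `RPC.ts:182` timeout.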

Key Source Files

  • `telecine/scripts/debug-render.ts` -- Primary debug tool implementation
  • `telecine/scripts/inspect-render.ts` -- Low-level Valkey render inspection
  • `telecine/scripts/check-queue.ts` -- Queue-level Valkey state checker
  • `telecine/scripts/restart-render.ts` -- Render restart/retry tool
  • `telecine/lib/queues/Queue.ts` -- Queue base class, key patterns
  • `telecine/lib/queues/Workflow.ts` -- Workflow base class, workflow key patterns
  • `telecine/lib/queues/Job.ts` -- Job serialization, enqueue, stall detection
  • `telecine/lib/queues/units-of-work/Render/` -- Render pipeline queue/worker definitions
  • `telecine/lib/queues/units-of-work/ProcessHtml/` -- HTML pipeline queue/worker definitions
  • `telecine/deploy/resources/queues/configs.ts` -- Production queue names and scaling config
  • `telecine/deploy/resources/queues/defineWorker.ts` -- Cloud Run service definition template
  • `telecine/deploy/resources/constants.ts` -- GCP project/region constants
  • `telecine/lib/valkey/valkey.ts` -- Valkey connection setup

When to Use This Skill

Use this skill when:
  • A production render is stuck, failed, or producing unexpected results
  • You need to trace a render through the pipeline to find where it stalled
  • You need to query Cloud Run logs for a specific render
  • You need to inspect or manipulate Valkey queue state
  • You need to restart a failed render
  • You need to understand the render pipeline architecture