debug-production-renders


Debug Production Renders

Telecine renders flow through a queue-based pipeline backed by Valkey (Redis-compatible). Each render is a workflow that progresses through multiple queues, each served by a dedicated Cloud Run worker service. Debugging a render means checking its state at three layers: the Postgres database, the Valkey queue state, and the Cloud Run worker logs.

Quick Start

All debug scripts run inside docker. From the monorepo root:

Full render status: DB row, fragment breakdown, error detail

`telecine/scripts/debug-render <render-id>`

Add live Valkey state (queued/claimed/completed/failed jobs)

`telecine/scripts/debug-render <render-id> --redis`

Add docker compose log grep (local dev only)

`telecine/scripts/debug-render <render-id> --logs`

All three

`telecine/scripts/debug-render <render-id> --redis --logs`


Render Pipeline Flow

A render progresses through queues in this order:

process-html-initializer -- Preprocesses raw HTML from the API submission.
process-html-finalizer -- Sets workflow data, then enqueues the render-initializer job. This is the bridge from the HTML pipeline into the render pipeline.
render-initializer -- Checks out render source, extracts render info (dimensions, duration, fps) via Electron RPC, creates an assets metadata bundle, then fans out N fragment jobs. For still images (png/jpeg/webp), renders the image directly and skips fragments.
render-fragment (N parallel) -- Each job renders one time slice of the video via Electron RPC, writes the fragment file to storage, and reports progress.
render-finalizer -- Auto-triggered when all fragment jobs complete. Merges fragment files into the final output. Marks the render as complete.

The pipeline is triggered by a Hasura event on `INSERT` into `video2.renders`. The workflow system automatically routes the finalizer queue job when all child jobs complete or when any job fails with exhausted retries.

Queue names, worker service names, and resource allocations are defined in source files -- see "Key Source Files" below.
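
When tracing a render, it can help to know which stages still lie ahead of the one you last saw activity in. The sketch below only encodes the stage order listed above; `PIPELINE_STAGES` and `remainingStages` are hypothetical names for illustration and are not taken from the telecine source.

```typescript
// Queue names in pipeline order, copied from the list above.
const PIPELINE_STAGES = [
  "process-html-initializer",
  "process-html-finalizer",
  "render-initializer",
  "render-fragment",
  "render-finalizer",
] as const;

type PipelineStage = (typeof PIPELINE_STAGES)[number];

// Hypothetical helper: the stages that run after `current`, in order.
function remainingStages(current: PipelineStage): PipelineStage[] {
  return PIPELINE_STAGES.slice(PIPELINE_STAGES.indexOf(current) + 1);
}
```

For example, if the last log lines for a render come from `render-initializer`, `remainingStages("render-initializer")` points you at the fragment and finalizer workers next.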

Debug Scripts

| Script | Purpose |
| --- | --- |
| `telecine/scripts/debug-render <id> [--redis] [--logs]` | Primary debug tool: DB state, fragments, errors, optional Valkey + logs |
| `telecine/scripts/inspect-render.ts` | Lower-level Valkey inspection: workflow data, claimed jobs with ages, all workflow keys |
| `telecine/scripts/check-queue.ts` | Queue-level Valkey state: queued/claimed/failed counts, org membership |
| `telecine/scripts/restart-render.ts` | Restart a failed render: resets DB status, re-enqueues initializer job |
| `telecine/scripts/create-render.ts` | Create a test render from an existing render's org context |
| `telecine/scripts/render-logs [-f] <id>` | Grep docker compose logs for initializer/fragment/finalizer services |
| `worktree smoke [branch]` (runs `telecine/scripts/smoke-test.ts`) | End-to-end smoke tests via the public API |
| `telecine/scripts/smoke-test-waveform.ts` | Waveform-specific smoke test: inserts renders directly into DB, exercises all ef-waveform modes concurrently |
| `telecine/scripts/console` | Node REPL with project imports (db, valkey, queues available) |

Run `.ts` scripts via:

`telecine/scripts/run tsx scripts/<script>.ts <args>`

Querying Cloud Run Logs

There are no dedicated log-querying scripts. Use `gcloud` directly. Workers log structured JSON via pino, so use `jsonPayload` filters:

All render workers for a specific render ID

gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name=~"telecine-worker-render" AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100 --format=json

Specific worker stage

gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="telecine-worker-render-initializer" AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100

Errors only across all workers

gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name=~"telecine-worker" AND severity>=ERROR AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=50


Cloud Run service names follow the pattern `telecine-worker-{queue-name}`. See `telecine/deploy/resources/queues/configs.ts` for the full list of queue names.
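
If you build these filters often, a small helper can reduce quoting mistakes. This is a minimal sketch that only reassembles the filter clauses shown in the commands above; `workerServiceName` and `renderLogFilter` are hypothetical names, not existing telecine scripts, and the result is meant to be pasted as the filter argument of `gcloud logging read`.

```typescript
// Service names follow the documented pattern telecine-worker-{queue-name}.
function workerServiceName(queueName: string): string {
  return `telecine-worker-${queueName}`;
}

// Hypothetical helper: builds the Cloud Logging filter string for one render.
// Omitting queueName matches every telecine worker via the =~ regex operator.
function renderLogFilter(
  renderId: string,
  queueName?: string,
  errorsOnly = false,
): string {
  const service = queueName
    ? `resource.labels.service_name="${workerServiceName(queueName)}"`
    : `resource.labels.service_name=~"telecine-worker"`;
  const clauses = ['resource.type="cloud_run_revision"', service];
  if (errorsOnly) clauses.push("severity>=ERROR");
  clauses.push(`jsonPayload.renderId="${renderId}"`);
  return clauses.join(" AND ");
}
```

Calling `renderLogFilter("<RENDER-ID>", "render-initializer")` reproduces the "specific worker stage" filter, and `renderLogFilter("<RENDER-ID>", undefined, true)` reproduces the errors-only filter.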

Valkey Key Schema

Queues and workflows use predictable Valkey key patterns. Understanding these lets you query state directly when scripts don't cover your case.

Queue-level keys (per queue name):

`queues:{queueName}:queued` -- zset of job keys waiting to be claimed
`queues:{queueName}:claimed` -- zset of job keys being processed (score = claim timestamp)
`queues:{queueName}:completed` -- zset of completed job keys
`queues:{queueName}:failed` -- zset of failed job keys
`queues:{queueName}:jobs:{jobId}` -- serialized job data (SuperJSON)
`queues:{queueName}:orgs` -- zset of org keys with active jobs

Workflow-level keys (per render ID):

`workflows:{renderId}:queued` -- zset of workflow-level queued jobs
`workflows:{renderId}:claimed` -- zset of claimed jobs
`workflows:{renderId}:completed` -- zset of completed jobs
`workflows:{renderId}:failed` -- zset of failed jobs
`workflows:{renderId}:data` -- SuperJSON workflow payload (render config)
`workflows:{renderId}:status` -- workflow status string

Progress tracking:

`render:{renderId}` -- Redis stream for fragment completion progress

Stalled jobs are detected by checking claimed jobs whose score (claim timestamp) is older than 10 seconds.
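
The key patterns and the stall rule can be sketched as plain functions. This is an illustration under assumptions, not the logic in `Job.ts`: it assumes the claimed-zset score is an epoch timestamp in milliseconds, and `queueKeys`, `ClaimedEntry`, and `findStalled` are hypothetical names. The entries would come from reading the claimed zset with scores (for example via `ZRANGE ... WITHSCORES`).

```typescript
// "older than 10 seconds", taken from the text above.
const STALL_THRESHOLD_MS = 10000;

// Builds the queue-level key names listed above for one queue.
function queueKeys(queueName: string) {
  const base = `queues:${queueName}`;
  return {
    queued: `${base}:queued`,
    claimed: `${base}:claimed`,
    completed: `${base}:completed`,
    failed: `${base}:failed`,
    orgs: `${base}:orgs`,
    job: (jobId: string) => `${base}:jobs:${jobId}`,
  };
}

// One member of a claimed zset plus its score.
// Assumption: the score is the claim time in epoch milliseconds.
interface ClaimedEntry {
  jobKey: string;
  claimedAt: number;
}

// Returns the claimed entries whose claim is older than the threshold.
function findStalled(claimed: ClaimedEntry[], now: number): ClaimedEntry[] {
  return claimed.filter((entry) => now - entry.claimedAt > STALL_THRESHOLD_MS);
}
```

If the scores turn out to be in seconds rather than milliseconds, the threshold constant needs to change accordingly.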

Database Tables

Render state is persisted in Postgres via Kysely (not Prisma). The `db` client is imported from `@/sql-client.server`.

`video2.renders` -- Main render record: status, html, org_id, dimensions, fps, duration_ms, failure_detail, timestamps
`video2.render_fragments` -- Per-segment fragment records: render_id, segment_id, attempt_number, timestamps, last_error
`video2.process_html` -- HTML processing records
`video2.files` -- File records (used by process-isobmff and ingest-image)

Render status values: `created`, `queued`, `rendering`, `complete`, `failed`

Debugging Workflow

Start with `debug-render` -- get the DB status, error detail, and fragment breakdown.
If stuck in "rendering" -- add `--redis` to see if jobs are queued, claimed (possibly stalled), or silently failed in Valkey.
If Valkey state is unclear -- use `inspect-render.ts` for detailed claimed job ages and `check-queue.ts` for queue-level counts.
If you need worker logs -- query Cloud Run logs with `gcloud` using the render ID (see commands above). In local dev, use the `--logs` flag or the `render-logs` script.
To retry -- use `restart-render.ts` to reset DB state and re-enqueue the initializer.
For production DB access -- use `telecine/scripts/debug-prod-web --use-prod-db --shell` to get a container with production database connectivity.

Diagnosing Electron RPC Failures

Fragment renders fail via RPC timeout. The stack trace tells you how far rendering got:

`RPC.ts:182` -- initial 5 s timer fired, no keepalives received at all. Electron never started rendering. Causes: scheduler opened more connections than the single local container can handle (see below), Electron failed to load the page, or the render context couldn't be created.
`RPC.ts:153` -- keepalive-reset timer fired mid-render. At least one frame rendered before the hang. Causes: a race condition aborted an in-flight fetch (the `AbortError` case), a frame took too long, or Electron crashed mid-segment.
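
When triaging many failed fragments, the two cases above can be told apart mechanically from the stack trace. The sketch below is a hypothetical helper (`classifyRpcTimeout` does not exist in the repo) that keys off the line numbers quoted above; those numbers will drift as `RPC.ts` changes, so treat them as assumptions to re-check against the current source.

```typescript
type RpcTimeoutKind = "never-started" | "hung-mid-render" | "unknown";

// Hypothetical helper: maps a stack trace to the failure mode described above.
// The (?!\d) guard keeps "RPC.ts:182" from matching a longer number like "RPC.ts:1825".
function classifyRpcTimeout(stack: string): RpcTimeoutKind {
  if (/RPC\.ts:182(?!\d)/.test(stack)) return "never-started"; // initial 5 s timer
  if (/RPC\.ts:153(?!\d)/.test(stack)) return "hung-mid-render"; // keepalive-reset timer
  return "unknown";
}
```

A result of `"never-started"` points toward connection over-scaling or page load problems, while `"hung-mid-render"` points toward the `AbortError` race, a slow frame, or a crash.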

AbortError / FrameController race

If Electron logs show `[EF_FRAMEGEN.beginFrame] error: [object DOMException]` / `AbortError: The user aborted a request`, a fetch started during `seekForRender` was cancelled by an autonomous re-render firing concurrently. This is a timing-dependent race: `EFTemporal.updated()` or `EFTimegroup.updated()` fires when media loads and calls `FrameController.abort()`, killing the in-flight GCS fetch.

Fix: set `data-no-playback-controller` on the timegroup before `seekForRender` to suppress autonomous re-renders. This is the same attribute used on render clones. Check `initialize()` in `EF_FRAMEGEN.ts` and `updated()` in `EFTemporal.ts` / `EFTimegroup.ts`.
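
The ordering that matters in the fix (attribute first, seek second) can be sketched as follows. This is not the code in `EF_FRAMEGEN.ts`: the wrapper name, the `AttributeTarget` interface, and the `seek` callback are hypothetical, and the sketch assumes the attribute's presence alone is what suppresses re-renders, so it is set to an empty value.

```typescript
// Minimal structural type so the sketch works with any DOM element.
interface AttributeTarget {
  setAttribute(name: string, value: string): void;
}

// Hypothetical wrapper: marks the timegroup before the seek starts, so that
// media-load updates cannot abort fetches issued during the seek.
function seekWithPlaybackControllerDisabled<T extends AttributeTarget, R>(
  timegroup: T,
  seek: (timegroup: T) => R,
): R {
  // Assumption: an empty attribute value is sufficient.
  timegroup.setAttribute("data-no-playback-controller", "");
  return seek(timegroup);
}
```

In real code the `seek` callback would be whatever invokes `seekForRender` on the timegroup; its actual signature is not shown in this document.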

Scheduler over-scaling in local dev

In production, `MAX_WORKER_COUNT` controls how many Cloud Run instances the scheduler spins up. Locally there is one container per queue. If the scheduler opens more WebSocket connections than the single container expects (e.g. 30 connections for 30 queued jobs), every concurrent `renderFragment` RPC call beyond the worker's `WORKER_CONCURRENCY` quota times out at `RPC.ts:182` before Electron starts processing it.

The two dials and how they interact:

| Dial | Production | Local dev |
| --- | --- | --- |
| `MAX_WORKER_COUNT` | scales container count | must match `scale:` in docker-compose (usually 1) |
| `WORKER_CONCURRENCY` | jobs per container | effective parallelism when `MAX_WORKER_COUNT=1` |

Both are read from `telecine/.env` (via `env_file` in worker containers and `${VAR:-default}` substitution in `scheduler-go/docker-compose.yaml`). Changing them requires `telecine/scripts/docker-compose up -d` (not just `restart`) to force recreation with the new env.
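
The interaction between the two dials comes down to simple arithmetic. The sketch below is a rough model, assuming each container serves at most `WORKER_CONCURRENCY` RPC calls at once and that every connection beyond that capacity times out; `rpcCallsLikelyToTimeOut` is a hypothetical name, not a function in the scheduler.

```typescript
// Rough model: capacity is containers x per-container concurrency;
// connections beyond that are the calls expected to hit the initial timeout.
function rpcCallsLikelyToTimeOut(
  openConnections: number,
  containers: number,
  workerConcurrency: number,
): number {
  const capacity = containers * workerConcurrency;
  return Math.max(0, openConnections - capacity);
}
```

With the example above (30 connections, one local container) and an illustrative `WORKER_CONCURRENCY` of 4, `rpcCallsLikelyToTimeOut(30, 1, 4)` gives 26, which is why `MAX_WORKER_COUNT` needs to match the compose `scale:` locally.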

Key Source Files

`telecine/scripts/debug-render.ts` -- Primary debug tool implementation
`telecine/scripts/inspect-render.ts` -- Low-level Valkey render inspection
`telecine/scripts/check-queue.ts` -- Queue-level Valkey state checker
`telecine/scripts/restart-render.ts` -- Render restart/retry tool
`telecine/lib/queues/Queue.ts` -- Queue base class, key patterns
`telecine/lib/queues/Workflow.ts` -- Workflow base class, workflow key patterns
`telecine/lib/queues/Job.ts` -- Job serialization, enqueue, stall detection
`telecine/lib/queues/units-of-work/Render/` -- Render pipeline queue/worker definitions
`telecine/lib/queues/units-of-work/ProcessHtml/` -- HTML pipeline queue/worker definitions
`telecine/deploy/resources/queues/configs.ts` -- Production queue names and scaling config
`telecine/deploy/resources/queues/defineWorker.ts` -- Cloud Run service definition template
`telecine/deploy/resources/constants.ts` -- GCP project/region constants
`telecine/lib/valkey/valkey.ts` -- Valkey connection setup

When to Use This Skill

Use this skill when:

A production render is stuck, failed, or producing unexpected results
You need to trace a render through the pipeline to find where it stalled
You need to query Cloud Run logs for a specific render
You need to inspect or manipulate Valkey queue state
You need to restart a failed render
You need to understand the render pipeline architecture
