debug-production-renders
Debug Production Renders
Telecine renders flow through a queue-based pipeline backed by Valkey (Redis-compatible). Each render is a workflow that progresses through multiple queues, each served by a dedicated Cloud Run worker service. Debugging a render means checking its state at three layers: the Postgres database, the Valkey queue state, and the Cloud Run worker logs.
Quick Start
All debug scripts run inside Docker. From the monorepo root:

```bash
# Full render status: DB row, fragment breakdown, error detail
telecine/scripts/debug-render <render-id>

# Add live Valkey state (queued/claimed/completed/failed jobs)
telecine/scripts/debug-render <render-id> --redis

# Add docker compose log grep (local dev only)
telecine/scripts/debug-render <render-id> --logs

# All three
telecine/scripts/debug-render <render-id> --redis --logs
```

Render Pipeline Flow
A render progresses through queues in this order:
- process-html-initializer -- Preprocesses raw HTML from the API submission.
- process-html-finalizer -- Sets workflow data, then enqueues the render-initializer job. This is the bridge from the HTML pipeline into the render pipeline.
- render-initializer -- Checks out render source, extracts render info (dimensions, duration, fps) via Electron RPC, creates an assets metadata bundle, then fans out N fragment jobs. For still images (png/jpeg/webp), renders the image directly and skips fragments.
- render-fragment (N parallel) -- Each job renders one time slice of the video via Electron RPC, writes the fragment file to storage, and reports progress.
- render-finalizer -- Auto-triggered when all fragment jobs complete. Merges fragment files into the final output. Marks the render as complete.
The pipeline is triggered by a Hasura event on `INSERT` into `video2.renders`. The workflow system automatically routes the finalizer queue job when all child jobs complete or when any job fails with exhausted retries.
Queue names, worker service names, and resource allocations are defined in source files -- see "Key Source Files" below.
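The stage ordering above can be written down as data, which helps when a script needs to answer "which queue should this render reach next?". This is a minimal sketch: `PIPELINE_STAGES` and `nextStage` are hypothetical names, not telecine exports, and the sketch ignores the still-image shortcut and failure routing.

```typescript
// Queue names in the order listed above. The array and helper are
// hypothetical illustrations, not part of the telecine codebase.
const PIPELINE_STAGES = [
  "process-html-initializer",
  "process-html-finalizer",
  "render-initializer",
  "render-fragment",
  "render-finalizer",
] as const;

type Stage = (typeof PIPELINE_STAGES)[number];

// Returns the stage that follows `stage`, or null when `stage` is the last one.
function nextStage(stage: Stage): Stage | null {
  const i = PIPELINE_STAGES.indexOf(stage);
  return i < PIPELINE_STAGES.length - 1 ? PIPELINE_STAGES[i + 1] : null;
}
```

For example, `nextStage("render-initializer")` yields `"render-fragment"`, so a render with a finished initializer and no fragment jobs in Valkey points at the fan-out step.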
Debug Scripts
| Script | Purpose |
|---|---|
| `debug-render` | Primary debug tool: DB state, fragments, errors, optional Valkey + logs |
| `inspect-render.ts` | Lower-level Valkey inspection: workflow data, claimed jobs with ages, all workflow keys |
| `check-queue.ts` | Queue-level Valkey state: queued/claimed/failed counts, org membership |
| `restart-render.ts` | Restart a failed render: resets DB status, re-enqueues initializer job |
| | Create a test render from an existing render's org context |
| `render-logs` | Grep docker compose logs for initializer/fragment/finalizer services |
| | End-to-end smoke tests via the public API |
| | Waveform-specific smoke test: inserts renders directly into DB, exercises all ef-waveform modes concurrently |
| | Node REPL with project imports (db, valkey, queues available) |

Run `.ts` scripts via: `telecine/scripts/run tsx scripts/<script>.ts <args>`

Querying Cloud Run Logs
There are no dedicated log-querying scripts. Use `gcloud` directly. Workers log structured JSON via pino, so use `jsonPayload` filters:

```bash
# All render workers for a specific render ID
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name=~"telecine-worker-render" AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100 --format=json

# Specific worker stage
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="telecine-worker-render-initializer" AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100

# Errors only across all workers
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name=~"telecine-worker" AND severity>=ERROR AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=50
```

Cloud Run service names follow the pattern `telecine-worker-{queue-name}`. See `telecine/deploy/resources/queues/configs.ts` for the full list of queue names.
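The three filters above differ only in the service match and the optional severity clause, so they can be generated instead of hand-edited. The sketch below is illustrative: `buildLogFilter` and `workerServiceName` are hypothetical helpers, not existing telecine scripts, and the output is meant to be pasted into `gcloud logging read`.

```typescript
// Service name for a queue, following the telecine-worker-{queue-name} pattern.
function workerServiceName(queueName: string): string {
  return `telecine-worker-${queueName}`;
}

// Builds a Cloud Logging filter for one render. The service name is matched
// exactly when `exact` is true, otherwise as a regex (=~), as in the commands above.
function buildLogFilter(
  renderId: string,
  service: string,
  opts: { exact?: boolean; errorsOnly?: boolean } = {},
): string {
  const op = opts.exact ? "=" : "=~";
  const parts = [
    'resource.type="cloud_run_revision"',
    `resource.labels.service_name${op}"${service}"`,
  ];
  if (opts.errorsOnly) parts.push("severity>=ERROR");
  parts.push(`jsonPayload.renderId="${renderId}"`);
  return parts.join(" AND ");
}
```

Calling `buildLogFilter(id, workerServiceName("render-initializer"), { exact: true })` reproduces the "specific worker stage" filter for a given render ID.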
Valkey Key Schema
Queues and workflows use predictable Valkey key patterns. Understanding these lets you query state directly when scripts don't cover your case.
Queue-level keys (per queue name):
- `queues:{queueName}:queued` -- zset of job keys waiting to be claimed
- `queues:{queueName}:claimed` -- zset of job keys being processed (score = claim timestamp)
- `queues:{queueName}:completed` -- zset of completed job keys
- `queues:{queueName}:failed` -- zset of failed job keys
- `queues:{queueName}:jobs:{jobId}` -- serialized job data (SuperJSON)
- `queues:{queueName}:orgs` -- zset of org keys with active jobs
Workflow-level keys (per render ID):
- `workflows:{renderId}:queued` -- zset of workflow-level queued jobs
- `workflows:{renderId}:claimed` -- zset of claimed jobs
- `workflows:{renderId}:completed` -- zset of completed jobs
- `workflows:{renderId}:failed` -- zset of failed jobs
- `workflows:{renderId}:data` -- SuperJSON workflow payload (render config)
- `workflows:{renderId}:status` -- workflow status string
Progress tracking:
- `render:{renderId}` -- Redis stream for fragment completion progress
Stalled jobs are detected by checking claimed jobs whose score (claim timestamp) is older than 10 seconds.
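The key patterns and the stall rule can be sketched as a few pure helpers. Everything here is illustrative: `queueKey`, `workflowKey`, and `findStalled` are hypothetical names, and the sketch assumes the claim-timestamp score is epoch milliseconds, which the text above does not confirm. Check `Job.ts` for the real unit and threshold handling.

```typescript
type JobState = "queued" | "claimed" | "completed" | "failed";

// Key builders that follow the patterns listed above (hypothetical helpers).
const queueKey = (queueName: string, state: JobState): string =>
  `queues:${queueName}:${state}`;
const workflowKey = (renderId: string, suffix: JobState | "data" | "status"): string =>
  `workflows:${renderId}:${suffix}`;

interface ClaimedEntry {
  jobKey: string;
  claimedAtMs: number; // zset score; ASSUMED to be epoch milliseconds
}

const STALL_THRESHOLD_MS = 10_000; // "older than 10 seconds"

// Returns the job keys whose claim timestamp is more than 10 s before `nowMs`.
function findStalled(entries: ClaimedEntry[], nowMs: number): string[] {
  return entries
    .filter((e) => nowMs - e.claimedAtMs > STALL_THRESHOLD_MS)
    .map((e) => e.jobKey);
}
```

The entries would come from reading a `claimed` zset with scores (for example via `ZRANGE ... WITHSCORES`), with `queueKey("render-fragment", "claimed")` as the key.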
Database Tables
Render state is persisted in Postgres via Kysely (not Prisma). The `db` client is imported from `@/sql-client.server`.
- `video2.renders` -- Main render record: status, html, org_id, dimensions, fps, duration_ms, failure_detail, timestamps
- `video2.render_fragments` -- Per-segment fragment records: render_id, segment_id, attempt_number, timestamps, last_error
- `video2.process_html` -- HTML processing records
- `video2.files` -- File records (used by process-isobmff and ingest-image)
Render status values: `created` -> `queued` -> `rendering` -> `complete` | `failed`
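The status progression can be modelled as a small transition table, which makes "is this render finished or still in flight?" a one-line check. This is a sketch under assumptions: the type and helpers are hypothetical, and allowing `failed` from every non-terminal state is a guess, not something the schema confirms. It also ignores the reset performed by `restart-render.ts`.

```typescript
type RenderStatus = "created" | "queued" | "rendering" | "complete" | "failed";

// Forward transitions implied by the status list above. Reaching "failed"
// from every earlier state is an ASSUMPTION of this sketch.
const NEXT_STATUSES: Record<RenderStatus, readonly RenderStatus[]> = {
  created: ["queued", "failed"],
  queued: ["rendering", "failed"],
  rendering: ["complete", "failed"],
  complete: [],
  failed: [],
};

// True when `to` is a direct successor of `from` in the table above.
function canTransition(from: RenderStatus, to: RenderStatus): boolean {
  return NEXT_STATUSES[from].includes(to);
}

// True for statuses with no successors (complete, failed).
function isTerminal(status: RenderStatus): boolean {
  return NEXT_STATUSES[status].length === 0;
}
```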
Debugging Workflow
- Start with `debug-render` -- get the DB status, error detail, and fragment breakdown.
- If stuck in "rendering" -- add `--redis` to see if jobs are queued, claimed (possibly stalled), or silently failed in Valkey.
- If Valkey state is unclear -- use `inspect-render.ts` for detailed claimed job ages and `check-queue.ts` for queue-level counts.
- If you need worker logs -- query Cloud Run logs with `gcloud` using the render ID (see commands above). In local dev, use the `--logs` flag or the `render-logs` script.
- To retry -- use `restart-render.ts` to reset DB state and re-enqueue the initializer.
- For production DB access -- use `telecine/scripts/debug-prod-web --use-prod-db --shell` to get a container with production database connectivity.
Diagnosing Electron RPC Failures
Fragment renders fail via RPC timeout. The stack trace tells you how far rendering got:
- `RPC.ts:182` -- initial 5 s timer fired, no keepalives received at all. Electron never started rendering. Causes: scheduler opened more connections than the single local container can handle (see below), Electron failed to load the page, or the render context couldn't be created.
- `RPC.ts:153` -- keepalive-reset timer fired mid-render. At least one frame rendered before the hang. Causes: a race condition aborted an in-flight fetch (the `AbortError` case), a frame took too long, or Electron crashed mid-segment.
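When triaging many failed fragments, the two cases above can be told apart mechanically from the stack text. The helper below is a hypothetical sketch, not an existing script. It assumes the stack contains the literal `RPC.ts:<line>` location, and the line numbers will drift whenever `RPC.ts` changes, so treat them as values to re-check against the current source.

```typescript
type RpcTimeoutKind = "never-started" | "hung-mid-render" | "unknown";

// Classifies an RPC timeout by the RPC.ts line number in its stack trace.
// The (?!\d) guard keeps ":182" from matching longer numbers such as ":1820".
function classifyRpcTimeout(stack: string): RpcTimeoutKind {
  if (/RPC\.ts:182(?!\d)/.test(stack)) return "never-started"; // initial 5 s timer
  if (/RPC\.ts:153(?!\d)/.test(stack)) return "hung-mid-render"; // keepalive-reset timer
  return "unknown";
}
```

Feeding it the `last_error` text from `video2.render_fragments` (assuming that column holds the stack) gives a quick split between "Electron never started" and "Electron hung mid-segment".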
AbortError / FrameController race
If Electron logs show `[EF_FRAMEGEN.beginFrame] error: [object DOMException]` / `AbortError: The user aborted a request`, a fetch started during `seekForRender` was cancelled by an autonomous re-render firing concurrently. This is a timing-dependent race: `EFTemporal.updated()` or `EFTimegroup.updated()` fires when media loads and calls `FrameController.abort()`, killing the in-flight GCS fetch.
Fix: set `data-no-playback-controller` on the timegroup before `seekForRender` to suppress autonomous re-renders -- the same attribute used on render clones. Check `EF_FRAMEGEN.ts` `initialize()` and `EFTemporal.ts` / `EFTimegroup.ts` `updated()`.
Scheduler over-scaling in local dev
In production, `MAX_WORKER_COUNT` controls how many Cloud Run instances the scheduler spins up. Locally there is one container per queue. If the scheduler opens more WebSocket connections than the single container expects (e.g. 30 connections for 30 queued jobs), every concurrent `renderFragment` RPC call beyond the worker's `WORKER_CONCURRENCY` quota times out at `RPC.ts:182` before Electron starts processing it.
The two dials and how they interact:
| Dial | Production | Local dev |
|---|---|---|
| `MAX_WORKER_COUNT` | scales container count | must match |
| `WORKER_CONCURRENCY` | jobs per container | effective parallelism when |
Both are read from `telecine/.env` (via `env_file` in worker containers and `${VAR:-default}` substitution in `scheduler-go/docker-compose.yaml`). Changing them requires `telecine/scripts/docker-compose up -d` (not just `restart`) to force recreation with the new env.
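The over-scaling failure is plain arithmetic: any connection beyond what the single local container will work on at once is a call that waits until the 5 s timer fires. The sketch below is illustrative only. `excessCalls` is a hypothetical helper, it assumes one RPC call per open connection, and the concurrency value of 4 in the usage note is made up for the example.

```typescript
// Number of concurrent RPC calls that exceed a single local container's quota.
// Assumes one in-flight call per open WebSocket connection.
function excessCalls(openConnections: number, workerConcurrency: number): number {
  return Math.max(0, openConnections - workerConcurrency);
}
```

With 30 connections and a hypothetical `WORKER_CONCURRENCY` of 4, `excessCalls(30, 4)` is 26, so 26 fragment calls would be expected to fail at `RPC.ts:182` unless the two dials are brought back in line.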
Key Source Files
- `telecine/scripts/debug-render.ts` -- Primary debug tool implementation
- `telecine/scripts/inspect-render.ts` -- Low-level Valkey render inspection
- `telecine/scripts/check-queue.ts` -- Queue-level Valkey state checker
- `telecine/scripts/restart-render.ts` -- Render restart/retry tool
- `telecine/lib/queues/Queue.ts` -- Queue base class, key patterns
- `telecine/lib/queues/Workflow.ts` -- Workflow base class, workflow key patterns
- `telecine/lib/queues/Job.ts` -- Job serialization, enqueue, stall detection
- `telecine/lib/queues/units-of-work/Render/` -- Render pipeline queue/worker definitions
- `telecine/lib/queues/units-of-work/ProcessHtml/` -- HTML pipeline queue/worker definitions
- `telecine/deploy/resources/queues/configs.ts` -- Production queue names and scaling config
- `telecine/deploy/resources/queues/defineWorker.ts` -- Cloud Run service definition template
- `telecine/deploy/resources/constants.ts` -- GCP project/region constants
- `telecine/lib/valkey/valkey.ts` -- Valkey connection setup
When to Use This Skill
Use this skill when:
- A production render is stuck, failed, or producing unexpected results
- You need to trace a render through the pipeline to find where it stalled
- You need to query Cloud Run logs for a specific render
- You need to inspect or manipulate Valkey queue state
- You need to restart a failed render
- You need to understand the render pipeline architecture