extract-frames

Extract Frames

Arguments

Parse the user's request for:
  • video path (required): path to the video file
  • --last: extract last frame of each shot instead of first
  • --first --last or --all: extract both first and last frames
  • --shots N,N,N: only process specific shot numbers (1-indexed)
  • --threshold X: override auto-detected threshold (skip user confirmation)
Default: extract first frame of every shot.
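The flag handling above can be sketched as a small shell function. This is only an illustration: `parse_args` and the variables `VIDEO`, `MODE`, `SHOTS`, and `THRESHOLD` are hypothetical names, not part of the skill itself.

```shell
# Hypothetical helper: parse_args is an illustrative name, not part of the skill.
# Sets VIDEO, MODE (first|last|both), SHOTS, and THRESHOLD as globals.
parse_args() {
  VIDEO=""; SHOTS=""; THRESHOLD=""
  want_first=0; want_last=0
  while [ $# -gt 0 ]; do
    case "$1" in
      --first)     want_first=1 ;;
      --last)      want_last=1 ;;
      --all)       want_first=1; want_last=1 ;;
      --shots)     SHOTS="$2"; shift ;;
      --threshold) THRESHOLD="$2"; shift ;;
      -*)          echo "unknown option: $1" >&2; return 1 ;;
      *)           VIDEO="$1" ;;
    esac
    shift
  done
  [ -n "$VIDEO" ] || { echo "video path is required" >&2; return 1; }
  if [ "$want_first" -eq 1 ] && [ "$want_last" -eq 1 ]; then MODE=both
  elif [ "$want_last" -eq 1 ]; then MODE=last
  else MODE=first   # default: first frame of every shot
  fi
}

parse_args clip.mp4 --last --shots 3,5,7
echo "$VIDEO $MODE $SHOTS"   # prints: clip.mp4 last 3,5,7
```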

Workflow

Phase 0: Setup

  1. Validate the video file exists.
  2. Get video metadata via ffprobe:
bash
ffprobe -v error -select_streams v:0 -show_entries format=duration -show_entries stream=r_frame_rate,width,height,codec_name -of default "$INPUT"
  3. Parse fps (needed for last-frame calculation) and duration.
  4. Create output directory next to the video: {video_basename}_frames/. If it already exists, check for scores.txt; if present, skip Phase 1.
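As a rough sketch of the parsing in this phase: ffprobe reports r_frame_rate as a rational such as 30000/1001, so it has to be divided out before it can be used in the last-frame arithmetic. The helper names `fps_from_rate` and `frames_dir_for` are hypothetical.

```shell
# fps_from_rate and frames_dir_for are hypothetical helper names.
# Convert ffprobe's rational r_frame_rate (e.g. 30000/1001) to a decimal fps.
fps_from_rate() {
  echo "$1" | awk -F/ '{
    if ($2 == "" || $2 == 0) print $1 + 0      # no usable denominator
    else printf "%.3f\n", $1 / $2
  }'
}

# Build the {video_basename}_frames directory path next to the video.
frames_dir_for() {
  dir=$(dirname "$1")
  base=$(basename "$1")
  echo "$dir/${base%.*}_frames"                # strip the last extension only
}

fps_from_rate "30000/1001"          # prints 29.970
frames_dir_for "/videos/demo.mp4"   # prints /videos/demo_frames
```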

Phase 1: Score Dump

Dump per-frame scene scores for the entire video:
bash
ffmpeg -i "$INPUT" -vf "select='gte(scene,0)',metadata=print:file=$OUTPUT_DIR/scores.txt" -fps_mode vfr -f null - 2>&1
This is the expensive step. The output file scores.txt contains blocks like:
frame:0    pts:0       pts_time:0.000000
lavfi.scene_score=0.000000
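Because each frame occupies two lines, it can help to flatten the file into one "timestamp score" row per frame before further analysis. A minimal sketch, using a synthetic sample in the same format; `flatten_scores` is a hypothetical name.

```shell
# Hypothetical helper: flatten the two-line blocks into "timestamp score" rows.
flatten_scores() {
  awk '
    /pts_time/    { split($0, a, "pts_time:"); ts = a[2] + 0 }
    /scene_score/ { split($0, a, "=");         printf "%.6f %.6f\n", ts, a[2] + 0 }
  ' "$1"
}

# Synthetic sample in the same format as scores.txt.
sample=$(mktemp)
printf '%s\n' \
  'frame:0    pts:0       pts_time:0.000000' \
  'lavfi.scene_score=0.000000' \
  'frame:1    pts:512     pts_time:0.040000' \
  'lavfi.scene_score=0.312500' > "$sample"
flatten_scores "$sample"
rm -f "$sample"
```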

Phase 2: Cut Detection

Parse scores.txt with this awk script to get the score distribution and every candidate frame (score > 0.05):
bash
awk '
/pts_time/ { split($0, a, "pts_time:"); ts=a[2]+0 }
/scene_score/ { split($0, a, "="); score=a[2]+0;
  if (score < 0.01) b1++;
  else if (score < 0.05) b2++;
  else if (score < 0.10) b3++;
  else if (score < 0.20) b4++;
  else b5++;
  total++;
  if (score > max) max=score;
  if (score > 0.05) printf "  ts=%.3fs  score=%.6f\n", ts, score;
}
END {
  print "\n--- Distribution ---";
  printf "< 0.01:      %d (%.1f%%)\n", b1, b1/total*100;
  printf "0.01 - 0.05: %d (%.1f%%)\n", b2, b2/total*100;
  printf "0.05 - 0.10: %d (%.1f%%)\n", b3, b3/total*100;
  printf "0.10 - 0.20: %d (%.1f%%)\n", b4, b4/total*100;
  printf "0.20+:       %d (%.1f%%)\n", b5, b5/total*100;
  printf "Max score:   %.6f\n", max;
}' "$OUTPUT_DIR/scores.txt"
Step 2a — Startup artifact filter: Discard any frames in the first 0.5s where score > 0.05 (a fixed preliminary value; the final user-confirmed threshold is not known yet). Fade-ins from black or codec initialization commonly produce score=1.0 spikes at t=0.03-0.08s that are not real cuts. After discarding these, recompute max score from the remaining frames.
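One way to apply this filter is to skip the offending frames while scanning scores.txt and track the maximum of what remains. The sketch below assumes the 0.5s and 0.05 constants from this step; the function name `max_after_startup_filter` is hypothetical.

```shell
# Hypothetical helper: ignore candidates (score > 0.05) in the first 0.5 s,
# then report the max score among the remaining frames.
max_after_startup_filter() {
  awk '
    /pts_time/    { split($0, a, "pts_time:"); ts = a[2] + 0 }
    /scene_score/ { split($0, a, "="); score = a[2] + 0
                    if (ts < 0.5 && score > 0.05) next   # startup artifact
                    if (score > max) max = score }
    END { printf "%.6f\n", max + 0 }
  ' "$1"
}

# Synthetic sample: a spike at t=0.04s followed by ordinary noise.
sample=$(mktemp)
printf '%s\n' \
  'frame:1  pts:1  pts_time:0.040000' 'lavfi.scene_score=1.000000' \
  'frame:50 pts:50 pts_time:2.000000' 'lavfi.scene_score=0.030000' > "$sample"
max_after_startup_filter "$sample"   # prints 0.030000
rm -f "$sample"
```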
Step 2b — Branch on max score. If max score (after startup filter) < 0.05, the video has no cuts:
  • Report: "No shot boundaries detected — single continuous shot."
  • Extract only frame at t=0 (or t=1.0s if t=0 is a black frame — check file size, <50KB indicates black).
  • Skip threshold confirmation.
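The black-frame check for the single-shot case could look roughly like this. It assumes "50KB" means 50 * 1024 bytes, and both function names are hypothetical.

```shell
# Hypothetical helper: treat a PNG smaller than 50 KB as a black frame.
# Assumption: "50KB" is read as 50 * 1024 = 51200 bytes.
is_black_frame() {
  size=$(wc -c < "$1")
  [ "$((size))" -lt 51200 ]
}

# Pick the timestamp to use for a single-shot video.
# Argument: path of the frame already extracted at t=0.
single_shot_timestamp() {
  if is_black_frame "$1"; then echo "1.0"; else echo "0"; fi
}
```

If `single_shot_timestamp` prints 1.0, the frame at t=0 would be re-extracted at t=1.0s.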
Step 2c — Gap analysis and threshold. If max score >= 0.05, use the raw candidates (all frames with score > 0.05, after startup filter) to find the noise ceiling and the lowest-scoring candidate. Place the proposed threshold at the midpoint of the gap. Present to user via AskUserQuestion:
  • Score distribution summary
  • Gap analysis (noise ceiling → lowest cut, gap width)
  • Proposed threshold and resulting shot count
  • Options: Accept proposed (Recommended), Lower threshold, Higher threshold, Custom value
If --threshold was provided, skip confirmation and set $THRESHOLD_VALUE directly. Otherwise set $THRESHOLD_VALUE to the user-confirmed value before proceeding.
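The gap analysis in Step 2c can be approximated as follows. This sketch rests on an assumption the text does not spell out: the "gap" is taken to be the widest interval between adjacent candidate scores once sorted, with 0.05 (the candidate cutoff) as the floor. `propose_threshold` is a hypothetical name.

```shell
# Hypothetical helper: propose_threshold reads one candidate score per line.
# Assumption: the gap is the widest interval between adjacent sorted scores,
# starting from 0.05 (the candidate cutoff) as the floor.
propose_threshold() {
  LC_ALL=C sort -n | awk '
    BEGIN { prev = 0.05 }
    { gap = $1 - prev
      if (gap > best) { best = gap; lo = prev; hi = $1 }
      prev = $1 }
    END { if (NR == 0) exit 1
          printf "noise_ceiling=%.4f lowest_cut=%.4f threshold=%.4f\n", lo, hi, (lo + hi) / 2 }
  '
}

printf '%s\n' 0.06 0.07 0.31 0.45 0.52 | propose_threshold
# prints: noise_ceiling=0.0700 lowest_cut=0.3100 threshold=0.1900
```

In this sample the widest gap lies between 0.07 and 0.31, so the proposed threshold is their midpoint, 0.19.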
Step 2d — Run-based dedup. Using $THRESHOLD_VALUE from Step 2c, identify cut points. A "run" is a sequence of consecutive frames where every frame scores above threshold. The principle: an aftershock immediately follows its parent cut (consecutive frames both above threshold), while a real cut always rises from the noise floor (preceded by at least one below-threshold frame).
bash
awk -v THRESHOLD="$THRESHOLD_VALUE" '
/pts_time/ { split($0, a, "pts_time:"); ts=a[2]+0 }
/scene_score/ { split($0, a, "="); score=a[2]+0;
  if (score > THRESHOLD) {
    if (!in_run) { run_best_ts = ts; run_best_score = score; in_run = 1; }
    else if (score > run_best_score) { run_best_ts = ts; run_best_score = score; }
  } else {
    if (in_run) { printf "%.3f %.6f\n", run_best_ts, run_best_score; in_run = 0; }
  }
}
END { if (in_run) printf "%.3f %.6f\n", run_best_ts, run_best_score }
' "$OUTPUT_DIR/scores.txt"
This keeps the peak frame of each run and discards aftershocks within the same run. It correctly handles:
  • Standard cuts: isolated spikes → each kept (run length 1)
  • Cuts with aftershocks: 2-3 consecutive high frames → peak kept, echoes discarded
  • Rapid montages: each cut separated by noise frames → all kept, even at 0.12s intervals

Phase 3: Frame Extraction

For each detected cut point (plus t=0 for shot 1):
First frame (default):
bash
ffmpeg -y -ss $TIMESTAMP -i "$INPUT" -frames:v 1 -update 1 "$OUTPUT_DIR/shot_${NUM}_${TIME}s.png"
Last frame (--last):
  • For shots 1 through N-1: last_time = next_shot_timestamp - (1/fps)
  • For final shot: last_time = duration - (2/fps) (use 2 frames back, not 1; seeking to duration - 1/fps can produce empty files near the end of some videos)
bash
ffmpeg -y -ss $LAST_TIME -i "$INPUT" -frames:v 1 -update 1 "$OUTPUT_DIR/shot_${NUM}_last_${TIME}s.png"
If a last-frame extraction produces an empty file (0 bytes), back off by another frame and retry.
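A possible shape for the back-off-and-retry loop is sketched below. `extract_frame`, `extract_with_backoff`, and the `EXTRACT` override hook are hypothetical names, and the retry limit of 5 is an arbitrary choice rather than something the workflow specifies.

```shell
# Hypothetical sketch: all function names and the EXTRACT hook are illustrative.
# Default extractor: the same ffmpeg call used for last frames.
extract_frame() {   # args: timestamp, output path
  ffmpeg -nostdin -y -ss "$1" -i "$INPUT" -frames:v 1 -update 1 "$2" 2>/dev/null
}

# Retry one frame earlier each time the output file comes back empty.
extract_with_backoff() {   # args: start time, fps, output path, max attempts
  t="$1"; fps="$2"; out="$3"; tries="${4:-5}"
  while [ "$tries" -gt 0 ]; do
    "${EXTRACT:-extract_frame}" "$t" "$out"
    [ -s "$out" ] && { echo "$t"; return 0; }     # non-empty file: done
    # Step back one frame interval, never below zero.
    t=$(awk -v t="$t" -v fps="$fps" 'BEGIN { t -= 1 / fps; if (t < 0) t = 0; printf "%.3f", t }')
    tries=$((tries - 1))
  done
  return 1
}
```

On success the function prints the timestamp that finally produced a non-empty file, which can then be used in the output filename.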
Both (--first --last or --all): extract both per shot.
Filtered (--shots 3,5,7): only extract for the specified shot numbers.
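The timestamp arithmetic for this phase can be sketched as a table of shots built from the cut list. `shot_table` and `filter_shots` are hypothetical names; fps and duration are assumed to come from the ffprobe step, and the guard that keeps last_time from falling before the shot start is an addition not described above.

```shell
# Hypothetical helper: shot_table reads cut timestamps (one per line, ascending)
# and prints "shot_number first_time last_time" for each shot.
shot_table() {   # args: fps, duration
  awk -v fps="$1" -v dur="$2" '
    BEGIN { n = 1; start[1] = 0 }            # shot 1 always starts at t=0
    { start[++n] = $1 + 0 }
    END {
      for (i = 1; i <= n; i++) {
        last = (i < n) ? start[i + 1] - 1 / fps : dur - 2 / fps
        if (last < start[i]) last = start[i]   # added guard for very short shots
        printf "%02d %.3f %.3f\n", i, start[i], last
      }
    }'
}

# Keep only the shots named in a --shots list such as "1,3" (1-indexed).
filter_shots() {
  awk -v want="$1" '
    BEGIN { n = split(want, w, ","); for (i = 1; i <= n; i++) keep[w[i] + 0] = 1 }
    keep[$1 + 0]
  '
}

printf '%s\n' 2.480 4.280 | shot_table 25 6.0
# 01 0.000 2.440
# 02 2.480 4.240
# 03 4.280 5.920
printf '%s\n' 2.480 4.280 | shot_table 25 6.0 | filter_shots 1,3
```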

Output

  • Directory: {video_basename}_frames/ next to the input video
  • First frames: shot_01_0.00s.png, shot_02_1.63s.png, ...
  • Last frames: shot_01_last_1.60s.png, shot_02_last_3.77s.png, ...
  • scores.txt: always retained for re-runs at different thresholds
Report to user: total shots detected, shot list with timestamps, output directory path.

Re-run Behavior

If scores.txt already exists in the output directory, skip Phase 1 entirely and go straight to Phase 2 analysis. This makes threshold iteration near-instant: the user can re-run with --threshold 0.15 without waiting for the score dump again.
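The check itself is a one-line file test. In the sketch below, `needs_score_dump` is a hypothetical name, and treating an empty scores.txt as missing is an assumption beyond what the text states.

```shell
# Hypothetical helper: succeeds when Phase 1 must run, i.e. when scores.txt
# is missing from the output directory.
# Assumption: an empty scores.txt is treated the same as a missing one.
needs_score_dump() {
  [ ! -s "$1/scores.txt" ]
}

OUTPUT_DIR="${OUTPUT_DIR:-./demo_frames}"
if needs_score_dump "$OUTPUT_DIR"; then
  echo "Phase 1: dumping scene scores"
else
  echo "Reusing $OUTPUT_DIR/scores.txt, skipping Phase 1"
fi
```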

Shell Portability

Use pipe-based loops for frame extraction instead of array syntax (zsh handles for over arrays differently than bash):
bash
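# Number the timestamps (01, 02, ...) and extract one frame per shot.
echo "0.000 2.480 4.280" | tr ' ' '\n' | awk '{printf "%02d %s\n", NR, $1}' | while read -r LABEL TS; do
  TS_FMT=$(printf "%.2f" "$TS")
  # -nostdin keeps ffmpeg from consuming the rest of the piped timestamp list
  ffmpeg -nostdin -y -ss "$TS" -i "$INPUT" -frames:v 1 -update 1 "$OUTPUT_DIR/shot_${LABEL}_${TS_FMT}s.png" 2>/dev/null
done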

Edge Cases

  • Single continuous shots: Handled by step 2b. Common in fashion videos with rack-focus reveals, slow wardrobe progression, or single-take lifestyle shots.
  • Startup artifacts: Fade-ins from black produce score=1.0 at t=0.03-0.08s. Step 2a discards frames with score > 0.05 in the first 0.5s. If t=0 itself is a black frame (PNG < 50KB), extract at t=1.0s instead.
  • Aftershock spikes: Consecutive above-threshold frames (a cut + its echo). Run-based dedup in step 2d keeps only the peak of each run — no temporal window needed.
  • Rapid montages: Videos where shots are 2-4 frames long (0.08-0.16s). Each cut rises from the noise floor with 1-2 noise frames between spikes. Run-based dedup correctly preserves every cut because no two spikes are frame-adjacent. Report montage segments to the user: "Detected rapid montage from Xs-Ys with N shots."
  • Dissolves/fades: Score gradually ramps over multiple frames — forms a single run. Run-based dedup takes the peak frame as the cut point.
  • Empty file on seek: If ffmpeg produces a 0-byte PNG (common near video end), back off by one frame interval and retry.
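For the rapid-montage report, one possible approach is to group cuts that sit close together in time. The sketch below assumes that consecutive cuts no more than 0.2s apart belong to the same montage and that a segment needs at least 3 cuts to be worth reporting; both limits and the name `report_montages` are illustrative, not part of the skill.

```shell
# Hypothetical helper: report_montages reads ascending cut timestamps, one per line.
# Assumptions: cuts no more than 0.2 s apart form one montage segment, and a
# segment is reported only if it contains at least 3 cuts.
report_montages() {   # optional args: max gap in seconds, minimum cuts
  awk -v maxgap="${1:-0.2}" -v mincuts="${2:-3}" '
    function emit_segment() {
      if (count >= mincuts)
        printf "Detected rapid montage from %.2fs-%.2fs with %d shots.\n", first, prev, count
    }
    NR > 1 && $1 - prev <= maxgap { count++; prev = $1; next }   # extend segment
    { emit_segment(); first = $1; prev = $1; count = 1 }         # start a new one
    END { emit_segment() }
  '
}

printf '%s\n' 1.00 5.00 5.12 5.24 5.36 9.00 | report_montages
# prints: Detected rapid montage from 5.00s-5.36s with 4 shots.
```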