audio-quality-check

Audio Recording Quality Analyzer

Comprehensive audio quality analysis for call recordings. Handles dual-track M4A files (system audio + mic), single-track recordings, and AEC-processed files.

Quick Start

Run the bundled analysis script on a recording directory:

```bash
python <skill-path>/scripts/analyze_recording.py "/path/to/recording/directory"
```

Modes for focused analysis:

```bash
python <skill-path>/scripts/analyze_recording.py /path --tracks   # track info only
python <skill-path>/scripts/analyze_recording.py /path --echo     # echo detection only
python <skill-path>/scripts/analyze_recording.py /path --quality  # quality metrics (skip echo)
```

For Blackbox recordings, the directory is typically:

```
~/Library/Application Support/Blackbox/Recordings/<timestamp-id>/
```

Dependencies

System: ffmpeg, ffprobe (brew install ffmpeg)

Python: numpy, soundfile, scipy, pyloudnorm, pesq, pystoi, librosa

Install all Python deps:

```bash
pip3 install numpy soundfile scipy pyloudnorm pesq pystoi librosa
```

What Each Metric Tells You

EBU R128 Loudness (pyloudnorm)

  • What: Perceptual loudness in LUFS (Loudness Units relative to Full Scale)
  • Target: -16 to -24 LUFS for speech
  • Watch for: AEC/post-processed tracks being significantly louder than originals (indicates the processing is amplifying without normalizing)
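
A minimal sketch of the loudness comparison described above. `measure_lufs` relies on pyloudnorm's `Meter` and `integrated_loudness`; `loudness_report` and its 3 LU jump tolerance are hypothetical names and values chosen for illustration, not taken from the bundled script.

```python
def measure_lufs(path):
    """Integrated loudness of an audio file in LUFS (needs soundfile + pyloudnorm)."""
    import soundfile as sf
    import pyloudnorm as pyln

    data, rate = sf.read(path)
    meter = pyln.Meter(rate)  # ITU-R BS.1770 meter, the basis of EBU R128
    return meter.integrated_loudness(data)


def loudness_report(original_lufs, processed_lufs, max_jump_lu=3.0):
    """Check the speech target range and flag a processed track that got much louder.

    max_jump_lu is an assumed tolerance, not a value from the analysis script.
    """
    jump = processed_lufs - original_lufs
    return {
        "in_target": -24.0 <= processed_lufs <= -16.0,
        "jump_lu": jump,
        "amplified_without_normalizing": jump > max_jump_lu,
    }


# Example with made-up readings: the processed track is 13.5 LU louder.
print(loudness_report(original_lufs=-23.0, processed_lufs=-9.5))
```

In practice the two arguments would come from `measure_lufs` called on the original and the processed track.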

Echo Detection - Autocorrelation

  • What: Detects delayed copies of the signal within a single track by correlating the signal with itself at various time offsets
  • How to read: Peaks in the 20-100ms range with correlation > 0.3 indicate signal duplication. The lag tells you the delay of the duplicate copy
  • Key insight: If you see a consistent peak at the same lag across multiple time segments, that's a systematic duplication (e.g., a virtual audio processor like Krisp introducing a delayed copy at ~53ms)
  • Normal values: Peaks below 0.15 are typically speech pitch harmonics (harmless). Peaks above 0.3 at consistent lags are echo
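
To make the reading rules above concrete, here is a small sketch that plants a delayed copy in synthetic noise and recovers its lag. It assumes an FFT-based autocorrelation and white noise in place of speech; the helper names are illustrative, and the bundled script may compute this differently.

```python
import numpy as np


def autocorr_fft(x):
    """Normalized autocorrelation for non-negative lags, computed via FFT."""
    x = x - np.mean(x)
    n = len(x)
    nfft = 1 << (2 * n - 1).bit_length()  # zero-pad to avoid circular wrap-around
    spec = np.fft.rfft(x, nfft)
    ac = np.fft.irfft(spec * np.conj(spec), nfft)[:n]
    return ac / ac[0]


def strongest_echo_peak(x, sr, min_ms=20.0, max_ms=100.0):
    """Return (lag_ms, correlation) of the largest autocorrelation value in the window."""
    ac = autocorr_fft(x)
    lo = int(min_ms / 1000 * sr)
    hi = int(max_ms / 1000 * sr)
    idx = int(np.argmax(ac[lo:hi])) + lo
    return idx / sr * 1000, float(ac[idx])


# Synthetic check: white noise plus a copy of itself delayed by about 53 ms.
sr = 16000
rng = np.random.default_rng(0)
dry = rng.standard_normal(2 * sr)
delay = round(0.053 * sr)
wet = dry.copy()
wet[delay:] += 0.6 * dry[:-delay]

lag_ms, r = strongest_echo_peak(wet, sr)
print(f"peak at {lag_ms:.1f} ms, r={r:.2f}")
```

On real speech the window also contains pitch-related peaks, so run this on several segments and look for a lag that stays the same.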

Cross-Track Correlation

  • What: Measures how much one track's content appears in another (e.g., system audio bleeding into the mic track)
  • How to read: Values near 0 mean no bleed. Values above 0.1 indicate the mic is picking up system audio
  • Coherence: Frequency-domain version of the same test. Voice-band coherence (300-3400Hz) is most relevant for speech echo
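
The two measurements above can be sketched as follows. This assumes equal-length tracks at the same sample rate and uses `scipy.signal.coherence` for the frequency-domain part; the function names and the 200 ms lag limit are illustrative choices, not the script's.

```python
import numpy as np
from scipy import signal


def peak_cross_correlation(reference, mic, sr, max_lag_ms=200.0):
    """Peak normalized cross-correlation of mic against reference, for lags >= 0.

    Returns (lag_ms, r), where a positive lag means the mic copy arrives later.
    """
    a = reference - np.mean(reference)
    b = mic - np.mean(mic)
    nfft = 1 << (2 * len(a) - 1).bit_length()
    xc = np.fft.irfft(np.fft.rfft(b, nfft) * np.conj(np.fft.rfft(a, nfft)), nfft)
    xc = xc[: int(max_lag_ms / 1000 * sr)] / (np.linalg.norm(a) * np.linalg.norm(b))
    idx = int(np.argmax(np.abs(xc)))
    return idx / sr * 1000, float(abs(xc[idx]))


def voice_band_coherence(reference, mic, sr, lo_hz=300.0, hi_hz=3400.0):
    """Mean magnitude-squared coherence inside the voice band."""
    freqs, coh = signal.coherence(reference, mic, fs=sr, nperseg=1024)
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return float(np.mean(coh[band]))


# Synthetic check: the mic holds local noise plus system audio leaking in 10 ms late.
sr = 16000
rng = np.random.default_rng(1)
system = rng.standard_normal(4 * sr)
mic = rng.standard_normal(4 * sr)
bleed_delay = round(0.010 * sr)
mic[bleed_delay:] += 0.3 * system[:-bleed_delay]

lag_ms, r = peak_cross_correlation(system, mic, sr)
coh = voice_band_coherence(system, mic, sr)
print(f"bleed: r={r:.2f} at {lag_ms:.1f} ms, voice-band coherence={coh:.2f}")
```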

PESQ - Speech Quality (requires reference + degraded)

  • What: ITU-T P.862 standard. Gives a MOS (Mean Opinion Score) comparing a degraded signal against a reference
  • Scale: 1.0 (bad) to 4.5 (excellent). NB = narrowband (phone quality), WB = wideband
  • Use for: Comparing AEC-processed mic vs original mic to see if processing helps or hurts
  • Thresholds: 4.0+ excellent, 3.0+ good, 2.5-3.0 fair, <2.5 poor

STOI - Speech Intelligibility (requires reference + degraded)

  • What: Short-Time Objective Intelligibility. Measures how understandable speech remains after processing
  • Scale: 0.0 to 1.0
  • Thresholds: >0.8 good, >0.6 fair, <0.6 poor
  • Key insight: If STOI drops significantly between original and processed, the processing is degrading intelligibility
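
A sketch of how the PESQ and STOI comparisons might be run and rated, using the `pesq` and `pystoi` packages listed under Dependencies. The rating helpers encode the thresholds listed in the PESQ and STOI sections; their names, and the choice to treat a STOI of exactly 0.6 as poor, are assumptions.

```python
def pesq_and_stoi(reference, degraded, sr=16000):
    """Score a degraded signal against a reference (needs the pesq and pystoi packages).

    Both inputs are 1-D float arrays of equal length; wideband PESQ requires 16 kHz audio.
    """
    from pesq import pesq
    from pystoi import stoi

    return pesq(sr, reference, degraded, "wb"), stoi(reference, degraded, sr, extended=False)


def rate_pesq(score):
    """Map a PESQ score onto the bands listed in the PESQ section."""
    if score >= 4.0:
        return "excellent"
    if score >= 3.0:
        return "good"
    if score >= 2.5:
        return "fair"
    return "poor"


def rate_stoi(score):
    """Map a STOI score onto the bands listed in the STOI section."""
    if score > 0.8:
        return "good"
    if score > 0.6:
        return "fair"
    return "poor"


# Example with made-up scores.
print(rate_pesq(3.4), rate_stoi(0.72))
```

For the AEC comparison, the original mic track would be the reference and the AEC-processed track the degraded signal.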

Spectral Analysis (librosa)

  • Centroid: Average frequency weighted by amplitude. Higher = brighter/harsher audio
  • Rolloff (85%): Frequency below which 85% of spectral energy sits. Lower = more bass-heavy
  • Zero-crossing rate: How often the signal crosses zero. Higher = noisier signal. Speech is typically 0.05-0.20; values above 0.30 suggest significant noise
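
The three definitions above can be sketched in plain numpy. This is a whole-signal simplification written for illustration: librosa's `spectral_centroid`, `spectral_rolloff`, and `zero_crossing_rate` work frame by frame, so their values will differ somewhat on real recordings.

```python
import numpy as np


def spectral_summary(x, sr, rolloff_pct=0.85):
    """Whole-signal centroid (Hz), rolloff (Hz), and zero-crossing rate."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    # Centroid: amplitude-weighted mean frequency.
    centroid = float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
    # Rolloff: first frequency where cumulative energy reaches rolloff_pct of the total.
    cum_energy = np.cumsum(mag ** 2)
    rolloff = float(freqs[np.searchsorted(cum_energy, rolloff_pct * cum_energy[-1])])
    # ZCR: fraction of adjacent sample pairs whose sign differs.
    zcr = float(np.mean(np.signbit(x[1:]) != np.signbit(x[:-1])))
    return centroid, rolloff, zcr


# Sanity check on a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t + 0.3)
centroid, rolloff, zcr = spectral_summary(tone, sr)
print(f"centroid={centroid:.0f} Hz, rolloff={rolloff:.0f} Hz, zcr={zcr:.3f}")
```

A pure tone crosses zero twice per cycle, so the rate here is about 2 × 1000 / 16000 = 0.125, inside the range quoted for speech.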

SNR - Signal-to-Noise Ratio

  • What: Ratio of speech energy to background noise energy (estimated via energy-based VAD)
  • Thresholds: >20dB excellent, >15dB good, >10dB fair, <10dB poor
  • Note: This measures background noise, not echo. A recording can have excellent SNR but still have echo problems
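
One possible energy-based estimate is sketched below. The VAD rule (a threshold halfway, in dB, between the 10th and 90th percentile frame energies) is an assumption made for this example; the script's actual VAD may differ. Note that "speech" frames still contain noise, so this is strictly a (speech + noise) to noise ratio.

```python
import numpy as np


def estimate_snr_db(x, sr, frame_ms=20.0):
    """Estimate SNR from frame energies using a simple energy threshold as the VAD."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1) + 1e-12
    energy_db = 10 * np.log10(energy)
    # Assumed VAD rule: threshold halfway (in dB) between quiet and loud frames.
    threshold = 0.5 * (np.percentile(energy_db, 10) + np.percentile(energy_db, 90))
    speech = energy[energy_db > threshold]
    noise = energy[energy_db <= threshold]
    if speech.size == 0 or noise.size == 0:
        return float("nan")  # no contrast between frames, nothing to compare
    return float(10 * np.log10(np.mean(speech) / np.mean(noise)))


# Synthetic check: a constant noise floor with louder bursts every other second.
sr = 16000
rng = np.random.default_rng(2)
x = 0.01 * rng.standard_normal(6 * sr)
for sec in (1, 3, 5):
    x[sec * sr:(sec + 1) * sr] += 0.1 * rng.standard_normal(sr)

snr_db = estimate_snr_db(x, sr)
print(f"estimated SNR: {snr_db:.1f} dB")
```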

Per-Minute Energy

  • What: RMS energy and voice-band energy per minute of recording
  • Use for: Spotting segments that went silent (mic cut out), got unexpectedly loud (clipping risk), or had activity patterns that help identify when speakers were active
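
A minimal sketch of the per-minute RMS part, with a simple dropout flag. The 20 dB drop rule and the function names are assumptions for illustration; voice-band energy is left out to keep the example short.

```python
import numpy as np


def per_minute_rms_db(x, sr, window_s=60):
    """RMS level in dBFS for each full or partial window of the recording."""
    step = int(window_s * sr)
    levels = []
    for start in range(0, len(x), step):
        chunk = x[start:start + step]
        rms = np.sqrt(np.mean(chunk ** 2))
        levels.append(20 * np.log10(rms + 1e-10))
    return np.array(levels)


def flag_dropouts(levels_db, drop_db=20.0):
    """Indices of windows more than drop_db below the median level (assumed rule)."""
    return [int(i) for i in np.where(levels_db < np.median(levels_db) - drop_db)[0]]


# Synthetic check: three minutes of noise with the middle minute silenced.
sr = 8000
rng = np.random.default_rng(3)
x = 0.1 * rng.standard_normal(3 * 60 * sr)
x[60 * sr:120 * sr] = 0.0
levels = per_minute_rms_db(x, sr)
print(np.round(levels, 1), "dropouts:", flag_dropouts(levels))
```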

Manual Analysis Recipes

When you need analysis beyond what the script provides, these patterns are useful.

Extract individual tracks from dual-track M4A

```bash
ffmpeg -y -i audio.m4a -map 0:0 -ac 1 -ar 16000 /tmp/system.wav
ffmpeg -y -i audio.m4a -map 0:1 -ac 1 -ar 16000 /tmp/mic.wav
```

Quick loudness check with sox

```bash
# stat writes its report to stderr; it lists peak and RMS amplitude, not LUFS
sox audio.wav -n stat 2>&1
```

Check specific time range for echo (Python)

```python
import numpy as np
import soundfile as sf
from scipy import signal

data, sr = sf.read('/tmp/system.wav')

# Analyze 5 seconds starting at 2 minutes
start = 120 * sr
seg = data[start:start + 5 * sr]
seg_norm = seg / (np.max(np.abs(seg)) + 1e-10)
autocorr = np.correlate(seg_norm, seg_norm, mode='full')
mid = len(seg_norm) - 1  # index of zero lag in the 'full' output
autocorr = autocorr / autocorr[mid]

# Check 20-100ms range for echo peaks
min_lag = int(0.020 * sr)
max_lag = int(0.100 * sr)
region = autocorr[mid + min_lag:mid + max_lag]
peaks, props = signal.find_peaks(region, height=0.1)
for i, p in enumerate(peaks[:5]):
    lag_ms = (p + min_lag) / sr * 1000
    print(f"  Peak at {lag_ms:.1f}ms, r={props['peak_heights'][i]:.3f}")
```

Common Issues and What Causes Them

| Symptom | Likely cause | What to check |
| --- | --- | --- |
| Speakers sound slightly doubled/echoed | Virtual audio processor (Krisp) creating delayed copy in system audio | Autocorrelation: consistent peak at 40-60ms |
| Mic track has remote speakers' voices | Acoustic echo (speakers to mic) | Cross-track correlation > 0.1 |
| AEC-processed file sounds worse | DTLN-aec degrading signal quality | PESQ/STOI comparing original vs processed |
| AEC-processed file is too loud | Missing loudness normalization after processing | Loudness: processed > -10 LUFS |
| Recording has hiss/noise | Low SNR, noisy mic, or AGC artifacts | SNR < 15dB, high zero-crossing rate |
| Quiet segments mid-recording | Mic cut out or device changed | Per-minute energy: sudden RMS drop |
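
As a rough illustration, several of these checks could be encoded in a small triage helper. The function name and the metric dictionary keys are hypothetical; the thresholds are the ones quoted in this document.

```python
def triage(metrics):
    """Return likely causes for a dict of measurements; missing keys are skipped."""
    findings = []
    peak = metrics.get("autocorr_peak")  # hypothetical key: (lag_ms, correlation)
    if peak and 40 <= peak[0] <= 60 and peak[1] > 0.3:
        findings.append("delayed copy in system audio (virtual audio processor)")
    if metrics.get("cross_track_corr", 0.0) > 0.1:
        findings.append("acoustic echo: system audio bleeding into mic")
    if metrics.get("processed_lufs", float("-inf")) > -10.0:
        findings.append("missing loudness normalization after processing")
    if metrics.get("snr_db", float("inf")) < 15.0:
        findings.append("low SNR, noisy mic, or AGC artifacts")
    return findings


# Example with made-up measurements.
for finding in triage({"autocorr_peak": (53.0, 0.45), "snr_db": 12.0}):
    print("-", finding)
```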