asr-transcribe-to-text


ASR Transcribe to Text

Transcribe audio/video files to text using a configurable ASR endpoint (default: Qwen3-ASR-1.7B via vLLM). Configuration persists across sessions in `${CLAUDE_PLUGIN_DATA}/config.json`.

Step 0: Load or Initialize Configuration

```bash
cat "${CLAUDE_PLUGIN_DATA}/config.json" 2>/dev/null
```
If config exists, read the values and proceed to Step 1.
If config does not exist (first run), use AskUserQuestion:
First-time setup for ASR transcription.
I need to know where your ASR service is running so I can send audio to it.

RECOMMENDATION: Use the defaults below if you have Qwen3-ASR on a 4090 via Tailscale.

Q1: ASR Endpoint URL?
  A) http://workstation-4090-wsl:8002/v1/audio/transcriptions (Default — Qwen3-ASR vLLM via Tailscale)
  B) http://localhost:8002/v1/audio/transcriptions (Local machine)
  C) Let me enter a custom URL

Q2: Does your network have an HTTP proxy that might intercept LAN/Tailscale traffic?
  A) Yes — add --noproxy to bypass it (Recommended if you use Shadowrocket/Clash/corporate proxy)
  B) No — direct connection is fine
Save the config:
```bash
mkdir -p "${CLAUDE_PLUGIN_DATA}"
python3 -c "
import json
config = {
    'endpoint': 'USER_PROVIDED_ENDPOINT',
    'model': 'USER_PROVIDED_MODEL_OR_DEFAULT',
    'noproxy': True,  # or False based on user answer
    'max_timeout': 900
}
with open('${CLAUDE_PLUGIN_DATA}/config.json', 'w') as f:
    json.dump(config, f, indent=2)
print('Config saved.')
"
```

Step 1: Validate Input and Check Service Health

Read config and health-check in a single command (shell variables don't persist across Bash calls):
```bash
python3 -c "
import json, subprocess, sys
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
    cfg = json.load(f)
base = cfg['endpoint'].rsplit('/audio/', 1)[0]
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
result = subprocess.run(
    ['curl', '-s', '--max-time', '10'] + noproxy + [f'{base}/models'],
    capture_output=True, text=True
)
if result.returncode != 0 or not result.stdout.strip():
    print('HEALTH CHECK FAILED', file=sys.stderr)
    print(f'Endpoint: {base}/models', file=sys.stderr)
    print(f'stdout: {result.stdout[:200]}', file=sys.stderr)
    print(f'stderr: {result.stderr[:200]}', file=sys.stderr)
    sys.exit(1)
else:
    print(f'Service healthy: {base}')
    print(f'Model: {cfg[\"model\"]}')
"
```
If health check fails, use AskUserQuestion:
ASR service at [endpoint] is not responding.

Options:
A) Diagnose — check network, Tailscale, and service status step by step
B) Reconfigure — the endpoint URL might be wrong, let me re-enter it
C) Try anyway — send the transcription request and see what happens
D) Abort — I'll fix the service manually and come back later
For option A, diagnose in order:
  1. Network: `ping -c 1 HOST` or `tailscale status | grep HOST`
  2. Service: `tailscale ssh USER@HOST "curl -s localhost:PORT/v1/models"`
  3. Proxy: retry with `--noproxy '*'` toggled

Step 2: Extract Audio (if input is video)

For video files (mp4, mov, mkv, avi, webm), extract audio as 16 kHz mono MP3:

```bash
ffmpeg -i INPUT_VIDEO -vn -acodec libmp3lame -q:a 4 -ar 16000 -ac 1 OUTPUT.mp3 -y
```

For audio files (mp3, wav, m4a, flac, ogg), use directly — no conversion needed.
Get duration for progress estimation:

```bash
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 INPUT_FILE
```
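The ffprobe command prints a bare float in seconds. A small helper can turn that into the minutes figure used for progress estimation (helper names here are illustrative, not part of the plugin):

```python
# Parse ffprobe's bare-float duration output (seconds) and format it
# for progress estimation. Helper names are illustrative.
def parse_duration(ffprobe_stdout: str) -> float:
    return float(ffprobe_stdout.strip())

def human_duration(seconds: float) -> str:
    minutes, secs = divmod(int(round(seconds)), 60)
    return f'{minutes}m{secs:02d}s'

print(human_duration(parse_duration('3312.416000\n')))  # 55m12s
```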

Step 3: Transcribe — Single Request First

Always try a full-length single request first. Chunking causes sentence truncation at every split boundary — the model forces the last sentence to close and loses words. A single request means zero truncation at the fastest speed.
The Qwen3-ASR paper's "20-minute limit" is a training benchmark, not an inference hard limit. Empirically verified: 55 minutes transcribed in a single 76-second request on 4090 24GB.
```bash
python3 -c "
import json, subprocess, sys, os, tempfile
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
    cfg = json.load(f)
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
timeout = str(cfg.get('max_timeout', 900))
audio_file = 'AUDIO_FILE_PATH'  # replace with actual path
fd, output_json = tempfile.mkstemp(suffix='.json', prefix='asr_')
os.close(fd)

result = subprocess.run(
    ['curl', '-s', '--max-time', timeout] + noproxy + [
        cfg['endpoint'],
        '-F', f'file=@{audio_file}',
        '-F', f'model={cfg[\"model\"]}',
        '-o', output_json
    ], capture_output=True, text=True
)
if result.returncode != 0:
    print(f'curl failed (exit {result.returncode}): {result.stderr[:200]}', file=sys.stderr)
    sys.exit(1)

with open(output_json) as f:
    data = json.load(f)
os.unlink(output_json)
if 'text' not in data:
    print(f'ERROR: {json.dumps(data)[:300]}', file=sys.stderr)
    sys.exit(1)
text = data['text']
duration = data.get('usage', {}).get('seconds', 0)
print(f'Transcribed: {len(text)} chars, {duration}s audio', file=sys.stderr)
print(text)
" > OUTPUT.txt
```
Performance reference: ~400 characters per minute for Chinese speech; rates vary by language. Qwen3-ASR supports 52 languages including Chinese dialects, English, Japanese, Korean, and more.

Step 4: Verify and Confirm Output

After transcription, verify quality:
  1. Confirm the response contains a `text` field (not an error message)
  2. Check that the character count is plausible for the audio duration (~400 chars/min for Chinese)
  3. Show the user the first ~200 characters as a preview
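The checks above can be sketched as follows, assuming the ~400 chars/min rate for Chinese speech quoted in Step 3 (the ±50% tolerance is an illustrative choice, not a plugin constant):

```python
# Rough plausibility check for transcript length vs. audio duration.
# chars_per_min and tolerance are assumptions, not plugin constants.
def looks_plausible(n_chars: int, duration_s: float,
                    chars_per_min: float = 400.0,
                    tolerance: float = 0.5) -> bool:
    expected = duration_s / 60 * chars_per_min
    return expected * (1 - tolerance) <= n_chars <= expected * (1 + tolerance)

def preview(text: str, n: int = 200) -> str:
    return text[:n] + ('...' if len(text) > n else '')

# 10 minutes of Chinese audio -> expect roughly 4000 chars
print(looks_plausible(3900, 600))  # True: within tolerance
print(looks_plausible(120, 600))   # False: far too short
```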
If the output looks wrong (empty, garbled, or error), use AskUserQuestion:
Transcription may have an issue:
- Expected: ~[N] chars for [M] minutes of audio
- Got: [actual chars] chars
- Preview: "[first 100 chars...]"

Options:
A) Save as-is — the output looks fine to me
B) Retry with fallback — split into chunks and merge (handles long audio / OOM)
C) Reconfigure — try a different model or endpoint
D) Abort — something is wrong with the service
If the output is good, save it as `.txt` alongside the original file or to a user-specified location.

Step 5: Fallback — Overlap-Merge for Very Long Audio

If the single request fails (timeout, OOM, HTTP error), fall back to chunked transcription with overlap merging:

```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/overlap_merge_transcribe.py \
  --config "${CLAUDE_PLUGIN_DATA}/config.json" \
  INPUT_AUDIO OUTPUT.txt
```

This splits the audio into 18-minute chunks with a 2-minute overlap, then merges them using punctuation-stripped fuzzy matching. See references/overlap_merge_strategy.md for the algorithm details.
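The merge idea can be illustrated in miniature: normalize away punctuation, find the longest common run between consecutive chunks, and splice there. This is a simplified sketch, not the actual script; the real algorithm lives in references/overlap_merge_strategy.md:

```python
import difflib
import re

# Simplified sketch of punctuation-stripped fuzzy merging: any
# non-word character counts as punctuation (covers ASCII and CJK).
_PUNCT = re.compile(r'\W')

def _normalize(s: str) -> str:
    return _PUNCT.sub('', s)

def _raw_index(raw: str, norm_idx: int) -> int:
    # Map an index in the normalized string back to the raw string.
    count = 0
    for i, ch in enumerate(raw):
        if not _PUNCT.match(ch):
            count += 1
            if count == norm_idx:
                return i + 1
    return len(raw)

def merge_overlap(chunk_a: str, chunk_b: str) -> str:
    na, nb = _normalize(chunk_a), _normalize(chunk_b)
    m = difflib.SequenceMatcher(None, na, nb, autojunk=False)
    match = m.find_longest_match(0, len(na), 0, len(nb))
    if match.size == 0:
        return chunk_a + chunk_b  # no overlap detected
    cut_a = _raw_index(chunk_a, match.a + match.size)
    cut_b = _raw_index(chunk_b, match.b + match.size)
    return chunk_a[:cut_a] + chunk_b[cut_b:]

print(merge_overlap('the quick brown fox jumps over',
                    'fox jumps over the lazy dog'))
```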

Reconfigure

To change the ASR endpoint, model, or proxy settings:
```bash
rm "${CLAUDE_PLUGIN_DATA}/config.json"
```
Then re-run Step 0 to collect new values via AskUserQuestion.