Voice to Text Skill

Converts speech to text using Volcano Engine BigModel ASR. Compared with local whisper, it is more accurate, handles multiple languages better, and runs faster.

Core Execution Flow

Feishu voice message received (`message_type: audio`) and the voice content needs to be recognized automatically
User provides audio to convert to text:
- First run `inspect_audio.py`
- Then select `asr_flash.py` (Express Edition) or `asr_standard.py` (Standard Edition) according to duration, size, and whether the input is a URL or a local path
Missing ffmpeg / ffprobe: first run `ensure_ffmpeg.py --execute`
User asks about installation, activation, or manual configuration: read the corresponding document listed in the references at the end of this document

Mandatory Rules (Highest Priority)

When you receive a voice message or audio file attachment:

Use only the scripts of this Skill to recognize speech
- Do not use the `whisper` command or the openai-whisper skill
- No fallback: when a script fails, report the error information to the user directly; do not switch to whisper

Detect before recognition: always run `python3 <SKILL_DIR>/scripts/inspect_audio.py "<AUDIO_INPUT>"` first

Install ffmpeg/ffprobe autonomously when they are missing: run `python3 <SKILL_DIR>/scripts/ensure_ffmpeg.py --execute` first, and ask the user for help only if that fails
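
The ordering of the last two rules can be sketched as a small pre-flight check. This is only an illustration: how `ensure_ffmpeg.py` actually detects the tools is not described here, so the `shutil.which` lookup and the helper names `missing_tools` and `next_step` are assumptions.

```python
import shutil
from typing import Callable, List, Optional

REQUIRED_TOOLS = ("ffmpeg", "ffprobe")


def missing_tools(which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the required tools that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]


def next_step(missing: List[str]) -> str:
    """Pick the command to run next: install first if anything is missing."""
    if missing:
        return "python3 <SKILL_DIR>/scripts/ensure_ffmpeg.py --execute"
    return 'python3 <SKILL_DIR>/scripts/inspect_audio.py "<AUDIO_INPUT>"'
```

Calling `next_step(missing_tools())` yields the install command only when a tool is absent; otherwise it goes straight to the metadata check.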

Usage Steps

Confirm the audio source (local file, URL or Feishu voice file_key).
`cd` to this skill's directory before running the scripts: `skills/byted-voice-to-text`.
Run the corresponding command (see the parameter descriptions below).
Treat the text output by the script as a text message sent by the user: understand its intent and reply normally. Do not add an explanation such as "the speech recognition result is xxx"; just answer the user's question directly.
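
A minimal sketch of these steps from Python, assuming the scripts print the recognized text on stdout and exit with a non-zero code on failure (neither is stated explicitly in this document). The helpers `skill_command` and `run_and_capture` are hypothetical names introduced for illustration.

```python
import subprocess
import sys
from typing import List, Sequence


def skill_command(script: str, *args: str) -> List[str]:
    """Build the argv for a script under scripts/, relative to the skill directory."""
    return [sys.executable, f"scripts/{script}", *args]


def run_and_capture(argv: Sequence[str], cwd: str) -> str:
    """Run a command inside the skill directory and return its stdout.

    On failure the stderr text is raised as-is so it can be relayed to the
    user; no alternative recognizer is tried.
    """
    completed = subprocess.run(list(argv), cwd=cwd, capture_output=True, text=True)
    if completed.returncode != 0:
        raise RuntimeError(completed.stderr.strip() or f"exit code {completed.returncode}")
    return completed.stdout.strip()
```

For example, `run_and_capture(skill_command("asr_flash.py", "--file", "<AUDIO_FILE>"), cwd="skills/byted-voice-to-text")` would return the text that is then handled like an ordinary user message.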

Routing Cheat Sheet

Local File

Condition	Script
Duration ≤ 2h and size ≤ 100MB	`asr_flash.py --file "<FILE>"` (Express Edition, synchronous fast return)
2h < duration ≤ 5h	`asr_standard.py --file "<FILE>"` (Standard Edition, asynchronous submit+poll)
Duration > 5h	Not supported; slice the file first and process the slices one by one with the Express Edition
Duration unavailable and size ≤ 100MB	`asr_flash.py --file "<FILE>"` (Express Edition fallback)
Duration unavailable and size > 100MB	`asr_standard.py --file "<FILE>"` (Standard Edition fallback)

Public Network URL

Use `asr_standard.py --url "<URL>"` directly by default
Do not first download the file locally, inspect it, transcode it, and only then route
Only when the Standard Edition actually fails, decide from the error whether to enter the local download/slicing chain

When the input is a URL, a large file, or needs a slicing decision, read routing_strategy.md again.
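
The cheat sheet above can be expressed as one routing function. This is a sketch under stated assumptions, not the logic of the actual scripts: the function name `route` and its return labels are hypothetical, 100MB is taken as 100 × 1024 × 1024 bytes, and the case "duration ≤ 2h but size > 100MB", which the table does not cover, is sent to the Standard Edition.

```python
from typing import Optional
from urllib.parse import urlparse

HOUR = 3600
MB = 1024 * 1024  # assumption: the document does not say whether MB is binary or decimal

FLASH_MAX_SECONDS = 2 * HOUR
FLASH_MAX_BYTES = 100 * MB
STANDARD_MAX_SECONDS = 5 * HOUR


def is_public_url(audio_input: str) -> bool:
    """Treat http(s) inputs as public network URLs."""
    return urlparse(audio_input).scheme in ("http", "https")


def route(audio_input: str,
          duration_s: Optional[float] = None,
          size_bytes: Optional[int] = None) -> str:
    """Return 'flash', 'standard', or 'slice' for the given input."""
    if is_public_url(audio_input):
        return "standard"  # URLs go straight to asr_standard.py --url
    if size_bytes is None:
        raise ValueError("size_bytes is required for local files")
    if duration_s is None:
        # Duration unavailable: fall back on size alone.
        return "flash" if size_bytes <= FLASH_MAX_BYTES else "standard"
    if duration_s > STANDARD_MAX_SECONDS:
        return "slice"  # not supported directly; slice first
    if duration_s <= FLASH_MAX_SECONDS and size_bytes <= FLASH_MAX_BYTES:
        return "flash"
    # 2h < duration <= 5h, plus the uncovered "short but > 100MB" case (assumption).
    return "standard"
```

In practice `duration_s` and `size_bytes` would come from the output of `inspect_audio.py`; routing_strategy.md remains the authority for the edge cases.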

Environment Variables and Authentication

Authentication uses the new console scheme; for details, see Quick Start (New Console).

Environment Variable	Purpose	Required
`MODEL_SPEECH_API_KEY`	API Key (new console solution)	Yes
`MODEL_SPEECH_APP_ID`	App ID (used with legacy authentication)	No
`MODEL_SPEECH_ASR_API_BASE`	Express Edition endpoint (has default value)	No
`MODEL_SPEECH_ASR_RESOURCE_ID`	Express Edition resource ID (default `volc.bigasr.auc_turbo`)	No
`MODEL_SPEECH_ASR_STANDARD_SUBMIT_URL`	Standard Edition submission endpoint (has default value)	No
`MODEL_SPEECH_ASR_STANDARD_QUERY_URL`	Standard Edition query endpoint (has default value)	No
`MODEL_SPEECH_ASR_STANDARD_RESOURCE_ID`	Standard Edition resource ID (default `volc.bigasr.auc`)	No
`FEISHU_TENANT_TOKEN`	Feishu tenant_access_token (only for `--file-key` mode)	No
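
As an illustration of how the table reads in code, the sketch below loads the variables with the two resource-ID defaults listed above. The function `load_settings`, the dictionary keys, and the error message wording are assumptions; the endpoint variables are left out because their default values are not given in this document.

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_FLASH_RESOURCE_ID = "volc.bigasr.auc_turbo"
DEFAULT_STANDARD_RESOURCE_ID = "volc.bigasr.auc"


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Read the ASR settings from the environment, failing early without an API key."""
    api_key = env.get("MODEL_SPEECH_API_KEY", "").strip()
    if not api_key:
        # Hypothetical message; the real script's wording may differ.
        raise PermissionError("MODEL_SPEECH_API_KEY is not set")
    return {
        "api_key": api_key,
        "app_id": env.get("MODEL_SPEECH_APP_ID"),  # optional, legacy authentication
        "flash_resource_id": env.get("MODEL_SPEECH_ASR_RESOURCE_ID",
                                     DEFAULT_FLASH_RESOURCE_ID),
        "standard_resource_id": env.get("MODEL_SPEECH_ASR_STANDARD_RESOURCE_ID",
                                        DEFAULT_STANDARD_RESOURCE_ID),
        "feishu_tenant_token": env.get("FEISHU_TENANT_TOKEN"),  # only for --file-key
    }
```

Only `MODEL_SPEECH_API_KEY` is mandatory, so a plain dictionary with that single key is enough to exercise the defaults.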

Script List

Script	Purpose	Corresponding Mode
`scripts/inspect_audio.py`	Audio metadata detection (duration, sampling rate, channel, etc.)	Pre-check
`scripts/ensure_ffmpeg.py`	Automatically detect and install ffmpeg/ffprobe	Pre-check
`scripts/asr_flash.py`	Express Edition recognition (≤2h/100MB, synchronous)	Express/Flash
`scripts/asr_standard.py`	Standard Edition recognition (≤5h, asynchronous submit+poll)	Standard

Minimal Script Examples

```bash
# Pre-check: detect audio metadata
python3 <SKILL_DIR>/scripts/inspect_audio.py "<AUDIO_INPUT>"

# Install automatically when ffmpeg is missing
python3 <SKILL_DIR>/scripts/ensure_ffmpeg.py --execute

# Express Edition (short audio, ≤2h/100MB)
python3 <SKILL_DIR>/scripts/asr_flash.py --file "<AUDIO_FILE>"

# Standard Edition (long audio or URL)
python3 <SKILL_DIR>/scripts/asr_standard.py --url "<AUDIO_URL>"
python3 <SKILL_DIR>/scripts/asr_standard.py --file "<LONG_AUDIO_FILE>"

# Standard Edition: submit only, without polling
python3 <SKILL_DIR>/scripts/asr_standard.py --url "<URL>" --no-poll

# Standard Edition: query an existing task
python3 <SKILL_DIR>/scripts/asr_standard.py --query-task-id <ID> --query-logid <LOGID>
```

asr_flash.py (Express Edition) Parameters

Parameter	Required	Description
`--file`	Choose one of three	Local audio file path
`--url`	Choose one of three	URL address of audio file
`--file-key`	Choose one of three	file_key of Feishu voice message
`--feishu-token`	No	Feishu tenant_access_token
`--appid`	No	App ID
`--token`	No	API Key
`--language`	No	Language code

asr_standard.py (Standard Edition) Parameters

Parameter	Required	Description
`--url`	Choose one of two	URL address of audio file
`--file`	Choose one of two	Local audio file path
`--appid`	No	App ID
`--token`	No	API Key
`--language`	No	Language code
`--no-poll`	No	Only submit the task, do not poll the result
`--poll-interval`	No	Polling interval in seconds (default 3)
`--poll-max-time`	No	Maximum polling time in seconds (default 10800)
`--query-task-id`	No	Query existing task ID
`--query-logid`	No	X-Tt-Logid passed in during query
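
The submit+poll behavior controlled by `--poll-interval` and `--poll-max-time` can be pictured with a generic polling loop. This is a sketch, not the script's implementation: `query` is a hypothetical callable that returns `None` while the task is still running and the recognized text once it is done, and the response format of the real query endpoint is not covered here.

```python
import time
from typing import Callable, Optional


def poll_until_done(query: Callable[[], Optional[str]],
                    interval_s: float = 3.0,       # mirrors the --poll-interval default
                    max_time_s: float = 10800.0,   # mirrors the --poll-max-time default
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Call query() every interval_s seconds until it returns text or time runs out."""
    deadline = clock() + max_time_s
    while True:
        result = query()
        if result is not None:
            return result
        if clock() + interval_s > deadline:
            raise TimeoutError(f"task not finished within {max_time_s} seconds")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; with `--no-poll` the equivalent would be to skip this loop and query later by task ID.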

Feishu Voice Message Processing Flow

Receive audio message → audio file is already downloaded to `/root/.openclaw/media/inbound/` → run `asr_flash.py --file` → text is returned → process it as a user message

Common commands:

```bash
# Feishu voice file (most common case; the file has already been downloaded by the Feishu plugin)
python3 scripts/asr_flash.py --file "/root/.openclaw/media/inbound/xxxxx.ogg"
```

Error Handling

`PermissionError: MODEL_SPEECH_API_KEY ...` → Prompt the user to configure the API Key
`ASR request failed` → Check the API credentials and account
`Audio duration exceeds 5 hours` → Prompt the user to split the file
`Audio file does not exist/is empty` → Check the file path

When an error occurs, report the specific error to the user directly; do not try to substitute whisper.
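
The error table can be sketched as a lookup that always relays the original error and appends the suggested action. The substring markers below are taken from the messages listed above, but the exact text the scripts emit may differ, so the matching is an assumption; `user_message` is a hypothetical helper name.

```python
# (marker, suggested action) pairs; markers are matched as substrings.
ERROR_ACTIONS = (
    ("MODEL_SPEECH_API_KEY", "Configure the API Key"),
    ("ASR request failed", "Check the API credentials and account"),
    ("Audio duration exceeds 5 hours", "Split the file"),
    ("Audio file", "Check the file path"),
)


def user_message(error_text: str) -> str:
    """Build the reply for a failed run: the specific error plus a hint, if one matches."""
    error_text = error_text.strip()
    for marker, action in ERROR_ACTIONS:
        if marker in error_text:
            return f"{error_text}\nSuggested action: {action}"
    # Unknown errors are still reported verbatim; no other recognizer is tried.
    return error_text
```

There is deliberately no branch that retries with another recognizer, in line with the no-fallback rule.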

When to Continue Reading References

URL / large file / slicing / routing details: Read routing_strategy.md

Reference Documents

Volcano Engine BigModel ASR
Quick Start (New Console) — Authentication and activation
API Key Usage
