StepFun stepaudio-2.5-asr
Transcribe audio with StepFun's stepaudio-2.5-asr (released 2026-04, verified 2026-04-23). Long audio goes through in one call with no chunking, but only if the request hits the right endpoint with the right body shape. The wrong endpoint returns an error that looks identical to "model doesn't exist", which is the #1 reason this skill exists.
Companion: for TTS with the sibling model, use the corresponding TTS skill. The two share an API key but live on different endpoints with different body shapes.
Why this skill exists — three traps that cost hours
- Wrong endpoint, wrong error. stepaudio-2.5-asr does not live on the endpoint that serves the older model family. It lives on a separate endpoint that takes SSE streaming, a JSON body, and base64 audio. Sending the request to the wrong endpoint returns {"error":{"message":"model stepaudio-2.5-asr not supported"}}, which is structurally identical to the error for a genuinely nonexistent model name. People waste hours filing whitelist tickets.
- Plan key vs Normal key, silent failure. StepFun's "Plan" subscription keys (cheap, text-only) cannot call audio endpoints, but the failure manifests as a 4xx with no auth-shaped error message. If your account has a Plan subscription, you need a separate "Normal" key from the same console.
- SSE error events are real. Censorship can fire on the ASR side too (rarely). Don't assume only delta and done events arrive; handle error events in the stream or you'll silently drop them.
Config and auth
API key resolves in this order (fail-fast, no defaults):
- an environment variable
- ${CLAUDE_PLUGIN_DATA}/config.json with an api_key field (cross-session persistence)
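The resolution order above could be sketched roughly as follows. This is a minimal illustration, not the bundled script's code: the variable name `STEPFUN_API_KEY` is a hypothetical stand-in (the actual name is not given here), and rejecting values that start with `<` is an assumed way to catch an unedited template placeholder.

```python
import json
import os
from pathlib import Path

# Hypothetical variable name: the real one is not stated in this document.
ENV_VAR = "STEPFUN_API_KEY"


def _usable(key):
    """A key is usable if it is non-empty and not an unedited <placeholder>."""
    return bool(key) and not key.startswith("<")


def resolve_api_key(env=None):
    """Return the key from the environment, then config.json, else raise."""
    env = os.environ if env is None else env

    # 1. The environment variable takes precedence.
    key = str(env.get(ENV_VAR, "")).strip()
    if _usable(key):
        return key

    # 2. Fall back to ${CLAUDE_PLUGIN_DATA}/config.json and its api_key field.
    data_dir = env.get("CLAUDE_PLUGIN_DATA", "")
    if data_dir:
        config_path = Path(data_dir) / "config.json"
        if config_path.is_file():
            config = json.loads(config_path.read_text(encoding="utf-8"))
            key = str(config.get("api_key", "")).strip()
            if _usable(key):
                return key

    # 3. Fail fast: never return a placeholder or an empty string.
    raise RuntimeError("No StepFun API key configured")
```

Passing `env` explicitly keeps the function testable without touching the real process environment.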
First-time setup:
bash
mkdir -p "${CLAUDE_PLUGIN_DATA}" && cat > "${CLAUDE_PLUGIN_DATA}/config.json" <<EOF
{"api_key": "<paste Normal key here>"}
EOF
If the user has not set a key, ask them to paste it; do not guess or use a placeholder. Get keys at https://platform.stepfun.com/ → API Keys. Use a Normal key, not a Plan key.
Quick start — single file
bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3
Output: plain text transcription on stdout.
For machine-readable output with usage / timing:
bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3 --json
For non-Chinese audio:
bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3 --language en
The script handles base64 encoding, the nested {audio: {data, input: {transcription, format}}} body, SSE parsing, and the misleading-endpoint pitfall. Prefer it over hand-rolled HTTP calls unless integrating into a larger pipeline.
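For readers integrating into a larger pipeline, the nested body shape can be illustrated with a small sketch. Only the nesting (`audio` containing `data` and `input`, with `input` containing `transcription` and `format`) comes from the description above; the contents of the `transcription` and `format` objects are left to the caller because their fields are documented in references/api_reference.md, not here.

```python
import base64
import json


def build_body(audio_bytes, transcription, audio_format):
    """Wrap raw audio bytes in the nested request shape.

    `transcription` and `audio_format` are caller-supplied dicts; their
    inner fields are not specified in this sketch.
    """
    return {
        "audio": {
            # Audio travels as base64 text inside the JSON body.
            "data": base64.b64encode(audio_bytes).decode("ascii"),
            "input": {
                "transcription": transcription,
                "format": audio_format,
            },
        }
    }


def encode_body(audio_bytes, transcription, audio_format):
    """Serialize the body to UTF-8 JSON bytes ready for an HTTP POST."""
    body = build_body(audio_bytes, transcription, audio_format)
    return json.dumps(body).encode("utf-8")
```

The bundled script remains the safer path; a helper like this is only worth writing when the request has to be issued from inside another service.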
Decision table
| Scenario | Action |
|---|---|
| Short clip (< 5 min), Chinese or English, mp3/wav/ogg/opus | python3 scripts/asr_transcribe.py audio.mp3 |
| Long audio (5-30 min) | Same script: the 32K context handles it in a single call, no chunking needed |
| Audio > 30 min | Split with ffmpeg before sending; the API rejects oversized payloads |
| Need usage/billing data | Add --json to capture usage and timing data |
| Highly repetitive content (same phrase 5+ times, > 90s) | Cross-validate with a second transcription source; see repetition hallucination in references/known_issues.md |
| Hit "model stepaudio-2.5-asr not supported" | Wrong endpoint. Switch from the endpoint for the older model family to the SSE streaming endpoint |
| Hit silent 4xx auth failure | Verify your key is "Normal" not "Plan"; Plan keys cannot call audio endpoints |
| Need to write raw HTTP (no Python) | Read references/api_reference.md for exact JSON body and SSE event shapes |
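For the "Audio > 30 min" row, one possible way to split a file is ffmpeg's segment muxer. The sketch below only builds and runs the command; the 1800-second segment length mirrors the 30-minute limit in the table, and the output pattern and function names are illustrative assumptions rather than part of the bundled script.

```python
import subprocess


def build_split_command(src, out_pattern, segment_seconds=1800):
    """Build an ffmpeg command that cuts `src` into fixed-length pieces.

    `out_pattern` needs a printf-style counter, e.g. "part_%03d.mp3".
    Stream copy (-c copy) avoids re-encoding, so cuts fall on packet
    boundaries and may not be exact to the second.
    """
    return [
        "ffmpeg",
        "-i", src,
        "-f", "segment",
        "-segment_time", str(segment_seconds),
        "-c", "copy",
        out_pattern,
    ]


def split_audio(src, out_pattern, segment_seconds=1800):
    """Run the split; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_split_command(src, out_pattern, segment_seconds), check=True)
```

Because a fixed-length cut can land mid-word, it may be worth reviewing the transcript around each segment boundary.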
Supported audio formats
The script auto-detects the format from the file extension; pass the format flag to override:
| Extension | Format flag | Notes |
|---|---|---|
| mp3 | | Most common, default |
| wav | | Lossless |
| ogg | | OGG container |
| opus | | Opus codec in OGG container; pass through unchanged |
| pcm | | Raw PCM; requires additional parameters (see references/api_reference.md) |
For mp4/m4a/webm/etc., transcode to one of the above first via ffmpeg. Production pipelines often pre-transcode everything to OGG/Opus 16kHz mono to minimize base64 payload size.
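As a rough sketch of the pre-transcode step, the helper below builds an ffmpeg command for OGG/Opus at 16 kHz mono and estimates how large the base64 payload will be. The function names are illustrative, and `libopus` is assumed to be available in the local ffmpeg build.

```python
import math
import subprocess


def build_transcode_command(src, dst):
    """ffmpeg command: downmix to mono, resample to 16 kHz, encode as Opus."""
    return [
        "ffmpeg",
        "-i", src,
        "-ac", "1",         # one audio channel (mono)
        "-ar", "16000",     # 16 kHz sample rate
        "-c:a", "libopus",  # Opus encoder; dst should end in .ogg
        dst,
    ]


def transcode(src, dst):
    """Run the transcode; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_transcode_command(src, dst), check=True)


def base64_length(raw_bytes):
    """Length in characters of the padded base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)
```

Base64 inflates the payload by about one third, which is why shrinking the file before encoding pays off on long recordings.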
Capacity and performance (verified 2026-04-23)
- 32K context window — single-call upper limit, no chunking needed for ≤ 30 min audio
- ~85-101× RTF (speed relative to real time) on long audio (17.4 min audio → 10.4 s wall clock, about 100×)
- ~5.3× speedup vs step-asr-1.1 at the 100s+ length range
- Only ~2× speedup at the 5-15s range — the LLM spin-up cost dominates short clips. If your workload is many short clips, the migration ROI is modest
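The speed figure in the list above is audio duration divided by wall-clock time. A small sketch of that arithmetic, using the 17.4-minute measurement quoted there (the function name is illustrative):

```python
def speed_factor(audio_seconds, wall_seconds):
    """How many seconds of audio are processed per second of wall clock."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# 17.4 minutes of audio transcribed in 10.4 seconds of wall clock.
factor = speed_factor(17.4 * 60, 10.4)
print(f"{factor:.1f}x real time")  # 100.4x real time
```

The same function can be applied to your own timings to see whether short clips really land near the ~2× end of the range.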
Common error patterns
| Error response | Actual cause | Fix |
|---|---|---|
| "model stepaudio-2.5-asr not supported" on the endpoint for the older model family | Wrong endpoint | Switch to the SSE streaming endpoint (script does this) |
| Silent 4xx with no auth message | Using a "Plan" key on audio endpoint | Get a "Normal" key from the StepFun console |
| ASR returns 3-4× expected character count | Repetition hallucination on highly-repetitive audio | Cross-validate with a second transcription source; see references/known_issues.md |
| data: {"type":"error","message":"content blocked..."} mid-stream | Censorship fired on user-uploaded content | Handle the SSE error event explicitly; don't assume only delta and done events arrive |

More edge cases in references/known_issues.md.
Design invariants (do not break)
- Always pass through SSE: don't try to buffer the response with a non-streaming client. The model emits delta events for long audio; the transcript.text.done event carries the authoritative full text and usage data. Reject the SSE format entirely and you'll get nothing.
- Take final text from transcript.text.done.text, because concatenated deltas can drift on edge cases. Deltas are for progressive UI; the done event is the source of truth.
- Handle error events in the stream: don't treat the SSE stream as if only success events arrive. A blocked-content event mid-stream arrives as an error event with no done event after it.
- Fail-fast on missing API key — never default to a placeholder or empty string. The script does this; preserve the behavior in any wrapper.
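The stream-handling invariants above could be sketched as a small parser over SSE lines. The error payload shape and the `transcript.text.done` text field come from this document; treating the event name as the value of a `type` field on every event, and skipping a `[DONE]` sentinel, are assumptions to verify against references/api_reference.md.

```python
import json


class TranscriptionError(RuntimeError):
    """Raised when the stream reports an error or never completes."""


def final_text_from_sse(lines):
    """Return the authoritative transcript from an iterable of SSE lines."""
    final_text = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":  # sentinel is an assumption
            continue
        event = json.loads(payload)
        kind = event.get("type")
        if kind == "error":
            # Surface blocked-content and other errors instead of dropping them.
            raise TranscriptionError(event.get("message", "unknown error"))
        if kind == "transcript.text.done":
            # The done event is the source of truth, not concatenated deltas.
            final_text = event["text"]
        # Any other event type (e.g. deltas) is progress-only and ignored here.
    if final_text is None:
        raise TranscriptionError("stream ended without transcript.text.done")
    return final_text
```

A progressive UI would additionally display delta events as they arrive, but the returned value should still come from the done event.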
When to read references
- references/api_reference.md: exact JSON request body, all fields, all SSE event types, response examples. Read when writing raw HTTP calls instead of using the bundled script.
- references/known_issues.md: repetition hallucination details, the wrong-endpoint diagnostic trail, Plan-vs-Normal key gotcha, ASR-side censorship handling, pricing opacity. Read when debugging anomalous output or evaluating whether to migrate from an older model such as step-asr-1.1.
Pricing (verified 2026-04-23, volatile)
stepaudio-2.5-asr is in invitation beta as of 2026-04-23, with no public per-minute rate. The baseline for the older model is 2.2 CNY/hour (2.2 元/小时). The invitation PDF mentions an 80% cost reduction ("成本直降 80%"), which implies about 0.44 CNY/hour (2.2 × 0.2), but this is not yet on the pricing page. Re-verify at https://platform.stepfun.com/docs/zh/guides/pricing/details before quoting to stakeholders.
Next Step
ASR transcripts almost always benefit from one of two downstream steps. After producing the transcript, suggest the natural next skill:
Transcription complete. The output is raw text from the model — common next steps:
Options:
A) transcript-fixer — clean up ASR errors (homophones, segmentation, filler words). Recommended if the recording is a real-world conversation, podcast, or interview rather than read-aloud text
B) meeting-minutes-taker — turn the transcript into structured minutes with decisions, action items, and speaker attribution. Recommended if the recording is a meeting
C) No thanks — the raw transcript is what I needed
Skip the suggestion when the user has already specified the downstream tool, or when the transcription was clearly a one-off lookup (e.g., "what does this 15-second clip say?").