# FunASR Speech-to-Text
This skill provides a local speech-recognition service that converts audio and video files into structured Markdown documents.
## Feature Overview
- Supports multiple audio and video formats (mp4, mov, mp3, wav, m4a, flac, etc.)
- Automatically generates timestamps
- Supports speaker diarization
- Outputs in Markdown format for easy reading and editing
## Usage Workflow

### First-time Use: Install Dependencies and Download Models
Run the installation script, `scripts/setup.py`, to complete the environment configuration. The script will automatically:
- Check Python version (requires >= 3.8)
- Install dependency packages (FastAPI, Uvicorn, FunASR, PyTorch)
- Download ASR models to `~/.cache/modelscope/hub/models/`
Verify installation status:

```bash
python scripts/setup.py --verify
```
### Start Transcription Service

Start the service with `python scripts/server.py`; it listens on `http://127.0.0.1:8765` by default.
Smart Features:
- Auto-start: Automatically loads models on first request
- Idle Shutdown: Automatically shuts down after 10 minutes of inactivity by default to save resources
- Configurable Timeout: use the `--idle-timeout` parameter to customize the idle timeout (in seconds)
Service Lifecycle:
- Enters idle monitoring state after startup
- Automatically loads models and executes transcription when receiving a request
- Resets the idle timer for each request
- Automatically shuts down if no requests are received for 10 consecutive minutes
- Restarts on next request
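The idle-shutdown lifecycle above can be sketched with a resettable timer. This is a minimal illustration, not the actual server code; the class and callback names are invented for the sketch, and the 600-second default matches the documented 10-minute timeout:

```python
import threading


class IdleShutdown:
    """Shut a service down after `timeout` seconds without requests (illustrative sketch)."""

    def __init__(self, timeout=600, on_shutdown=lambda: None):
        self.timeout = timeout
        self.on_shutdown = on_shutdown
        self._timer = None

    def start(self):
        """Enter idle monitoring: arm the shutdown timer."""
        self._arm()

    def touch(self):
        """Call on every request: cancel and re-arm the idle timer."""
        if self._timer is not None:
            self._timer.cancel()
        self._arm()

    def _arm(self):
        self._timer = threading.Timer(self.timeout, self.on_shutdown)
        self._timer.daemon = True
        self._timer.start()
```

In a real server, `touch()` would be called from a request middleware so that every transcription request resets the countdown.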
Important Notes:
- ⚠️ Do not manually shut down the service - leave it running after a transcription completes; it will shut down automatically after 10 minutes of inactivity
- This allows continuous transcription of multiple files without restarting the service repeatedly
- To shut down the service immediately, press `Ctrl+C` in the server terminal, or wait for the 10-minute idle timeout
Example: customize a 30-minute idle timeout:

```bash
python scripts/server.py --idle-timeout 1800
```
### Execute Transcription
Use the client script to transcribe files:
```bash
# Transcribe a single file
python scripts/transcribe.py /path/to/audio.mp3

# Specify output path
python scripts/transcribe.py /path/to/video.mp4 -o transcript.md

# Enable speaker diarization
python scripts/transcribe.py /path/to/meeting.m4a --diarize

# Batch transcribe a directory
python scripts/transcribe.py /path/to/media_folder/
```
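When given a directory, batch transcription presumably selects the supported media files inside it. A minimal sketch of that selection step, with the extension list taken from the Feature Overview (the helper name is illustrative, not part of the actual client):

```python
from pathlib import Path

# Extensions listed in the Feature Overview; "etc." suggests the real list is longer.
MEDIA_EXTS = {".mp4", ".mov", ".mp3", ".wav", ".m4a", ".flac"}

def find_media_files(folder):
    """Return supported media files under `folder`, sorted for a stable order."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in MEDIA_EXTS
    )
```

Each returned path could then be submitted to the transcription service one at a time, so a single running server handles the whole batch.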
### AI Intelligent Summary (Claude Code Environment)
After transcription completes, you can generate an AI summary using Claude Code's native AI capabilities.
Workflow:
- After transcription, the script will automatically prepare summary prompts
- Send the prompts to Claude AI to generate structured summaries
- Paste the JSON result returned by Claude back into the script
- Automatically inject the summary into the Markdown file
Usage:
```bash
# Transcribe a single file (will automatically prompt whether to generate a summary)
python scripts/transcribe.py /path/to/audio.mp3

# Enable speaker diarization and generate summary
python scripts/transcribe.py /path/to/meeting.m4a --diarize --summary
```
Summary Content Structure:
- Full Text Summary - Over 400 words, including background, issues, and key facts
- Speaker Summary - Each speaker's viewpoints, attitudes, and contributions
- Key Content - 6-10 core points
- Keywords - 5-8 key terms
Prompt Features:
- Optimized specifically for Chinese colloquial conversations
- Retains speaker context and dialogue flow
- Structured JSON output for easy parsing and formatting
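A sketch of what handling the pasted JSON reply might look like: validate the expected top-level fields, then render them as a Markdown section. The field names here are assumptions derived from the structure listed above, not the script's actual schema:

```python
import json

# Assumed field names, mirroring the documented summary structure.
EXPECTED_KEYS = {"full_text_summary", "speaker_summary", "key_points", "keywords"}

def render_summary(raw_json):
    """Parse Claude's JSON reply and render an '## AI Summary' Markdown section."""
    data = json.loads(raw_json)
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"summary JSON missing fields: {sorted(missing)}")
    lines = ["## AI Summary", "", data["full_text_summary"], "", "### Key Content"]
    lines += [f"- {point}" for point in data["key_points"]]
    lines += ["", "**Keywords:** " + ", ".join(data["keywords"])]
    return "\n".join(lines)
```

Validating before injecting means a truncated or malformed paste fails loudly instead of writing a half-empty summary into the transcript.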
For detailed documentation, please refer to: <references/api-reference.md>
## Call via HTTP API
Check Service Status:
```bash
curl http://127.0.0.1:8765/health
```
Call the API directly using curl:
```bash
curl -X POST http://127.0.0.1:8765/transcribe \
  -H "Content-Type: application/json" \
  -d '{"file_path": "/path/to/audio.mp3"}'
```
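The same call can be made from Python with only the standard library. The endpoint and payload below match the curl example; the function names are illustrative, not part of the shipped client:

```python
import json
import urllib.request

def build_request(file_path, base_url="http://127.0.0.1:8765"):
    """Build the POST /transcribe request with a JSON body."""
    body = json.dumps({"file_path": file_path}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/transcribe",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def transcribe(file_path, base_url="http://127.0.0.1:8765"):
    """Send the request to a running service and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(file_path, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

`transcribe()` requires the service to be running; recall that the first request also triggers model loading, so it may take noticeably longer than later ones.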
API Documentation (Swagger UI):

FastAPI automatically generates interactive API documentation; visit `http://127.0.0.1:8765/docs`.
On this page, you can:
- View all API endpoints
- Test APIs online (no curl required)
- View request/response formats
- View detailed parameter descriptions
Response Example (Health Check):
```json
{
  "status": "ok",
  "service": "FunASR Transcribe",
  "uptime": 300,
  "idle_time": 120
}
```
Return field descriptions:
- `uptime`: service running time (in seconds)
- `idle_time`: current idle time (in seconds)
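Given those fields, a client can estimate how long until the idle shutdown fires. A small sketch assuming the default 600-second timeout (the example response does not report the configured timeout, so it must be supplied by the caller):

```python
import json

def seconds_until_shutdown(health_json, idle_timeout=600):
    """Estimate remaining idle time before auto-shutdown from a /health reply."""
    health = json.loads(health_json)
    return max(0, idle_timeout - health["idle_time"])
```

With the example response above (`idle_time` of 120 and the default 10-minute timeout), the service would shut down in roughly 480 seconds unless another request arrives.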
## Complete API Documentation
For detailed API reference documentation, please refer to: <references/api-reference.md>
It includes:
- Complete specifications for all API endpoints
- Detailed explanations of request/response formats
- Parameter descriptions and examples
- Complete curl command examples
## Script Description

| Script | Purpose |
|---|---|
| `scripts/setup.py` | One-click installation of dependencies and model download |
| `scripts/server.py` | Start the HTTP API service |
| `scripts/transcribe.py` | Command-line client |
## Configuration Files

| File | Description |
|---|---|
|  | ASR model configuration list |
|  | Python dependency list |
## Output Format

Transcription results are saved as Markdown files, including:
- Title - the file name (without a transcription timestamp)
- Transcription Content - each segment as a speaker label and timestamp, followed by the transcribed text on a new line
- AI Summary (optional) - full text summary, speaker summary, key content, and keywords
Example Format:

```markdown
# Transcription: filename.mp4

## Transcription Content

Speaker1 00:00:01
This is the content of the first sentence.

Speaker2 00:00:05
This is the content of the second sentence.
```
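The layout above can be produced from timed segments with a short formatter. This is a sketch, not the script's actual implementation; the `(speaker, start_seconds, text)` tuple shape is an assumption made for the example:

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as HH:MM:SS, matching the example layout."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def render_transcript(title, segments):
    """Render Markdown from `segments`: (speaker, start_seconds, text) tuples."""
    lines = [f"# Transcription: {title}", "", "## Transcription Content", ""]
    for speaker, start, text in segments:
        lines.append(f"{speaker} {format_timestamp(start)}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```
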
## Model Information

Models are stored in the ModelScope default cache directory `~/.cache/modelscope/hub/models/`:
- ASR Main Model (Paraformer) - 867MB
- VAD Model - 4MB
- Punctuation Model - 283MB
- Speaker Diarization Model - 28MB
## Troubleshooting

If the service fails to start, run the verification command to check the installation status:

```bash
python scripts/setup.py --verify
```
Re-download models (`--skip-deps` skips dependency installation):

```bash
python scripts/setup.py --skip-deps
```