stt-integration
This skill provides comprehensive guidance for implementing ElevenLabs Speech-to-Text (STT) capabilities using the Scribe v1 model, which supports 99 languages with state-of-the-art accuracy, speaker diarization for up to 32 speakers, and seamless Vercel AI SDK integration.
Core Capabilities
Scribe v1 Model Features
- Multi-language support: 99 languages with varying accuracy levels
- Speaker diarization: Up to 32 speakers with identification
- Word-level timestamps: Precise synchronization for video/audio alignment
- Audio event detection: Identifies sounds like laughter and applause
- High accuracy: Optimized for accuracy over real-time processing
Supported Formats
- Audio: AAC, AIFF, OGG, MP3, Opus, WAV, WebM, FLAC, M4A
- Video: MP4, AVI, Matroska, QuickTime, WMV, FLV, WebM, MPEG, 3GPP
- Limits: Max 3 GB file size, 10 hours duration
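The limits above can be checked before uploading. The following is a minimal sketch, not the contents of validate-audio.sh: the names `MediaInfo` and `checkLimits` are hypothetical, and reading "3 GB" as 3 × 1024³ bytes is an assumption the source does not settle.

```typescript
// Limits quoted in the list above: 3 GB file size and 10 hours duration.
// Assumption: "GB" is interpreted as 1024^3 bytes.
const MAX_FILE_BYTES = 3 * 1024 * 1024 * 1024;
const MAX_DURATION_SECONDS = 10 * 60 * 60;

// Hypothetical description of a media file; how the values are obtained is out of scope.
interface MediaInfo {
  sizeBytes: number;
  durationSeconds: number;
}

// Returns a list of human-readable problems; an empty list means the file fits.
function checkLimits(info: MediaInfo): string[] {
  const problems: string[] = [];
  if (info.sizeBytes > MAX_FILE_BYTES) {
    problems.push("file is larger than 3 GB");
  }
  if (info.durationSeconds > MAX_DURATION_SECONDS) {
    problems.push("audio is longer than 10 hours");
  }
  return problems;
}
```

Returning a list rather than throwing lets a caller report both problems at once, which is convenient in batch runs.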
Skill Structure
Scripts (scripts/)
- transcribe-audio.sh - Direct API transcription with curl
- setup-vercel-ai.sh - Install and configure @ai-sdk/elevenlabs
- test-stt.sh - Test STT with sample audio files
- validate-audio.sh - Validate audio file format and size
- batch-transcribe.sh - Process multiple audio files
Templates (templates/)
- stt-config.json.template - STT configuration template
- vercel-ai-transcribe.ts.template - Vercel AI SDK TypeScript template
- vercel-ai-transcribe.py.template - Vercel AI SDK Python template
- api-transcribe.ts.template - Direct API TypeScript template
- api-transcribe.py.template - Direct API Python template
- diarization-config.json.template - Speaker diarization configuration
Examples (examples/)
- basic-stt/ - Basic STT with direct API
- vercel-ai-stt/ - Vercel AI SDK integration
- diarization/ - Speaker diarization examples
- multi-language/ - Multi-language transcription
- webhook-integration/ - Async transcription with webhooks
Usage Instructions
1. Setup Vercel AI SDK Integration
bash
# Install dependencies
bash scripts/setup-vercel-ai.sh
# Verify installation
npm list @ai-sdk/elevenlabs
2. Basic Transcription
bash
# Transcribe a single audio file
bash scripts/transcribe-audio.sh path/to/audio.mp3 en
# Validate audio before transcription
bash scripts/validate-audio.sh path/to/audio.mp3
# Batch transcribe multiple files
bash scripts/batch-transcribe.sh path/to/audio/directory en
3. Test STT Implementation
bash
# Run comprehensive tests
bash scripts/test-stt.sh
4. Use Templates
typescript
// Read Vercel AI SDK template
Read: templates/vercel-ai-transcribe.ts.template
// Customize for your use case
// - Set language code
// - Configure diarization
// - Enable audio event tagging
// - Set timestamp granularity
5. Explore Examples
bash
# Basic STT example
Read: examples/basic-stt/README.md
# Vercel AI SDK example
Read: examples/vercel-ai-stt/README.md
# Speaker diarization example
Read: examples/diarization/README.md
Language Support
Excellent Accuracy (≤5% WER)
30 languages including: English, French, German, Spanish, Italian, Japanese, Portuguese, Dutch, Polish, Russian
High Accuracy (>5% to ≤10% WER)
19 languages including: Bengali, Mandarin Chinese, Tamil, Telugu, Vietnamese, Turkish
Good Accuracy (>10% to ≤25% WER)
30 languages including: Arabic, Korean, Thai, Indonesian, Hebrew, Czech
Moderate Accuracy (>25% to ≤50% WER)
19 languages including: Amharic, Khmer, Lao, Burmese, Nepali
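The four tiers are defined by word error rate (WER) thresholds. As a small illustration of where the boundaries fall, the sketch below maps a measured WER percentage to a tier label. The function name and the "unlisted" label for values above 50% are assumptions made for this example.

```typescript
type AccuracyTier = "excellent" | "high" | "good" | "moderate" | "unlisted";

// Maps a WER percentage to the tier names used above.
// Each upper bound is inclusive: 5 is "excellent", anything above 5 up to 10 is "high".
function accuracyTier(werPercent: number): AccuracyTier {
  if (werPercent < 0) {
    throw new RangeError("WER cannot be negative");
  }
  if (werPercent <= 5) return "excellent";
  if (werPercent <= 10) return "high";
  if (werPercent <= 25) return "good";
  if (werPercent <= 50) return "moderate";
  // Assumption: the source lists no tier above 50% WER.
  return "unlisted";
}
```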
Configuration Options
Provider Options (Vercel AI SDK)
- languageCode: ISO 639-1 or ISO 639-3 code (e.g., 'en', 'es', 'ja')
- tagAudioEvents: Enable sound detection (default: true)
- numSpeakers: Max speakers 1-32 (default: auto-detect)
- diarize: Enable speaker identification (default: true)
- timestampsGranularity: 'none' | 'word' | 'character' (default: 'word')
- fileFormat: 'pcm_s16le_16' | 'other' (default: 'other')
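To make the defaults and ranges above concrete, here is a minimal sketch of a helper that fills in the listed defaults and rejects an out-of-range speaker count. The `SttProviderOptions` type and `buildProviderOptions` function are hypothetical names for this example and are not part of @ai-sdk/elevenlabs.

```typescript
type TimestampsGranularity = "none" | "word" | "character";
type FileFormat = "pcm_s16le_16" | "other";

// Mirrors the option names and defaults listed above.
interface SttProviderOptions {
  languageCode?: string;
  tagAudioEvents: boolean;
  numSpeakers?: number;
  diarize: boolean;
  timestampsGranularity: TimestampsGranularity;
  fileFormat: FileFormat;
}

function buildProviderOptions(overrides: Partial<SttProviderOptions>): SttProviderOptions {
  const options: SttProviderOptions = {
    tagAudioEvents: overrides.tagAudioEvents ?? true,
    diarize: overrides.diarize ?? true,
    timestampsGranularity: overrides.timestampsGranularity ?? "word",
    fileFormat: overrides.fileFormat ?? "other",
  };
  // Leaving languageCode unset keeps automatic language detection.
  if (overrides.languageCode !== undefined) {
    options.languageCode = overrides.languageCode;
  }
  // Leaving numSpeakers unset keeps automatic speaker-count detection.
  if (overrides.numSpeakers !== undefined) {
    const n = overrides.numSpeakers;
    if (Math.floor(n) !== n || n < 1 || n > 32) {
      throw new RangeError("numSpeakers must be an integer from 1 to 32");
    }
    options.numSpeakers = n;
  }
  return options;
}
```

With the Vercel AI SDK, an object like this would typically be passed under the provider's key in the call's `providerOptions`. Check the templates in this skill for the exact call shape.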
Best Practices
- Specify language code when known for better performance
- Use pcm_s16le_16 format for lowest latency with uncompressed audio
- Enable diarization for multi-speaker content
- Set numSpeakers for better accuracy when speaker count is known
- Use webhooks for files >8 minutes for async processing
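The practices above can be read as a small decision procedure. The sketch below turns what is known about a recording into suggested settings. `AudioFacts`, `Recommendation`, and `recommendSettings` are hypothetical names, and keeping diarization on when the speaker count is unknown is an assumption based on the stated default.

```typescript
// What the caller knows about a recording; unknown values are simply omitted.
interface AudioFacts {
  language?: string;
  speakerCount?: number;
  durationSeconds: number;
}

interface Recommendation {
  languageCode?: string;
  diarize: boolean;
  numSpeakers?: number;
  useWebhook: boolean;
}

// Files longer than 8 minutes are suggested for asynchronous (webhook) processing.
const WEBHOOK_THRESHOLD_SECONDS = 8 * 60;

function recommendSettings(facts: AudioFacts): Recommendation {
  const rec: Recommendation = {
    // Assumption: diarization stays on unless the audio is known to be single-speaker.
    diarize: facts.speakerCount === undefined || facts.speakerCount > 1,
    useWebhook: facts.durationSeconds > WEBHOOK_THRESHOLD_SECONDS,
  };
  if (facts.language !== undefined) {
    rec.languageCode = facts.language;
  }
  if (facts.speakerCount !== undefined) {
    rec.numSpeakers = facts.speakerCount;
  }
  return rec;
}
```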
Common Patterns
Pattern 1: Simple Transcription
Use direct API or Vercel AI SDK for single-language, single-speaker transcription.
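For the direct API route, the sketch below only describes the request rather than sending it, so it can be inspected without network access. The endpoint URL, the `xi-api-key` header, and the `model_id` and `language_code` field names reflect the public ElevenLabs API as generally documented, but they are assumptions here and should be checked against the current API reference and transcribe-audio.sh. `SttRequest` and `buildSttRequest` are hypothetical names.

```typescript
// Describes a request without sending it, so the shape can be inspected or tested.
interface SttRequest {
  url: string;
  headers: { [name: string]: string };
  fields: { [name: string]: string };
}

// Assumed endpoint; verify against the current API reference.
const STT_ENDPOINT = "https://api.elevenlabs.io/v1/speech-to-text";

function buildSttRequest(apiKey: string, languageCode?: string): SttRequest {
  if (apiKey.length === 0) {
    throw new Error("an API key is required");
  }
  // Assumed form field names for the model and the optional language hint.
  const fields: { [name: string]: string } = { model_id: "scribe_v1" };
  if (languageCode !== undefined) {
    fields["language_code"] = languageCode;
  }
  return { url: STT_ENDPOINT, headers: { "xi-api-key": apiKey }, fields: fields };
}
```

The audio itself would be attached as a multipart file part alongside these fields. Omitting the language code leaves detection to the service.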
Pattern 2: Multi-Speaker Transcription
Enable diarization and set numSpeakers for interviews, meetings, podcasts.
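Once diarization is enabled, the word-level output usually needs to be folded into speaker turns before it reads like a transcript. The sketch below shows one way to do that. The `Word` shape (text, speaker label, start and end times in seconds) is a simplified assumption and may not match the field names of the actual response.

```typescript
// Simplified, assumed shape of one diarized word.
interface Word {
  text: string;
  speakerId: string;
  start: number;
  end: number;
}

// One uninterrupted stretch of speech by a single speaker.
interface Turn {
  speakerId: string;
  text: string;
  start: number;
  end: number;
}

// Merges consecutive words from the same speaker into a single turn.
function groupBySpeaker(words: Word[]): Turn[] {
  const turns: Turn[] = [];
  for (const word of words) {
    const last = turns.length > 0 ? turns[turns.length - 1] : undefined;
    if (last !== undefined && last.speakerId === word.speakerId) {
      last.text += " " + word.text;
      last.end = word.end;
    } else {
      turns.push({ speakerId: word.speakerId, text: word.text, start: word.start, end: word.end });
    }
  }
  return turns;
}
```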
Pattern 3: Multi-Language Support
Detect language automatically or specify when known for content in 99 languages.
Pattern 4: Video Transcription
Extract audio from video formats and transcribe with timestamps for subtitles.
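For the subtitle step, word-level timestamps can be packed into SubRip (SRT) cues. The sketch below groups a fixed number of words per cue. The `TimedWord` shape and the fixed-size grouping are assumptions for illustration, and a real subtitle pipeline would likely also split on pauses and line length.

```typescript
// Assumed minimal word shape: text plus start and end times in seconds.
interface TimedWord {
  text: string;
  start: number;
  end: number;
}

// Left-pads a non-negative integer with zeros.
function pad(value: number, width: number): string {
  let s = String(value);
  while (s.length < width) {
    s = "0" + s;
  }
  return s;
}

// Formats seconds as an SRT timestamp (HH:MM:SS,mmm).
function srtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "," + pad(ms, 3);
}

// Packs words into numbered cues of at most `wordsPerCue` words each.
function wordsToSrt(words: TimedWord[], wordsPerCue: number): string {
  if (wordsPerCue < 1) {
    throw new RangeError("wordsPerCue must be at least 1");
  }
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const chunk = words.slice(i, i + wordsPerCue);
    const first = chunk[0];
    const last = chunk[chunk.length - 1];
    if (first === undefined || last === undefined) continue;
    const text = chunk.map((w) => w.text).join(" ");
    cues.push(String(cues.length + 1) + "\n" + srtTime(first.start) + " --> " + srtTime(last.end) + "\n" + text);
  }
  return cues.length === 0 ? "" : cues.join("\n\n") + "\n";
}
```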
Pattern 5: Webhook Integration
Process long files asynchronously using webhook callbacks for results.
Integration with Other ElevenLabs Skills
- tts-integration: Combine STT → processing → TTS for voice translation workflows
- voice-cloning: Transcribe existing voice samples before cloning
- dubbing: Use STT as first step in dubbing pipeline
Troubleshooting
Audio Format Issues
bash
# Validate audio format
bash scripts/validate-audio.sh your-audio.mp3
Language Detection Problems
- Specify languageCode explicitly instead of auto-detection
- Ensure audio quality is sufficient for chosen language
Diarization Not Working
- Verify numSpeakers is set correctly (1-32)
- Check that diarize: true is configured
- Ensure audio has clear speaker separation
File Size/Duration Limits
- Max 3 GB file size
- Max 10 hours duration
- Files >8 minutes are chunked automatically
Script Reference
All scripts are located in skills/stt-integration/scripts/:
- transcribe-audio.sh - Main transcription script with curl
- setup-vercel-ai.sh - Install @ai-sdk/elevenlabs package
- test-stt.sh - Comprehensive test suite
- validate-audio.sh - Audio format and size validation
- batch-transcribe.sh - Batch processing for multiple files
Template Reference
All templates are located in skills/stt-integration/templates/:
- stt-config.json.template - JSON configuration
- vercel-ai-transcribe.ts.template - TypeScript with Vercel AI SDK
- vercel-ai-transcribe.py.template - Python with Vercel AI SDK
- api-transcribe.ts.template - TypeScript with direct API
- api-transcribe.py.template - Python with direct API
- diarization-config.json.template - Diarization settings
Example Reference
All examples are located in skills/stt-integration/examples/:
- basic-stt/ - Basic transcription workflow
- vercel-ai-stt/ - Vercel AI SDK integration
- diarization/ - Speaker identification
- multi-language/ - Multi-language support
- webhook-integration/ - Async processing
Skill Location: plugins/elevenlabs/skills/stt-integration/
Version: 1.0.0
Last Updated: 2025-10-29