stt-integration

This skill provides comprehensive guidance for implementing ElevenLabs Speech-to-Text (STT) capabilities using the Scribe v1 model, which supports 99 languages with state-of-the-art accuracy, speaker diarization for up to 32 speakers, and seamless Vercel AI SDK integration.

Core Capabilities


Scribe v1 Model Features


  • Multi-language support: 99 languages with varying accuracy levels
  • Speaker diarization: Up to 32 speakers with identification
  • Word-level timestamps: Precise synchronization for video/audio alignment
  • Audio event detection: Identifies sounds like laughter and applause
  • High accuracy: Optimized for accuracy over real-time processing

Supported Formats


  • Audio: AAC, AIFF, OGG, MP3, Opus, WAV, WebM, FLAC, M4A
  • Video: MP4, AVI, Matroska, QuickTime, WMV, FLV, WebM, MPEG, 3GPP
  • Limits: Max 3 GB file size, 10 hours duration
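The format and size limits above can be checked client-side before uploading. A minimal sketch of that check in Python (duration cannot be verified from the file size alone, so it is omitted here; `validate_media_file` is a hypothetical helper, not part of the skill's scripts):

```python
import os

# Formats and limits from the tables above
AUDIO_EXTS = {".aac", ".aiff", ".ogg", ".mp3", ".opus", ".wav", ".webm", ".flac", ".m4a"}
VIDEO_EXTS = {".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".mpeg", ".3gp"}
MAX_BYTES = 3 * 1024**3  # 3 GB

def validate_media_file(path, size_bytes=None):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in AUDIO_EXTS | VIDEO_EXTS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    size = size_bytes if size_bytes is not None else os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file too large: {size} bytes (max {MAX_BYTES})")
    return problems
```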

Skill Structure


Scripts (scripts/)


  1. transcribe-audio.sh - Direct API transcription with curl
  2. setup-vercel-ai.sh - Install and configure @ai-sdk/elevenlabs
  3. test-stt.sh - Test STT with sample audio files
  4. validate-audio.sh - Validate audio file format and size
  5. batch-transcribe.sh - Process multiple audio files

Templates (templates/)


  1. stt-config.json.template - STT configuration template
  2. vercel-ai-transcribe.ts.template - Vercel AI SDK TypeScript template
  3. vercel-ai-transcribe.py.template - Vercel AI SDK Python template
  4. api-transcribe.ts.template - Direct API TypeScript template
  5. api-transcribe.py.template - Direct API Python template
  6. diarization-config.json.template - Speaker diarization configuration

Examples (examples/)


  1. basic-stt/ - Basic STT with direct API
  2. vercel-ai-stt/ - Vercel AI SDK integration
  3. diarization/ - Speaker diarization examples
  4. multi-language/ - Multi-language transcription
  5. webhook-integration/ - Async transcription with webhooks

Usage Instructions


1. Set Up Vercel AI SDK Integration

```bash
# Install dependencies
bash scripts/setup-vercel-ai.sh

# Verify installation
npm list @ai-sdk/elevenlabs
```

2. Basic Transcription

```bash
# Transcribe a single audio file
bash scripts/transcribe-audio.sh path/to/audio.mp3 en

# Validate audio before transcription
bash scripts/validate-audio.sh path/to/audio.mp3

# Batch transcribe multiple files
bash scripts/batch-transcribe.sh path/to/audio/directory en
```
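For reference, the direct API call that transcribe-audio.sh wraps can be sketched in Python. The endpoint and the scribe_v1 model id follow the ElevenLabs speech-to-text API; treat the exact form-field names (language_code, diarize, num_speakers) as assumptions to verify against the current API reference, and note that requests is a third-party dependency:

```python
API_URL = "https://api.elevenlabs.io/v1/speech-to-text"

def build_form_fields(language_code=None, diarize=True, num_speakers=None):
    """Assemble the multipart form fields; only include options that are set."""
    fields = {"model_id": "scribe_v1"}
    if language_code:
        fields["language_code"] = language_code
    if diarize:
        fields["diarize"] = "true"
    if num_speakers is not None:
        fields["num_speakers"] = str(num_speakers)
    return fields

def transcribe(path, api_key, language_code=None):
    """Upload one file and return the transcript text (field names assumed)."""
    import requests  # third-party: pip install requests

    with open(path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"xi-api-key": api_key},
            data=build_form_fields(language_code),
            files={"file": f},
        )
    resp.raise_for_status()
    return resp.json()["text"]
```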

3. Test STT Implementation

```bash
# Run comprehensive tests
bash scripts/test-stt.sh
```

4. Use Templates

```typescript
// Read Vercel AI SDK template
Read: templates/vercel-ai-transcribe.ts.template

// Customize for your use case
// - Set language code
// - Configure diarization
// - Enable audio event tagging
// - Set timestamp granularity
```

5. Explore Examples

```bash
# Basic STT example
Read: examples/basic-stt/README.md

# Vercel AI SDK example
Read: examples/vercel-ai-stt/README.md

# Speaker diarization example
Read: examples/diarization/README.md
```

Language Support


Excellent Accuracy (≤5% WER)


30 languages including: English, French, German, Spanish, Italian, Japanese, Portuguese, Dutch, Polish, Russian

High Accuracy (>5-10% WER)


19 languages including: Bengali, Mandarin Chinese, Tamil, Telugu, Vietnamese, Turkish

Good Accuracy (>10-25% WER)


30 languages including: Arabic, Korean, Thai, Indonesian, Hebrew, Czech

Moderate Accuracy (>25-50% WER)


19 languages including: Amharic, Khmer, Lao, Burmese, Nepali

Configuration Options


Provider Options (Vercel AI SDK)


  • languageCode: ISO-639-1/3 code (e.g., 'en', 'es', 'ja')
  • tagAudioEvents: Enable sound detection (default: true)
  • numSpeakers: Max speakers 1-32 (default: auto-detect)
  • diarize: Enable speaker identification (default: true)
  • timestampsGranularity: 'none' | 'word' | 'character' (default: 'word')
  • fileFormat: 'pcm_s16le_16' | 'other' (default: 'other')
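The option list above can be captured in a small builder that enforces the documented ranges and enums before the request is sent. A sketch in Python (`elevenlabs_provider_options` is an illustrative helper, not part of the SDK):

```python
VALID_GRANULARITIES = {"none", "word", "character"}
VALID_FILE_FORMATS = {"pcm_s16le_16", "other"}

def elevenlabs_provider_options(
    language_code=None,
    tag_audio_events=True,
    num_speakers=None,
    diarize=True,
    timestamps_granularity="word",
    file_format="other",
):
    """Build and sanity-check the provider options described above."""
    if num_speakers is not None and not (1 <= num_speakers <= 32):
        raise ValueError("numSpeakers must be between 1 and 32")
    if timestamps_granularity not in VALID_GRANULARITIES:
        raise ValueError(f"invalid timestampsGranularity: {timestamps_granularity}")
    if file_format not in VALID_FILE_FORMATS:
        raise ValueError(f"invalid fileFormat: {file_format}")
    options = {
        "tagAudioEvents": tag_audio_events,
        "diarize": diarize,
        "timestampsGranularity": timestamps_granularity,
        "fileFormat": file_format,
    }
    if language_code:
        options["languageCode"] = language_code
    if num_speakers is not None:
        options["numSpeakers"] = num_speakers
    return options
```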

Best Practices


  1. Specify language code when known for better performance
  2. Use pcm_s16le_16 format for lowest latency with uncompressed audio
  3. Enable diarization for multi-speaker content
  4. Set numSpeakers for better accuracy when speaker count is known
  5. Use webhooks for files >8 minutes for async processing

Common Patterns


Pattern 1: Simple Transcription


Use the direct API or the Vercel AI SDK for single-language, single-speaker transcription.

Pattern 2: Multi-Speaker Transcription


Enable diarization and set numSpeakers for interviews, meetings, and podcasts.
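With diarization enabled, the transcript's word entries carry a speaker label, and consecutive words from the same speaker can be merged into readable utterances. A sketch under the assumption that each word entry has `text` and `speaker_id` fields (verify the exact shape against the API response):

```python
def group_by_speaker(words):
    """Merge consecutive words from the same speaker into utterances."""
    utterances = []
    for w in words:
        if utterances and utterances[-1]["speaker"] == w["speaker_id"]:
            utterances[-1]["text"] += " " + w["text"]
        else:
            utterances.append({"speaker": w["speaker_id"], "text": w["text"]})
    return utterances
```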

Pattern 3: Multi-Language Support


Detect the language automatically, or specify it when known, for content in any of the 99 supported languages.

Pattern 4: Video Transcription


Extract audio from video formats and transcribe with timestamps to generate subtitles.
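Word-level timestamps map directly onto subtitle cues. A minimal sketch that chunks word entries into SRT cues, assuming each entry has `text`, `start`, and `end` in seconds (the fixed words-per-cue split is a simplification; real subtitle tools also break on pauses and line length):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=7):
    """Group word-level timestamps into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```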

Pattern 5: Webhook Integration


Process long files asynchronously, receiving results via webhook callbacks.
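The receiving end of that callback can be sketched as a small parser. The payload fields used here (`request_id`, `text`, `words`) are illustrative assumptions; check the ElevenLabs webhook documentation for the exact schema, and verify the webhook signature before trusting the body:

```python
import json

def handle_transcription_webhook(raw_body):
    """Parse a completed-transcription callback into the fields we care about.

    Field names are assumed for illustration; verify against the actual
    webhook schema, and authenticate the request before parsing.
    """
    payload = json.loads(raw_body)
    return {
        "request_id": payload.get("request_id"),
        "text": payload.get("text", ""),
        "word_count": len(payload.get("words", [])),
    }
```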

Integration with Other ElevenLabs Skills


  • tts-integration: Combine STT → processing → TTS for voice translation workflows
  • voice-cloning: Transcribe existing voice samples before cloning
  • dubbing: Use STT as first step in dubbing pipeline

Troubleshooting


Audio Format Issues

```bash
# Validate audio format
bash scripts/validate-audio.sh your-audio.mp3
```

Language Detection Problems


  • Specify languageCode explicitly instead of auto-detection
  • Ensure audio quality is sufficient for chosen language

Diarization Not Working


  • Verify numSpeakers is set correctly (1-32)
  • Check that diarize: true is configured
  • Ensure audio has clear speaker separation

File Size/Duration Limits


  • Max 3 GB file size
  • Max 10 hours duration
  • Files >8 minutes are chunked automatically

Script Reference

All scripts are located in skills/stt-integration/scripts/:
  1. transcribe-audio.sh - Main transcription script with curl
  2. setup-vercel-ai.sh - Install @ai-sdk/elevenlabs package
  3. test-stt.sh - Comprehensive test suite
  4. validate-audio.sh - Audio format and size validation
  5. batch-transcribe.sh - Batch processing for multiple files

Template Reference

All templates are located in skills/stt-integration/templates/:
  1. stt-config.json.template - JSON configuration
  2. vercel-ai-transcribe.ts.template - TypeScript with Vercel AI SDK
  3. vercel-ai-transcribe.py.template - Python with Vercel AI SDK
  4. api-transcribe.ts.template - TypeScript with direct API
  5. api-transcribe.py.template - Python with direct API
  6. diarization-config.json.template - Diarization settings

Example Reference

All examples are located in skills/stt-integration/examples/:
  1. basic-stt/ - Basic transcription workflow
  2. vercel-ai-stt/ - Vercel AI SDK integration
  3. diarization/ - Speaker identification
  4. multi-language/ - Multi-language support
  5. webhook-integration/ - Async processing

Skill Location: plugins/elevenlabs/skills/stt-integration/
Version: 1.0.0
Last Updated: 2025-10-29