Whisper - Robust Speech Recognition

OpenAI's multilingual speech recognition model.

When to use Whisper

Use when:
  • Speech-to-text transcription (99 languages)
  • Podcast/video transcription
  • Meeting notes automation
  • Translation to English
  • Noisy audio transcription
  • Multilingual audio processing
Metrics:
  • 72,900+ GitHub stars
  • 99 languages supported
  • Trained on 680,000 hours of audio
  • MIT License
Use alternatives instead:
  • AssemblyAI: Managed API, speaker diarization
  • Deepgram: Real-time streaming ASR
  • Google Speech-to-Text: Cloud-based

Quick start

Installation

```bash
# Requires Python 3.8-3.11
pip install -U openai-whisper

# Requires ffmpeg
# macOS:   brew install ffmpeg
# Ubuntu:  sudo apt install ffmpeg
# Windows: choco install ffmpeg
```

Basic transcription

```python
import whisper

# Load model
model = whisper.load_model("base")

# Transcribe
result = model.transcribe("audio.mp3")

# Print text
print(result["text"])

# Access segments
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
```

Model sizes

```python
# Available models
models = ["tiny", "base", "small", "medium", "large", "turbo"]

# Load a specific model
model = whisper.load_model("turbo")  # Fast, good quality
```

| Model | Parameters | English-only | Multilingual | Speed | VRAM |
|-------|------------|--------------|--------------|-------|------|
| tiny | 39M | ✓ | ✓ | ~32x | ~1 GB |
| base | 74M | ✓ | ✓ | ~16x | ~1 GB |
| small | 244M | ✓ | ✓ | ~6x | ~2 GB |
| medium | 769M | ✓ | ✓ | ~2x | ~5 GB |
| large | 1550M | ✗ | ✓ | 1x | ~10 GB |
| turbo | 809M | ✗ | ✓ | ~8x | ~6 GB |

**Recommendation**: Use `turbo` for the best speed/quality trade-off, `base` for prototyping.
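The VRAM column determines which models your hardware can run. A minimal sketch of choosing the largest model that fits, based on the approximate figures in the table (`pick_model` is an illustrative helper, not part of the whisper API):

```python
# Approximate VRAM requirements from the table above (GB)
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "turbo": 6, "large": 10}

def pick_model(available_gb: float) -> str:
    """Return the largest model that fits in the available VRAM."""
    fitting = [m for m, need in VRAM_GB.items() if need <= available_gb]
    # Fall back to tiny if nothing fits
    return max(fitting, key=VRAM_GB.get) if fitting else "tiny"

print(pick_model(8))   # an 8 GB GPU fits turbo but not large
print(pick_model(12))  # 12 GB fits large
```

Pair this with `whisper.load_model(pick_model(gb))` after querying available memory, e.g. via `torch.cuda.get_device_properties(0).total_memory`.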

Transcription options

Language specification

```python
# Auto-detect language
result = model.transcribe("audio.mp3")

# Specify language (faster)
result = model.transcribe("audio.mp3", language="en")
```

Supported: en, es, fr, de, it, pt, ru, ja, ko, zh, and 89 more

Task selection

```python
# Transcription (default)
result = model.transcribe("audio.mp3", task="transcribe")

# Translation to English
result = model.transcribe("spanish.mp3", task="translate")
```

Input: Spanish audio → Output: English text

Initial prompt

```python
# Improve accuracy with context
result = model.transcribe(
    "audio.mp3",
    initial_prompt="This is a technical podcast about machine learning and AI."
)
```

Helps with:

- Technical terms
- Proper nouns
- Domain-specific vocabulary

Timestamps

```python
# Word-level timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")
```

Temperature fallback

```python
# Retry with different temperatures if confidence is low
result = model.transcribe(
    "audio.mp3",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)
```

Command line usage

```bash
# Basic transcription
whisper audio.mp3

# Specify model
whisper audio.mp3 --model turbo

# Output formats
whisper audio.mp3 --output_format txt   # Plain text
whisper audio.mp3 --output_format srt   # Subtitles
whisper audio.mp3 --output_format vtt   # WebVTT
whisper audio.mp3 --output_format json  # JSON with timestamps

# Language
whisper audio.mp3 --language Spanish

# Translation
whisper spanish.mp3 --task translate
```

Batch processing

```python
import whisper

model = whisper.load_model("base")
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]

for audio_file in audio_files:
    print(f"Transcribing {audio_file}...")
    result = model.transcribe(audio_file)

    # Save to file
    output_file = audio_file.replace(".mp3", ".txt")
    with open(output_file, "w") as f:
        f.write(result["text"])
```

Real-time transcription

```python
# For streaming-style output, use faster-whisper:
#   pip install faster-whisper

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cuda", compute_type="float16")

# Transcribe with streaming (segments is a generator)
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

GPU acceleration

```python
import whisper

# Automatically uses GPU if available
model = whisper.load_model("turbo")

# Force CPU
model = whisper.load_model("turbo", device="cpu")

# Force GPU
model = whisper.load_model("turbo", device="cuda")
```

10-20× faster on GPU

Integration with other tools

Subtitle generation

```bash
# Generate SRT subtitles
whisper video.mp4 --output_format srt --language English

# Output: video.srt
```
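If you need custom subtitle post-processing rather than the built-in `--output_format srt`, the segment list from `model.transcribe()` already contains everything an SRT file needs. A minimal sketch (the mock segments stand in for real transcription output):

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT HH:MM:SS,mmm timestamp."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Build SRT text from transcribe()-style segment dicts."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_timestamp(seg['start'])} --> "
            f"{to_srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Mock segments shaped like model.transcribe() output
segments = [{"start": 0.0, "end": 2.5, "text": " Hello world."}]
print(segments_to_srt(segments))
```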

With LangChain

```python
from langchain.document_loaders import WhisperTranscriptionLoader

loader = WhisperTranscriptionLoader(file_path="audio.mp3")
docs = loader.load()

# Use transcription in RAG
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
```

Extract audio from video

```bash
# Use ffmpeg to extract audio
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav

# Then transcribe
whisper audio.wav
```

Best practices

  1. Use turbo model - Best speed/quality for English
  2. Specify language - Faster than auto-detect
  3. Add initial prompt - Improves technical terms
  4. Use GPU - 10-20× faster
  5. Batch process - More efficient
  6. Convert to WAV - Better compatibility
  7. Split long audio - <30 min chunks
  8. Check language support - Quality varies by language
  9. Use faster-whisper - 4× faster than openai-whisper
  10. Monitor VRAM - Scale model size to hardware
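Practice 7 (splitting long audio) can be scripted with ffmpeg. A sketch that only builds the commands so the chunking logic stays visible (file names and the 25-minute chunk length are illustrative; run each command with `subprocess.run` to actually split):

```python
CHUNK_SECONDS = 25 * 60  # stay safely under the 30-minute guideline

def chunk_commands(path, total_seconds, chunk=CHUNK_SECONDS):
    """Build one ffmpeg command per chunk, using stream copy (no re-encode)."""
    stem = path.rsplit(".", 1)[0]
    cmds, start, index = [], 0.0, 0
    while start < total_seconds:
        out = f"{stem}_part{index:03d}.mp3"
        cmds.append(["ffmpeg", "-i", path, "-ss", str(start),
                     "-t", str(chunk), "-c", "copy", out])
        start += chunk
        index += 1
    return cmds

# A 90-minute podcast becomes four chunks
for cmd in chunk_commands("podcast.mp3", total_seconds=90 * 60):
    print(" ".join(cmd))
```

Each chunk can then be passed to `model.transcribe()` and the texts concatenated in order.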

Performance

| Model | Real-time factor (CPU) | Real-time factor (GPU) |
|-------|------------------------|------------------------|
| tiny | ~0.32 | ~0.01 |
| base | ~0.16 | ~0.01 |
| turbo | ~0.08 | ~0.01 |
| large | ~1.0 | ~0.05 |

Real-time factor: 0.1 = 10× faster than real-time
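The real-time factor converts directly into wall-clock estimates. A small worked example using the approximate table values:

```python
def estimated_seconds(audio_seconds, rtf):
    """Transcription time ≈ audio duration × real-time factor."""
    return audio_seconds * rtf

# A 1-hour recording with turbo on CPU (RTF ~0.08)
print(round(estimated_seconds(3600, 0.08)))  # about 288 s, i.e. under 5 minutes
```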

Language support

Top-supported languages:
  • English (en)
  • Spanish (es)
  • French (fr)
  • German (de)
  • Italian (it)
  • Portuguese (pt)
  • Russian (ru)
  • Japanese (ja)
  • Korean (ko)
  • Chinese (zh)
Full list: 99 languages total

Limitations

  1. Hallucinations - May repeat or invent text
  2. Long-form accuracy - Degrades on >30 min audio
  3. Speaker identification - No diarization
  4. Accents - Quality varies
  5. Background noise - Can affect accuracy
  6. Real-time latency - Not suitable for live captioning
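Hallucinations (limitation 1) can often be caught after the fact: each segment returned by `model.transcribe()` carries `avg_logprob` and `no_speech_prob` confidence fields. A minimal filtering sketch (thresholds are illustrative, not official recommendations; the mock segments stand in for real output):

```python
def keep_segment(segment, min_logprob=-1.0, max_no_speech=0.6):
    """Drop segments that are low-confidence or probably silence."""
    return (segment["avg_logprob"] >= min_logprob
            and segment["no_speech_prob"] <= max_no_speech)

# Mock segments shaped like model.transcribe() output
segments = [
    {"text": "Hello world.", "avg_logprob": -0.25, "no_speech_prob": 0.02},
    {"text": "Thanks for watching!", "avg_logprob": -1.8, "no_speech_prob": 0.9},
]
clean = [s["text"] for s in segments if keep_segment(s)]
print(clean)  # only the confident segment survives
```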

Resources
