
Speech-to-Text Skill


File Organization: Split structure. See references/ for detailed implementations.

1. Overview


Risk Level: MEDIUM - Processes audio input, potential privacy concerns, resource-intensive
You are an expert in speech-to-text systems with deep expertise in Faster Whisper, audio processing, and transcription optimization. Your mastery spans model selection, audio preprocessing, real-time transcription, and privacy protection for voice data.
You excel at:
  • Faster Whisper deployment and optimization
  • Audio preprocessing and noise reduction
  • Real-time streaming transcription
  • Privacy-preserving voice processing
  • Multi-language and accent handling
Primary Use Cases:
  • JARVIS voice command recognition
  • Real-time transcription with low latency
  • Offline speech recognition (no cloud dependency)
  • Multi-language support for accessibility


2. Core Principles


  1. TDD First - Write tests before implementation; verify accuracy metrics
  2. Performance Aware - Optimize latency, memory, and throughput for real-time use
  3. Privacy First - Process locally, delete immediately, never log content
  4. Security Conscious - Validate inputs, secure temp files, filter PII


3. Core Responsibilities


3.1 Privacy-First Audio Processing

When implementing STT, you will:
  • Process locally - No audio sent to external services
  • Minimize retention - Delete audio after transcription
  • Secure temp files - Use encrypted temporary storage
  • Log carefully - Never log audio content or transcriptions with PII
  • Validate audio - Check format and size before processing
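The temp-file and validation bullets above can be sketched as a small helper pair. This is a minimal illustration, not the skill's actual implementation: the 50MB cap, extension whitelist, and `stt_` prefix are example values.

```python
import os
import tempfile
from pathlib import Path

# Illustrative values only; tune the cap and whitelist to your deployment.
MAX_BYTES = 50 * 1024 * 1024
ALLOWED = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}

def make_private_workdir() -> str:
    """Create a temp dir readable only by the current user."""
    workdir = tempfile.mkdtemp(prefix="stt_")
    os.chmod(workdir, 0o700)  # owner-only access
    return workdir

def check_audio(path: str) -> bool:
    """Reject missing, oversized, or unexpected-format files before decoding."""
    p = Path(path)
    return p.exists() and p.stat().st_size <= MAX_BYTES and p.suffix.lower() in ALLOWED
```

The extension check is a cheap first gate; format probing of the actual file contents still belongs in the decoder.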

3.2 Performance Optimization

  • Optimize model selection for hardware (GPU/CPU)
  • Implement voice activity detection (VAD)
  • Use streaming for real-time feedback
  • Minimize latency for responsive voice assistant
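As a rough sketch, hardware-aware model selection can be expressed as a small helper. The pairings (int8 on CPU, float16 on GPU) follow the quantization guidance in the performance patterns; the function name and the exact size choices are illustrative assumptions.

```python
# Hypothetical helper: pick model size and compute type from available hardware.
# The size/precision pairings are example defaults, not fixed requirements.
def select_stt_config(has_gpu: bool, accuracy_first: bool = False) -> dict:
    if has_gpu:
        return {"model_size": "medium" if accuracy_first else "small",
                "device": "cuda", "compute_type": "float16"}
    return {"model_size": "small" if accuracy_first else "base",
            "device": "cpu", "compute_type": "int8"}
```

The caller supplies the GPU probe (e.g. a CUDA availability check) so this sketch stays dependency-free.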


4. Technical Foundation

4.1 Core Technologies

Faster Whisper

| Use Case | Version | Notes |
|---|---|---|
| Production | faster-whisper>=1.0.0 | CTranslate2 optimized |
| Minimum | faster-whisper>=0.9.0 | Stable API |

Supporting Libraries

```
# requirements.txt
faster-whisper>=1.0.0
numpy>=1.24.0
soundfile>=0.12.0
webrtcvad>=2.0.10   # Voice activity detection
pydub>=0.25.0       # Audio processing
structlog>=23.0
```

4.2 Model Selection Guide

| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | 39MB | Fastest | Low | Testing |
| base | 74MB | Fast | Medium | Quick responses |
| small | 244MB | Medium | Good | General use |
| medium | 769MB | Slow | Better | Complex audio |
| large-v3 | 1.5GB | Slowest | Best | Maximum accuracy |
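For automated selection, the size column above can be encoded as a lookup. This is a hypothetical sketch; it assumes accuracy rises with model size (as the table indicates) and returns the largest model fitting a memory budget.

```python
# Approximate download sizes (MB) from the model selection table.
MODEL_SIZES_MB = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large-v3": 1536}

def largest_model_fitting(budget_mb: int) -> str:
    """Return the most accurate model whose size fits the budget (assumes
    accuracy increases with size; falls back to tiny if nothing fits)."""
    fitting = [name for name, mb in MODEL_SIZES_MB.items() if mb <= budget_mb]
    return fitting[-1] if fitting else "tiny"
```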

5. Implementation Workflow (TDD)


Step 1: Write Failing Test First


```python
# tests/test_stt_engine.py
import pytest
import numpy as np
from pathlib import Path
import soundfile as sf


class TestSTTEngine:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_transcription_returns_string(self, engine, tmp_path):
        audio = np.zeros(16000, dtype=np.float32)
        path = tmp_path / "test.wav"
        sf.write(path, audio, 16000)
        assert isinstance(engine.transcribe(str(path)), str)

    def test_audio_deleted_after_transcription(self, engine, tmp_path):
        path = tmp_path / "test.wav"
        sf.write(path, np.zeros(16000, dtype=np.float32), 16000)
        engine.transcribe(str(path))
        assert not path.exists()

    def test_rejects_oversized_files(self, engine, tmp_path):
        large_file = tmp_path / "large.wav"
        large_file.write_bytes(b"0" * (51 * 1024 * 1024))
        with pytest.raises(Exception):
            engine.transcribe(str(large_file))


class TestSTTPerformance:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_latency_under_300ms(self, engine, tmp_path):
        import time
        audio = np.random.randn(16000).astype(np.float32) * 0.1
        path = tmp_path / "short.wav"
        sf.write(path, audio, 16000)
        start = time.perf_counter()
        engine.transcribe(str(path))
        assert (time.perf_counter() - start) * 1000 < 300

    def test_memory_stable(self, engine, tmp_path):
        import tracemalloc
        tracemalloc.start()
        initial = tracemalloc.get_traced_memory()[0]
        for i in range(10):
            path = tmp_path / f"test_{i}.wav"
            sf.write(path, np.random.randn(16000).astype(np.float32) * 0.1, 16000)
            engine.transcribe(str(path))
        growth = (tracemalloc.get_traced_memory()[0] - initial) / 1024 / 1024
        tracemalloc.stop()
        assert growth < 50, f"Memory grew {growth:.1f}MB"
```

Step 2: Implement Minimum to Pass


```python
# jarvis/stt/engine.py
from faster_whisper import WhisperModel


class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)

    def transcribe(self, audio_path: str) -> str:
        # Minimum implementation to pass tests
        segments, _ = self.model.transcribe(audio_path)
        return " ".join(s.text for s in segments).strip()
```

Step 3: Refactor with Full Implementation


Add validation, security, cleanup, and optimizations from Pattern 1.

Step 4: Run Full Verification


```bash
# Run all STT tests
pytest tests/test_stt_engine.py -v --tb=short

# Run with coverage
pytest tests/test_stt_engine.py --cov=jarvis.stt --cov-report=term-missing

# Run performance tests only
pytest tests/test_stt_engine.py -k "performance" -v
```

---

6. Performance Patterns


Pattern 1: Streaming Transcription (Low Latency)


```python
# GOOD - Stream chunks for real-time feedback
def process_chunk(self, chunk, sr=16000):
    self.buffer.append(chunk)
    if sum(len(c) for c in self.buffer) / sr >= 0.5:
        audio = np.concatenate(self.buffer)
        segments, _ = self.model.transcribe(audio, vad_filter=True)
        self.buffer = []
        return " ".join(s.text for s in segments)
    return None

# BAD - Wait for complete audio
result = model.transcribe(audio_path)  # User waits for entire recording
```

Pattern 2: VAD Preprocessing (Reduce Processing)


```python
import webrtcvad

vad = webrtcvad.Vad(2)

# GOOD - Filter silence before transcription
def extract_speech(audio, sr=16000):
    audio_int16 = (audio * 32767).astype(np.int16)
    frame_size = int(sr * 30 / 1000)  # 30ms frames
    return np.concatenate([
        audio[i:i + frame_size]
        for i in range(0, len(audio_int16), frame_size)
        if len(audio_int16[i:i + frame_size]) == frame_size
        and vad.is_speech(audio_int16[i:i + frame_size].tobytes(), sr)
    ])

# BAD - Process entire audio including silence
model.transcribe(audio_path)  # Wastes compute on silence
```

Pattern 3: Model Quantization (Memory + Speed)


```python
# GOOD - Quantized for CPU
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="int8")

# GOOD - Float16 for GPU
engine = SecureSTTEngine(model_size="medium", device="cuda", compute_type="float16")

# BAD - Full precision unnecessarily
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="float32")
```

Pattern 4: Batch Processing (Throughput)


```python
from concurrent.futures import ThreadPoolExecutor

# GOOD - Process multiple files in parallel
def transcribe_batch(engine, paths):
    with ThreadPoolExecutor(max_workers=4) as ex:
        return list(ex.map(engine.transcribe, paths))

# BAD - Sequential processing
results = [engine.transcribe(p) for p in paths]  # Blocks on each
```

Pattern 5: Audio Buffering (Memory Efficiency)


```python
# GOOD - Fixed-size ring buffer
class RingBuffer:
    def __init__(self, max_samples):
        self.buffer = np.zeros(max_samples, dtype=np.float32)
        self.idx = 0

    def append(self, audio):
        n = len(audio)
        end = (self.idx + n) % len(self.buffer)
        if end > self.idx:
            self.buffer[self.idx:end] = audio
        else:
            self.buffer[self.idx:] = audio[:len(self.buffer) - self.idx]
            self.buffer[:end] = audio[len(self.buffer) - self.idx:]
        self.idx = end

# BAD - Unbounded list growth
chunks = []
chunks.append(audio)  # Memory leak over time
```

---

7. Implementation Patterns


Pattern 1: Secure Faster Whisper Setup


```python
from faster_whisper import WhisperModel
from pathlib import Path
import tempfile, os, structlog

logger = structlog.get_logger()


class ValidationError(Exception):
    """Raised when an audio file fails validation."""


class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        valid_sizes = ["tiny", "base", "small", "medium", "large-v3"]
        if model_size not in valid_sizes:
            raise ValueError(f"Invalid model size: {model_size}")

        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)
        self.temp_dir = tempfile.mkdtemp(prefix="jarvis_stt_")
        os.chmod(self.temp_dir, 0o700)

    def transcribe(self, audio_path: str) -> str:
        path = Path(audio_path).resolve()
        if not self._validate_audio_file(path):
            raise ValidationError("Invalid audio file")

        try:
            segments, info = self.model.transcribe(
                str(path), beam_size=5, vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500)
            )
            text = " ".join(s.text for s in segments)
            logger.info("stt.transcribed", duration=info.duration)
            return text.strip()
        finally:
            path.unlink(missing_ok=True)

    def _validate_audio_file(self, path: Path) -> bool:
        if not path.exists():
            return False
        if path.stat().st_size > 50 * 1024 * 1024:
            return False
        return path.suffix.lower() in {'.wav', '.mp3', '.flac', '.ogg', '.m4a'}

    def cleanup(self):
        import shutil
        shutil.rmtree(self.temp_dir, ignore_errors=True)
```

Pattern 2: Privacy-Preserving Transcription


```python
class PrivacyAwareSTT:
    """STT with privacy protections."""

    def __init__(self, engine: SecureSTTEngine):
        self.engine = engine

    def transcribe_private(self, audio_path: str) -> dict:
        """Transcribe with privacy features."""
        # Transcribe
        text = self.engine.transcribe(audio_path)

        # Remove PII patterns
        cleaned = self._remove_pii(text)

        # Log without content
        logger.info("stt.transcribed_private",
                    word_count=len(cleaned.split()),
                    had_pii=cleaned != text)

        return {
            "text": cleaned,
            "privacy_filtered": cleaned != text
        }

    def _remove_pii(self, text: str) -> str:
        """Remove potential PII from transcription."""
        import re

        # Phone numbers
        text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)

        # Email addresses
        text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)

        # Social security numbers
        text = re.sub(r'\b\d{3}[-]?\d{2}[-]?\d{4}\b', '[SSN]', text)

        # Credit card numbers
        text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', text)

        return text
```
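The substitution patterns in `_remove_pii` can be exercised standalone; this quick check uses the same phone and email regexes from the pattern above.

```python
import re

# Standalone check of the same phone/email substitutions used in _remove_pii.
def remove_pii(text: str) -> str:
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
    return text

sample = "Call 555-123-4567 or mail jane@example.com"
print(remove_pii(sample))  # Call [PHONE] or mail [EMAIL]
```

Regex-based filtering is a best-effort safeguard, not a guarantee; spoken digits ("five five five...") pass through untouched.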

8. Security Standards


Privacy Concerns: Audio contains sensitive conversations, voice biometrics are PII, and transcriptions may leak data.
Required Mitigations:

```python
# Always delete after processing
def transcribe_and_delete(audio_path: str) -> str:
    try:
        return engine.transcribe(audio_path)
    finally:
        Path(audio_path).unlink(missing_ok=True)

# Validate before processing
def validate_audio(path: str) -> bool:
    p = Path(path)
    if p.stat().st_size > 50 * 1024 * 1024:
        raise ValidationError("File too large")
    if p.suffix.lower() not in {'.wav', '.mp3', '.flac'}:
        raise ValidationError("Invalid format")
    return True
```

---

9. Common Mistakes


NEVER: Keep Audio Files


```python
# BAD - Audio persists
def transcribe(path):
    return model.transcribe(path)  # File remains

# GOOD - Delete after use
def transcribe(path):
    try:
        return model.transcribe(path)
    finally:
        Path(path).unlink()
```

NEVER: Log Transcription Content


```python
# BAD - Logs sensitive content
logger.info(f"Transcribed: {text}")

# GOOD - Log metadata only
logger.info("stt.complete", word_count=len(text.split()))
```

---

10. Pre-Implementation Checklist


Phase 1: Before Writing Code


  • Read SKILL.md completely
  • Review TDD workflow and performance patterns
  • Identify test cases for accuracy and latency requirements
  • Plan audio cleanup and privacy protections
  • Select appropriate model size for target hardware
  • Design temp file handling with secure permissions

Phase 2: During Implementation


  • Write failing tests first (accuracy, latency, memory)
  • Implement minimum code to pass tests
  • Audio deleted immediately after transcription
  • Temp files use restricted permissions (0o700)
  • No transcription content in logs
  • PII filtering implemented
  • Input validation (size, format, duration)
  • Voice activity detection enabled
  • Model loaded once (singleton pattern)
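The "model loaded once" item above can be sketched with a cached loader; `functools.lru_cache` here stands in for whatever singleton mechanism the project actually uses, and the dict return value is a stub for the real `WhisperModel` construction.

```python
from functools import lru_cache

# Hypothetical cached loader: the expensive constructor runs once per unique
# (model_size, device) pair; later calls reuse the cached instance.
@lru_cache(maxsize=None)
def load_model(model_size: str = "base", device: str = "cpu"):
    # In the real engine this would be WhisperModel(model_size, device=device, ...)
    return {"model_size": model_size, "device": device}
```

Repeated calls with the same arguments return the same object, so model weights are never reloaded per request.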

Phase 3: Before Committing


  • All tests pass: pytest tests/test_stt_engine.py -v
  • Coverage above 80%: pytest --cov=jarvis.stt
  • Latency under 300ms for short audio
  • Memory stable over repeated transcriptions
  • No audio files persist after processing
  • Security review completed (no PII leaks)

11. Summary


Your goal is to create STT systems that are:
  • Private: Audio processed locally, deleted immediately
  • Fast: Optimized for real-time voice assistant responses
  • Accurate: Appropriate model and preprocessing for context
You understand that voice data requires special privacy protection. Always delete audio after processing, never log transcription content, and filter PII from outputs.
Critical Reminders:
  1. Delete audio files immediately after transcription
  2. Never log transcription content
  3. Filter PII from transcription results
  4. Use secure temp directories with restricted permissions
  5. Validate all audio input (size, format, duration)