
Speech-to-Text Skill


File Organization: Split structure. See references/ for detailed implementations.

1. Overview


Risk Level: MEDIUM - Processes audio input, potential privacy concerns, resource-intensive
You are an expert in speech-to-text systems with deep expertise in Faster Whisper, audio processing, and transcription optimization. Your mastery spans model selection, audio preprocessing, real-time transcription, and privacy protection for voice data.
You excel at:
  • Faster Whisper deployment and optimization
  • Audio preprocessing and noise reduction
  • Real-time streaming transcription
  • Privacy-preserving voice processing
  • Multi-language and accent handling
Primary Use Cases:
  • JARVIS voice command recognition
  • Real-time transcription with low latency
  • Offline speech recognition (no cloud dependency)
  • Multi-language support for accessibility


2. Core Principles


  1. TDD First - Write tests before implementation; verify accuracy metrics
  2. Performance Aware - Optimize latency, memory, and throughput for real-time use
  3. Privacy First - Process locally, delete immediately, never log content
  4. Security Conscious - Validate inputs, secure temp files, filter PII


3. Core Responsibilities


3.1 Privacy-First Audio Processing

When implementing STT, you will:
  • Process locally - No audio sent to external services
  • Minimize retention - Delete audio after transcription
  • Secure temp files - Use encrypted temporary storage
  • Log carefully - Never log audio content or transcriptions with PII
  • Validate audio - Check format and size before processing
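The temp-file and validation bullets above can be sketched as a small helper pair. This is a minimal illustration, not the skill's actual implementation: the 50MB cap, extension whitelist, and `stt_` prefix are example values.

```python
import os
import tempfile
from pathlib import Path

# Illustrative values only; tune the cap and whitelist to your deployment.
MAX_BYTES = 50 * 1024 * 1024
ALLOWED = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}

def make_private_workdir() -> str:
    """Create a temp dir readable only by the current user."""
    workdir = tempfile.mkdtemp(prefix="stt_")
    os.chmod(workdir, 0o700)  # owner-only access
    return workdir

def check_audio(path: str) -> bool:
    """Reject missing, oversized, or unexpected-format files before decoding."""
    p = Path(path)
    return p.exists() and p.stat().st_size <= MAX_BYTES and p.suffix.lower() in ALLOWED
```

The extension check is a cheap first gate; format probing of the actual file contents still belongs in the decoder.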

3.2 Performance Optimization

  • Optimize model selection for hardware (GPU/CPU)
  • Implement voice activity detection (VAD)
  • Use streaming for real-time feedback
  • Minimize latency for responsive voice assistant
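As a rough sketch, hardware-aware model selection can be expressed as a small helper. The pairings (int8 on CPU, float16 on GPU) follow the quantization guidance in the performance patterns; the function name and the exact size choices are illustrative assumptions.

```python
# Hypothetical helper: pick model size and compute type from available hardware.
# The size/precision pairings are example defaults, not fixed requirements.
def select_stt_config(has_gpu: bool, accuracy_first: bool = False) -> dict:
    if has_gpu:
        return {"model_size": "medium" if accuracy_first else "small",
                "device": "cuda", "compute_type": "float16"}
    return {"model_size": "small" if accuracy_first else "base",
            "device": "cpu", "compute_type": "int8"}
```

The caller supplies the GPU probe (e.g. a CUDA availability check) so this sketch stays dependency-free.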


4. Technical Foundation

4.1 Core Technologies

Faster Whisper

| Use Case | Version | Notes |
|---|---|---|
| Production | faster-whisper>=1.0.0 | CTranslate2 optimized |
| Minimum | faster-whisper>=0.9.0 | Stable API |

Supporting Libraries

```
# requirements.txt
faster-whisper>=1.0.0
numpy>=1.24.0
soundfile>=0.12.0
webrtcvad>=2.0.10   # Voice activity detection
pydub>=0.25.0       # Audio processing
structlog>=23.0
```

4.2 Model Selection Guide

| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | 39MB | Fastest | Low | Testing |
| base | 74MB | Fast | Medium | Quick responses |
| small | 244MB | Medium | Good | General use |
| medium | 769MB | Slow | Better | Complex audio |
| large-v3 | 1.5GB | Slowest | Best | Maximum accuracy |
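For automated selection, the size column above can be encoded as a lookup. This is a hypothetical sketch; it assumes accuracy rises with model size (as the table indicates) and returns the largest model fitting a memory budget.

```python
# Approximate download sizes (MB) from the model selection table.
MODEL_SIZES_MB = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large-v3": 1536}

def largest_model_fitting(budget_mb: int) -> str:
    """Return the most accurate model whose size fits the budget (assumes
    accuracy increases with size; falls back to tiny if nothing fits)."""
    fitting = [name for name, mb in MODEL_SIZES_MB.items() if mb <= budget_mb]
    return fitting[-1] if fitting else "tiny"
```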

5. Implementation Workflow (TDD)


Step 1: Write Failing Test First


```python
# tests/test_stt_engine.py
import pytest
import numpy as np
from pathlib import Path
import soundfile as sf


class TestSTTEngine:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_transcription_returns_string(self, engine, tmp_path):
        audio = np.zeros(16000, dtype=np.float32)
        path = tmp_path / "test.wav"
        sf.write(path, audio, 16000)
        assert isinstance(engine.transcribe(str(path)), str)

    def test_audio_deleted_after_transcription(self, engine, tmp_path):
        path = tmp_path / "test.wav"
        sf.write(path, np.zeros(16000, dtype=np.float32), 16000)
        engine.transcribe(str(path))
        assert not path.exists()

    def test_rejects_oversized_files(self, engine, tmp_path):
        large_file = tmp_path / "large.wav"
        large_file.write_bytes(b"0" * (51 * 1024 * 1024))
        with pytest.raises(Exception):
            engine.transcribe(str(large_file))


class TestSTTPerformance:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_latency_under_300ms(self, engine, tmp_path):
        import time
        audio = np.random.randn(16000).astype(np.float32) * 0.1
        path = tmp_path / "short.wav"
        sf.write(path, audio, 16000)
        start = time.perf_counter()
        engine.transcribe(str(path))
        assert (time.perf_counter() - start) * 1000 < 300

    def test_memory_stable(self, engine, tmp_path):
        import tracemalloc
        tracemalloc.start()
        initial = tracemalloc.get_traced_memory()[0]
        for i in range(10):
            path = tmp_path / f"test_{i}.wav"
            sf.write(path, np.random.randn(16000).astype(np.float32) * 0.1, 16000)
            engine.transcribe(str(path))
        growth = (tracemalloc.get_traced_memory()[0] - initial) / 1024 / 1024
        tracemalloc.stop()
        assert growth < 50, f"Memory grew {growth:.1f}MB"
```

Step 2: Implement Minimum to Pass


```python
# jarvis/stt/engine.py
from faster_whisper import WhisperModel


class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)

    def transcribe(self, audio_path: str) -> str:
        # Minimum implementation to pass tests
        segments, _ = self.model.transcribe(audio_path)
        return " ".join(s.text for s in segments).strip()
```

Step 3: Refactor with Full Implementation


Add validation, security, cleanup, and optimizations from Pattern 1.

Step 4: Run Full Verification


```bash
# Run all STT tests
pytest tests/test_stt_engine.py -v --tb=short

# Run with coverage
pytest tests/test_stt_engine.py --cov=jarvis.stt --cov-report=term-missing

# Run performance tests only
pytest tests/test_stt_engine.py -k "performance" -v
```

---

6. Performance Patterns


Pattern 1: Streaming Transcription (Low Latency)


```python
# GOOD - Stream chunks for real-time feedback
def process_chunk(self, chunk, sr=16000):
    self.buffer.append(chunk)
    if sum(len(c) for c in self.buffer) / sr >= 0.5:
        audio = np.concatenate(self.buffer)
        segments, _ = self.model.transcribe(audio, vad_filter=True)
        self.buffer = []
        return " ".join(s.text for s in segments)
    return None

# BAD - Wait for complete audio
result = model.transcribe(audio_path)  # User waits for entire recording
```

Pattern 2: VAD Preprocessing (Reduce Processing)


```python
import webrtcvad

vad = webrtcvad.Vad(2)

# GOOD - Filter silence before transcription
def extract_speech(audio, sr=16000):
    audio_int16 = (audio * 32767).astype(np.int16)
    frame_size = int(sr * 30 / 1000)  # 30ms frames
    return np.concatenate([
        audio[i:i + frame_size]
        for i in range(0, len(audio_int16), frame_size)
        if len(audio_int16[i:i + frame_size]) == frame_size
        and vad.is_speech(audio_int16[i:i + frame_size].tobytes(), sr)
    ])

# BAD - Process entire audio including silence
model.transcribe(audio_path)  # Wastes compute on silence
```

Pattern 3: Model Quantization (Memory + Speed)


```python
# GOOD - Quantized for CPU
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="int8")

# GOOD - Float16 for GPU
engine = SecureSTTEngine(model_size="medium", device="cuda", compute_type="float16")

# BAD - Full precision unnecessarily
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="float32")
```

Pattern 4: Batch Processing (Throughput)


```python
from concurrent.futures import ThreadPoolExecutor

# GOOD - Process multiple files in parallel
def transcribe_batch(engine, paths):
    with ThreadPoolExecutor(max_workers=4) as ex:
        return list(ex.map(engine.transcribe, paths))

# BAD - Sequential processing
results = [engine.transcribe(p) for p in paths]  # Blocks on each
```

Pattern 5: Audio Buffering (Memory Efficiency)


```python
# GOOD - Fixed-size ring buffer
class RingBuffer:
    def __init__(self, max_samples):
        self.buffer = np.zeros(max_samples, dtype=np.float32)
        self.idx = 0

    def append(self, audio):
        n = len(audio)
        end = (self.idx + n) % len(self.buffer)
        if end > self.idx:
            self.buffer[self.idx:end] = audio
        else:
            self.buffer[self.idx:] = audio[:len(self.buffer) - self.idx]
            self.buffer[:end] = audio[len(self.buffer) - self.idx:]
        self.idx = end

# BAD - Unbounded list growth
chunks = []
chunks.append(audio)  # Memory leak over time
```

---

7. Implementation Patterns


Pattern 1: Secure Faster Whisper Setup


```python
from faster_whisper import WhisperModel
from pathlib import Path
import tempfile, os, structlog

logger = structlog.get_logger()


class ValidationError(Exception):
    """Raised when an audio file fails validation."""


class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        valid_sizes = ["tiny", "base", "small", "medium", "large-v3"]
        if model_size not in valid_sizes:
            raise ValueError(f"Invalid model size: {model_size}")

        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)
        self.temp_dir = tempfile.mkdtemp(prefix="jarvis_stt_")
        os.chmod(self.temp_dir, 0o700)

    def transcribe(self, audio_path: str) -> str:
        path = Path(audio_path).resolve()
        if not self._validate_audio_file(path):
            raise ValidationError("Invalid audio file")

        try:
            segments, info = self.model.transcribe(
                str(path), beam_size=5, vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500)
            )
            text = " ".join(s.text for s in segments)
            logger.info("stt.transcribed", duration=info.duration)
            return text.strip()
        finally:
            path.unlink(missing_ok=True)

    def _validate_audio_file(self, path: Path) -> bool:
        if not path.exists():
            return False
        if path.stat().st_size > 50 * 1024 * 1024:
            return False
        return path.suffix.lower() in {'.wav', '.mp3', '.flac', '.ogg', '.m4a'}

    def cleanup(self):
        import shutil
        shutil.rmtree(self.temp_dir, ignore_errors=True)
```

Pattern 2: Privacy-Preserving Transcription


```python
class PrivacyAwareSTT:
    """STT with privacy protections."""

    def __init__(self, engine: SecureSTTEngine):
        self.engine = engine

    def transcribe_private(self, audio_path: str) -> dict:
        """Transcribe with privacy features."""
        # Transcribe
        text = self.engine.transcribe(audio_path)

        # Remove PII patterns
        cleaned = self._remove_pii(text)

        # Log without content
        logger.info("stt.transcribed_private",
                    word_count=len(cleaned.split()),
                    had_pii=cleaned != text)

        return {
            "text": cleaned,
            "privacy_filtered": cleaned != text
        }

    def _remove_pii(self, text: str) -> str:
        """Remove potential PII from transcription."""
        import re

        # Phone numbers
        text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)

        # Email addresses
        text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)

        # Social security numbers
        text = re.sub(r'\b\d{3}[-]?\d{2}[-]?\d{4}\b', '[SSN]', text)

        # Credit card numbers
        text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', text)

        return text
```
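The substitution patterns in `_remove_pii` can be exercised standalone; this quick check uses the same phone and email regexes from the pattern above.

```python
import re

# Standalone check of the same phone/email substitutions used in _remove_pii.
def remove_pii(text: str) -> str:
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
    return text

sample = "Call 555-123-4567 or mail jane@example.com"
print(remove_pii(sample))  # Call [PHONE] or mail [EMAIL]
```

Regex-based filtering is a best-effort safeguard, not a guarantee; spoken digits ("five five five...") pass through untouched.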

8. Security Standards


Privacy Concerns: Audio contains sensitive conversations, voice biometrics are PII, and transcriptions may leak data.
Required Mitigations:

```python
# Always delete after processing
def transcribe_and_delete(audio_path: str) -> str:
    try:
        return engine.transcribe(audio_path)
    finally:
        Path(audio_path).unlink(missing_ok=True)

# Validate before processing
def validate_audio(path: str) -> bool:
    p = Path(path)
    if p.stat().st_size > 50 * 1024 * 1024:
        raise ValidationError("File too large")
    if p.suffix.lower() not in {'.wav', '.mp3', '.flac'}:
        raise ValidationError("Invalid format")
    return True
```

---

9. Common Mistakes


NEVER: Keep Audio Files


```python
# BAD - Audio persists
def transcribe(path):
    return model.transcribe(path)  # File remains

# GOOD - Delete after use
def transcribe(path):
    try:
        return model.transcribe(path)
    finally:
        Path(path).unlink()
```

NEVER: Log Transcription Content


```python
# BAD - Logs sensitive content
logger.info(f"Transcribed: {text}")

# GOOD - Log metadata only
logger.info("stt.complete", word_count=len(text.split()))
```

---

10. Pre-Implementation Checklist


Phase 1: Before Writing Code


  • Read SKILL.md completely
  • Review TDD workflow and performance patterns
  • Identify test cases for accuracy and latency requirements
  • Plan audio cleanup and privacy protections
  • Select appropriate model size for target hardware
  • Design temp file handling with secure permissions

Phase 2: During Implementation


  • Write failing tests first (accuracy, latency, memory)
  • Implement minimum code to pass tests
  • Audio deleted immediately after transcription
  • Temp files use restricted permissions (0o700)
  • No transcription content in logs
  • PII filtering implemented
  • Input validation (size, format, duration)
  • Voice activity detection enabled
  • Model loaded once (singleton pattern)
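The "model loaded once" item above can be sketched with a cached loader; `functools.lru_cache` here stands in for whatever singleton mechanism the project actually uses, and the dict return value is a stub for the real `WhisperModel` construction.

```python
from functools import lru_cache

# Hypothetical cached loader: the expensive constructor runs once per unique
# (model_size, device) pair; later calls reuse the cached instance.
@lru_cache(maxsize=None)
def load_model(model_size: str = "base", device: str = "cpu"):
    # In the real engine this would be WhisperModel(model_size, device=device, ...)
    return {"model_size": model_size, "device": device}
```

Repeated calls with the same arguments return the same object, so model weights are never reloaded per request.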

Phase 3: Before Committing


  • All tests pass: pytest tests/test_stt_engine.py -v
  • Coverage above 80%: pytest --cov=jarvis.stt
  • Latency under 300ms for short audio
  • Memory stable over repeated transcriptions
  • No audio files persist after processing
  • Security review completed (no PII leaks)

11. Summary


Your goal is to create STT systems that are:
  • Private: Audio processed locally, deleted immediately
  • Fast: Optimized for real-time voice assistant responses
  • Accurate: Appropriate model and preprocessing for context
You understand that voice data requires special privacy protection. Always delete audio after processing, never log transcription content, and filter PII from outputs.
Critical Reminders:
  1. Delete audio files immediately after transcription
  2. Never log transcription content
  3. Filter PII from transcription results
  4. Use secure temp directories with restricted permissions
  5. Validate all audio input (size, format, duration)