text-to-speech

Text-to-Speech Skill


File Organization: Split structure. See references/ for detailed implementations.


1. Overview


Risk Level: MEDIUM - Generates audio output, potential for inappropriate content synthesis, resource-intensive

You are an expert in text-to-speech systems with deep expertise in Kokoro TTS, voice synthesis, and audio generation optimization. Your mastery spans model configuration, voice customization, streaming audio output, and secure handling of synthesized speech.

You excel at:

Kokoro TTS deployment and voice configuration
Real-time streaming synthesis for low latency
Voice customization and prosody control
Audio output optimization and format conversion
Content filtering for appropriate synthesis

Primary Use Cases:

JARVIS voice responses
Real-time speech synthesis with natural prosody
Offline TTS (no cloud dependency)
Multi-voice support for different contexts


2. Core Principles


TDD First - Write tests before implementation. Verify synthesis output, audio quality, and error handling.
Performance Aware - Optimize for latency: streaming synthesis, model caching, audio chunking.
Security First - Filter content, validate inputs, clean up generated files.
Resource Efficient - Manage GPU/CPU usage, limit concurrency, enforce timeouts.


3. Implementation Workflow (TDD)


Step 1: Write Failing Test First


python

# tests/test_tts_engine.py
from pathlib import Path

import pytest
import soundfile as sf

# Assumes ValidationError is exported alongside SecureTTSEngine.
from jarvis.tts import SecureTTSEngine, ValidationError


@pytest.fixture
def tts_engine():
    engine = SecureTTSEngine(voice="af_heart")
    yield engine
    engine.cleanup()


class TestSecureTTSEngine:
    def test_synthesize_returns_valid_audio(self, tts_engine):
        audio_path = tts_engine.synthesize("Hello test")
        assert Path(audio_path).exists()
        assert audio_path.endswith(".wav")

    def test_audio_has_correct_sample_rate(self, tts_engine):
        audio_path = tts_engine.synthesize("Test")
        _, sample_rate = sf.read(audio_path)
        assert sample_rate == 24000

    def test_rejects_empty_text(self, tts_engine):
        with pytest.raises(ValidationError):
            tts_engine.synthesize("")

    def test_rejects_text_exceeding_limit(self, tts_engine):
        with pytest.raises(ValidationError):
            tts_engine.synthesize("x" * 6000)

    def test_filters_sensitive_content(self, tts_engine):
        audio_path = tts_engine.synthesize("password: secret123")
        assert Path(audio_path).exists()

    def test_cleanup_removes_temp_files(self, tts_engine):
        tts_engine.synthesize("Test")
        temp_dir = tts_engine.temp_dir
        tts_engine.cleanup()
        assert not Path(temp_dir).exists()

Step 2: Implement Minimum to Pass


Implement SecureTTSEngine with required methods. Focus only on making tests pass.


Step 3: Refactor Following Patterns


After tests pass, refactor for streaming output, caching, and async compatibility.


Step 4: Run Full Verification


bash

pytest tests/test_tts_engine.py -v                    # Run tests
pytest --cov=jarvis.tts --cov-report=term-missing     # Coverage
mypy src/jarvis/tts/                                  # Type check
python -m jarvis.tts --test "Hello JARVIS"            # Integration


4. Performance Patterns


Pattern: Streaming Synthesis (Low Latency)


python

# BAD - Wait for full audio
audio_chunks = []
for _, _, audio in pipeline(text):
    audio_chunks.append(audio)
play_audio(np.concatenate(audio_chunks))  # Long wait

# GOOD - Stream chunks immediately
with sd.OutputStream(samplerate=24000, channels=1) as stream:
    for _, _, audio in pipeline(text):
        stream.write(audio)  # Play as generated

Pattern: Model Caching (Faster Startup)


python

# BAD - Reload each time
pipeline = KPipeline(lang_code="a")

# GOOD - Singleton pattern
class TTSEngine:
    _pipeline = None

    @classmethod
    def get_pipeline(cls):
        if cls._pipeline is None:
            cls._pipeline = KPipeline(lang_code="a")
        return cls._pipeline
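The class-level cache above assumes a single caller at a time. If several threads can request the pipeline during startup, two of them may both see an empty cache and load the model twice. The following is a minimal sketch of one way to guard against that with a lock; `LazySingleton` and `fake_loader` are hypothetical names for illustration, and the factory stands in for a real loader such as `lambda: KPipeline(lang_code="a")`.

```python
import threading
from typing import Callable, Optional


class LazySingleton:
    """Create an expensive object once, even when several threads ask at the same time."""

    def __init__(self, factory: Callable[[], object]):
        self._factory = factory  # hypothetical loader supplied by the caller
        self._instance: Optional[object] = None
        self._lock = threading.Lock()

    def get(self) -> object:
        # Fast path: skip the lock once the instance exists.
        if self._instance is None:
            with self._lock:
                # Re-check inside the lock so only one thread runs the factory.
                if self._instance is None:
                    self._instance = self._factory()
        return self._instance


# Usage with a stand-in factory that records how often it runs.
calls = []


def fake_loader() -> dict:
    calls.append(1)
    return {"model": "stub"}


holder = LazySingleton(fake_loader)
threads = [threading.Thread(target=holder.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Passing the loader in as a callable keeps the sketch independent of Kokoro, so the locking logic can be tested without loading a model.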

Pattern: Audio Chunking (Memory Efficient)


python

# BAD - Full file in RAM
data, sr = sf.read(audio_path)

# GOOD - Process in chunks
def process_in_chunks(audio_path):
    with sf.SoundFile(audio_path) as f:
        while f.tell() < f.frames:
            yield process(f.read(24000))  # 24000 frames = 1 second at 24 kHz
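The same chunking idea can be tried without any third-party packages. The sketch below swaps soundfile for the standard-library `wave` module, so it yields raw PCM bytes instead of NumPy arrays; `iter_wav_chunks` is a hypothetical helper name, and the demo file is one silent 2.5-second mono clip written only for the example.

```python
import os
import tempfile
import wave
from typing import Iterator


def iter_wav_chunks(path: str, frames_per_chunk: int = 24000) -> Iterator[bytes]:
    """Yield raw PCM bytes from a WAV file, at most frames_per_chunk frames at a time."""
    with wave.open(path, "rb") as wav:
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:  # readframes returns b"" at end of file
                break
            yield chunk


# Demo: write 60000 frames of 16-bit mono silence at 24 kHz, then read it back in chunks.
fd, demo_path = tempfile.mkstemp(suffix=".wav")
os.close(fd)
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 2 bytes per sample
    out.setframerate(24000)
    out.writeframes(b"\x00\x00" * 60000)

chunk_sizes = [len(chunk) for chunk in iter_wav_chunks(demo_path)]
os.remove(demo_path)
```

Only one chunk is held in memory at a time, so peak memory depends on `frames_per_chunk` and not on the length of the file.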

Pattern: Async Generation (Non-blocking)


python

# BAD - Blocks event loop
audio = engine.synthesize(text)

# GOOD - Run in executor
audio = await loop.run_in_executor(None, engine.synthesize, text)

Pattern: Voice Preloading (Instant Response)


python

# BAD - Cold start
return SecureTTSEngine(voice=VOICES[voice_type])

# GOOD - Preload at startup
def _preload_voices(self, types: list[str]):
    for t in types:
        self.engines[t] = SecureTTSEngine(voice=VOICES[t])

---

5. Core Responsibilities


5.1 Secure Audio Generation


When implementing TTS, you will:

Filter input text - Block inappropriate or harmful content
Validate text length - Prevent DoS via excessive generation
Secure output storage - Proper permissions on generated audio
Clean up files - Delete generated audio after playback
Log safely - Don't log sensitive text content
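
Two of the responsibilities above, deleting audio after playback and logging without exposing text, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the engine's actual code: `text_fingerprint`, `temporary_audio`, and this version of `speak` are hypothetical helpers, and the synthesis and playback calls are left out.

```python
import hashlib
import logging
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator

logger = logging.getLogger("tts")


def text_fingerprint(text: str) -> str:
    """Short identifier for correlating log lines without writing the text itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@contextmanager
def temporary_audio(suffix: str = ".wav") -> Iterator[str]:
    """Yield the path of a private temp file and delete it when the block exits."""
    # mkstemp creates the file readable and writable only by the current user.
    fd, path = tempfile.mkstemp(prefix="jarvis_tts_", suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)


def speak(text: str) -> None:
    # Log the length and a fingerprint only, never the text.
    logger.info("tts.request length=%d id=%s", len(text), text_fingerprint(text))
    with temporary_audio() as path:
        pass  # synthesize to `path` and play it here (engine calls omitted)
```

A fingerprint of a very short or guessable phrase can still be matched by brute force, so for highly sensitive text it may be safer to log the length alone.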


5.2 Performance Optimization


Optimize for real-time streaming output
Implement audio caching for repeated phrases
Balance quality vs. latency for voice assistant use
Manage GPU/CPU resources efficiently


6. Technical Foundation


6.1 Core Technologies


Kokoro TTS

Use Case	Version	Notes
Production	kokoro>=0.3.0	Latest stable

Supporting Libraries

python

# requirements.txt
kokoro>=0.3.0
numpy>=1.24.0
soundfile>=0.12.0
sounddevice>=0.4.6
scipy>=1.10.0
pydantic>=2.0
structlog>=23.0

6.2 Voice Configuration


Voice	Style	Use Case
af_heart	Warm, friendly	Default JARVIS
af_bella	Professional	Formal responses
am_adam	Male	Alternative voice
bf_emma	British	Accent variation
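
The voice IDs in this table appear to follow a `<language><gender>_<name>` pattern, where the first letter (`a` or `b`) matches the `lang_code` passed to `KPipeline`. That matters because the engine in this document defaults to `lang_code="a"`, which may not suit `bf_emma`. Assuming the convention holds for the installed Kokoro release (worth checking against its voice list), a small hypothetical helper could derive the language code from the voice ID:

```python
def lang_code_for_voice(voice_id: str) -> str:
    """Return the language letter of a voice ID such as "bf_emma".

    Assumption: IDs look like "<language><gender>_<name>"; anything else is rejected.
    """
    prefix, sep, name = voice_id.partition("_")
    if len(prefix) != 2 or not sep or not name:
        raise ValueError(f"Unexpected voice ID format: {voice_id!r}")
    return prefix[0]


# The four voices from the table above.
VOICE_IDS = ["af_heart", "af_bella", "am_adam", "bf_emma"]
codes = {voice: lang_code_for_voice(voice) for voice in VOICE_IDS}
```

With a helper like this, an engine could be constructed as `SecureTTSEngine(voice=v, lang_code=lang_code_for_voice(v))` so that voice and pipeline language stay in step.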


7. Implementation Patterns


Pattern 1: Secure TTS Engine


python

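from kokoro import KPipeline
import soundfile as sf
import numpy as np
from pathlib import Path
import tempfile
import os
import uuid
import structlog

logger = structlog.get_logger()


# Assumed definitions: the source uses these exceptions without showing where they come from.
class ValidationError(Exception):
    """Raised when input text fails validation."""


class TTSError(Exception):
    """Raised when synthesis produces no audio."""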

class SecureTTSEngine:
    """Secure text-to-speech with content filtering."""

    def __init__(self, voice: str = "af_heart", lang_code: str = "a"):
        # Initialize Kokoro pipeline
        self.pipeline = KPipeline(lang_code=lang_code)
        self.voice = voice

        # Content filter patterns
        self.blocked_patterns = [
            r"password\s*[:=]",
            r"api[_-]?key\s*[:=]",
            r"secret\s*[:=]",
        ]

        # Create secure temp directory
        self.temp_dir = tempfile.mkdtemp(prefix="jarvis_tts_")
        os.chmod(self.temp_dir, 0o700)

        logger.info("tts.initialized", voice=voice)

    def synthesize(self, text: str) -> str:
        """Synthesize text to audio file."""
        # Validate and filter input
        if not self._validate_text(text):
            raise ValidationError("Invalid text input")

        filtered_text = self._filter_sensitive(text)

        # Generate audio
        audio_path = Path(self.temp_dir) / f"{uuid.uuid4()}.wav"

        generator = self.pipeline(
            filtered_text,
            voice=self.voice,
            speed=1.0
        )

        # Collect audio chunks
        audio_chunks = []
        for _, _, audio in generator:
            audio_chunks.append(audio)

        if not audio_chunks:
            raise TTSError("No audio generated")

        # Concatenate and save
        full_audio = np.concatenate(audio_chunks)
        sf.write(str(audio_path), full_audio, 24000)

        logger.info("tts.synthesized",
                   text_length=len(text),
                   audio_duration=len(full_audio) / 24000)

        return str(audio_path)

    def _validate_text(self, text: str) -> bool:
        """Validate text input."""
        if not text or not text.strip():
            return False

        # Length limit (prevent DoS)
        if len(text) > 5000:
            logger.warning("tts.text_too_long", length=len(text))
            return False

        return True

    def _filter_sensitive(self, text: str) -> str:
        """Filter sensitive content from text."""
        import re

        filtered = text
        for pattern in self.blocked_patterns:
            if re.search(pattern, filtered, re.IGNORECASE):
                logger.warning("tts.sensitive_content_filtered")
                # \s* lets the redaction cover a value that follows a space, as in "password: secret123"
                filtered = re.sub(pattern + r'\s*\S+', '[FILTERED]', filtered, flags=re.IGNORECASE)

        return filtered

    def cleanup(self):
        """Clean up temp files."""
        import shutil
        if os.path.exists(self.temp_dir):
            shutil.rmtree(self.temp_dir)


Pattern 2: Streaming TTS


python

# Stream audio chunks as generated for low latency
with sd.OutputStream(samplerate=24000, channels=1) as stream:
    for _, _, audio in pipeline(text, voice=voice):
        stream.write(audio)  # Play immediately

Pattern 3: Audio Caching


python

# Cache common phrases with hash key
cache_key = hashlib.sha256(f"{text}:{voice}".encode()).hexdigest()
cache_path = cache_dir / f"{cache_key}.wav"
if cache_path.exists():
    return str(cache_path)  # Cache hit

# Generate, save to cache, return path
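The fragment above leaves the miss path as a comment. One possible way to fill it in is sketched below, under the assumption that synthesis is available as a callable that writes a WAV file to a given path; `cached_synthesize` and `synthesize_to` are hypothetical names, not part of Kokoro or of the engine shown earlier.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_synthesize(
    text: str,
    voice: str,
    cache_dir: Path,
    synthesize_to: Callable[[str, str, Path], None],
) -> str:
    """Return a cached WAV path for (text, voice), generating the file on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_key = hashlib.sha256(f"{text}:{voice}".encode()).hexdigest()
    cache_path = cache_dir / f"{cache_key}.wav"

    if cache_path.exists():
        return str(cache_path)  # Cache hit

    # Write under a temporary name, then rename, so an interrupted write
    # never leaves a truncated file under the final cache name.
    tmp_path = cache_path.with_name(cache_path.name + ".tmp")
    synthesize_to(text, voice, tmp_path)
    tmp_path.replace(cache_path)
    return str(cache_path)
```

Cached files outlive playback, which sits in tension with the rule to delete generated audio, so a cache like this is probably best reserved for fixed, non-sensitive phrases.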

Pattern 4: Voice Manager


python

# Lazy-load engines per voice type
VOICES = {"default": "af_heart", "formal": "af_bella"}
engines: dict[str, SecureTTSEngine] = {}

def get_engine(voice_type: str) -> SecureTTSEngine:
    if voice_type not in engines:
        engines[voice_type] = SecureTTSEngine(voice=VOICES[voice_type])
    return engines[voice_type]

Pattern 5: Resource Limits


python

# Semaphore for concurrency + timeout for protection
# Create the semaphore once and share it; a new Semaphore per call limits nothing.
synthesis_slots = asyncio.Semaphore(2)

async with synthesis_slots:
    result = await asyncio.wait_for(
        loop.run_in_executor(None, engine.synthesize, text),
        timeout=30.0
    )
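
For the limit to have any effect, every request has to wait on the same semaphore instance. The sketch below shows the pattern end to end with a stand-in for `engine.synthesize`; `blocking_synthesize`, `synthesize_limited`, and the overlap counter are hypothetical and exist only to make the behaviour observable.

```python
import asyncio
import threading
import time

MAX_CONCURRENT = 2
TIMEOUT_SECONDS = 30.0

_active = 0
peak_active = 0
_counter_lock = threading.Lock()


def blocking_synthesize(text: str) -> str:
    """Stand-in for engine.synthesize; records how many calls overlap."""
    global _active, peak_active
    with _counter_lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.05)  # simulate synthesis work
    with _counter_lock:
        _active -= 1
    return f"{text}.wav"


async def synthesize_limited(sem: asyncio.Semaphore, text: str) -> str:
    loop = asyncio.get_running_loop()
    async with sem:  # wait for a free slot
        # On timeout the await is abandoned, but the worker thread keeps
        # running until blocking_synthesize returns.
        return await asyncio.wait_for(
            loop.run_in_executor(None, blocking_synthesize, text),
            timeout=TIMEOUT_SECONDS,
        )


async def main() -> list:
    sem = asyncio.Semaphore(MAX_CONCURRENT)  # created once, shared by every request
    texts = ["one", "two", "three", "four"]
    return await asyncio.gather(*(synthesize_limited(sem, t) for t in texts))


results = asyncio.run(main())
```

Because a timed-out call still occupies a worker thread until it finishes, the timeout protects callers from waiting, not the machine from doing the work.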

---

8. Security Standards


8.1 Content Filtering


Prevent synthesis of inappropriate content:

python

import re


class ContentFilter:
    """Filter inappropriate content before synthesis."""

    BLOCKED_CATEGORIES = [
        "violence",
        "hate_speech",
        "explicit",
    ]

    def __init__(self, blocked_patterns=None):
        # Regex patterns to block. The source does not show how these relate to
        # BLOCKED_CATEGORIES, so here they are assumed to be supplied by the caller.
        self.blocked_patterns = blocked_patterns or []

    def filter(self, text: str) -> tuple[str, bool]:
        """Filter text and return (filtered_text, was_modified)."""
        # Remove potential command injection
        text = text.replace(";", "").replace("|", "").replace("&", "")

        # Check for blocked patterns
        for pattern in self.blocked_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return "[Content filtered]", True

        return text, False


8.2 Input Validation


python

def validate_tts_input(text: str) -> bool:
    """Validate text for TTS synthesis."""
    # Length limit
    if len(text) > 5000:
        raise ValidationError("Text too long (max 5000 chars)")

    # Character validation
    if not all(c.isprintable() or c in '\n\t' for c in text):
        raise ValidationError("Invalid characters in text")

    return True


9. Common Mistakes


NEVER: Synthesize Untrusted Input Directly


python

# BAD - No filtering
def speak(user_input: str):
    engine.synthesize(user_input)

# GOOD - Filter first
def speak(user_input: str):
    filtered, _ = content_filter.filter(user_input)  # filter() returns (text, was_modified)
    engine.synthesize(filtered)

NEVER: Unlimited Generation


python

# BAD - Can generate very long audio
engine.synthesize(long_text)  # No limit

# GOOD - Enforce limits
if len(text) > 5000:
    raise ValidationError("Text too long")
engine.synthesize(text)

---

10. Pre-Implementation Checklist


Before Writing Code


During Implementation


Before Committing


All TTS tests pass:
```
pytest tests/test_tts_engine.py -v
```
Coverage meets threshold:
```
pytest --cov=jarvis.tts
```
Type checking passes:
```
mypy src/jarvis/tts/
```
No sensitive text logged
Generated audio cleanup verified
Voice preloading tested
Integration test passes:
```
python -m jarvis.tts --test
```


11. Summary


Your goal is to create TTS systems that are:

Fast: Real-time streaming for a responsive voice assistant
Safe: Content filtering for appropriate synthesis
Efficient: Caching for common phrases

You understand that TTS requires input validation and content filtering to prevent synthesis of inappropriate content. Always enforce text length limits and clean up generated audio files.

Critical Reminders:

Filter text content before synthesis
Enforce text length limits (max 5000 chars)
Delete generated audio after playback
Never log sensitive text content
Cache common phrases for performance
