text-to-speech
Text-to-Speech Skill
File Organization: Split structure. See references/ for detailed implementations.
1. Overview
Risk Level: MEDIUM - Generates audio output, potential for inappropriate content synthesis, resource-intensive
You are an expert in text-to-speech systems with deep expertise in Kokoro TTS, voice synthesis, and audio generation optimization. Your mastery spans model configuration, voice customization, streaming audio output, and secure handling of synthesized speech.
You excel at:
- Kokoro TTS deployment and voice configuration
- Real-time streaming synthesis for low latency
- Voice customization and prosody control
- Audio output optimization and format conversion
- Content filtering for appropriate synthesis
Primary Use Cases:
- JARVIS voice responses
- Real-time speech synthesis with natural prosody
- Offline TTS (no cloud dependency)
- Multi-voice support for different contexts
2. Core Principles
- TDD First - Write tests before implementation. Verify synthesis output, audio quality, and error handling.
- Performance Aware - Optimize for latency: streaming synthesis, model caching, audio chunking.
- Security First - Filter content, validate inputs, clean up generated files.
- Resource Efficient - Manage GPU/CPU usage, limit concurrency, and enforce timeouts.
3. Implementation Workflow (TDD)
Step 1: Write Failing Test First
```python
# tests/test_tts_engine.py
from pathlib import Path

import pytest
import soundfile as sf

# Assumption: ValidationError is exported alongside SecureTTSEngine.
from jarvis.tts import SecureTTSEngine, ValidationError


@pytest.fixture
def tts_engine():
    engine = SecureTTSEngine(voice="af_heart")
    yield engine
    engine.cleanup()


class TestSecureTTSEngine:
    def test_synthesize_returns_valid_audio(self, tts_engine):
        audio_path = tts_engine.synthesize("Hello test")
        assert Path(audio_path).exists()
        assert audio_path.endswith('.wav')

    def test_audio_has_correct_sample_rate(self, tts_engine):
        audio_path = tts_engine.synthesize("Test")
        _, sample_rate = sf.read(audio_path)
        assert sample_rate == 24000

    def test_rejects_empty_text(self, tts_engine):
        with pytest.raises(ValidationError):
            tts_engine.synthesize("")

    def test_rejects_text_exceeding_limit(self, tts_engine):
        with pytest.raises(ValidationError):
            tts_engine.synthesize("x" * 6000)

    def test_filters_sensitive_content(self, tts_engine):
        audio_path = tts_engine.synthesize("password: secret123")
        assert Path(audio_path).exists()

    def test_cleanup_removes_temp_files(self, tts_engine):
        tts_engine.synthesize("Test")
        temp_dir = tts_engine.temp_dir
        tts_engine.cleanup()
        assert not Path(temp_dir).exists()
```
Step 2: Implement Minimum to Pass
Implement SecureTTSEngine with required methods. Focus only on making tests pass.
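The tests in Step 1 expect a `ValidationError`, and section 7 also raises a `TTSError`. A minimal sketch of the validation slice, which is the smallest piece needed to make the two rejection tests pass, could look like this. The class layout and the helper name `require_valid_text` are illustrative assumptions, not part of the original skill.

```python
class TTSError(Exception):
    """Base class for TTS failures (hypothetical hierarchy)."""


class ValidationError(TTSError):
    """Raised when input text is empty, too long, or otherwise unusable."""


def require_valid_text(text: str, max_chars: int = 5000) -> str:
    """Return text unchanged, or raise ValidationError."""
    if not text or not text.strip():
        raise ValidationError("Text is empty")
    if len(text) > max_chars:
        raise ValidationError(f"Text too long (max {max_chars} chars)")
    return text
```

Deriving `ValidationError` from `TTSError` lets callers catch every engine failure with a single `except TTSError`, while tests can still assert on the more specific type.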
Step 3: Refactor Following Patterns
After tests pass, refactor for streaming output, caching, and async compatibility.
Step 4: Run Full Verification
```bash
pytest tests/test_tts_engine.py -v                 # Run tests
pytest --cov=jarvis.tts --cov-report=term-missing  # Coverage
mypy src/jarvis/tts/                               # Type check
python -m jarvis.tts --test "Hello JARVIS"         # Integration
```
4. Performance Patterns
Pattern: Streaming Synthesis (Low Latency)
```python
import numpy as np
import sounddevice as sd

# BAD - Wait for full audio
audio_chunks = []
for _, _, audio in pipeline(text):
    audio_chunks.append(audio)
play_audio(np.concatenate(audio_chunks))  # Long wait

# GOOD - Stream chunks immediately
with sd.OutputStream(samplerate=24000, channels=1) as stream:
    for _, _, audio in pipeline(text):
        stream.write(audio)  # Play as generated
```
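The latency difference comes purely from the order of operations, which can be seen without any audio hardware. The sketch below uses a stand-in generator (`fake_pipeline` is hypothetical and only mimics the three-element tuples used in the loops above) and records when each chunk is generated and when it is played.

```python
from typing import Iterator, List, Tuple

# (graphemes, phonemes, audio), mirroring the tuples unpacked above
Chunk = Tuple[str, str, List[float]]


def fake_pipeline(sentences: List[str], log: List[str]) -> Iterator[Chunk]:
    """Stand-in generator: yields one audio chunk per sentence."""
    for i, sentence in enumerate(sentences):
        log.append(f"generate {i}")
        yield sentence, "", [0.0] * 4


def play_buffered(sentences: List[str]) -> List[str]:
    """Collect every chunk first, then play: nothing is heard until the end."""
    log: List[str] = []
    chunks = [audio for _, _, audio in fake_pipeline(sentences, log)]
    for i, _ in enumerate(chunks):
        log.append(f"play {i}")
    return log


def play_streamed(sentences: List[str]) -> List[str]:
    """Play each chunk as soon as it is produced."""
    log: List[str] = []
    for i, (_, _, _audio) in enumerate(fake_pipeline(sentences, log)):
        log.append(f"play {i}")  # a real player would call stream.write here
    return log
```

With two sentences, `play_buffered` logs both `generate` events before any `play` event, while `play_streamed` alternates them, so the first chunk is audible after one generation step instead of all of them.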
Pattern: Model Caching (Faster Startup)
```python
from kokoro import KPipeline

# BAD - Reload each time
pipeline = KPipeline(lang_code="a")

# GOOD - Singleton pattern
class TTSEngine:
    _pipeline = None

    @classmethod
    def get_pipeline(cls):
        if cls._pipeline is None:
            cls._pipeline = KPipeline(lang_code="a")
        return cls._pipeline
```
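If the pipeline can be requested from several threads at once (for example from executor workers), two callers may both see `None` and load the model twice. One possible guard is double-checked locking. In this sketch the class name `CachedPipeline` and the `factory` argument are assumptions; in practice `factory` might be `lambda: KPipeline(lang_code="a")`.

```python
import threading
from typing import Callable, Optional


class CachedPipeline:
    """Lazily build one shared pipeline, even when threads race on first use."""

    _instance: Optional[object] = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, factory: Callable[[], object]) -> object:
        if cls._instance is None:          # fast path: no lock once loaded
            with cls._lock:
                if cls._instance is None:  # re-check while holding the lock
                    cls._instance = factory()
        return cls._instance
```

The second `None` check matters: a thread that waited on the lock must not rebuild what the lock holder just created.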
Pattern: Audio Chunking (Memory Efficient)
```python
import soundfile as sf

# BAD - Full file in RAM
data, sr = sf.read(audio_path)

# GOOD - Process in chunks
def iter_processed(audio_path):
    with sf.SoundFile(audio_path) as f:
        while f.tell() < len(f):
            yield process(f.read(24000))  # 24000 frames = 1 second at 24 kHz
```
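The same chunked-read idea can be tried with only the standard library. This sketch swaps `soundfile` for Python's built-in `wave` module as a stand-in; the helper `write_silence` exists only to produce a test file and is not part of the skill.

```python
import wave
from typing import Iterator


def iter_wav_chunks(path: str, frames_per_chunk: int = 24000) -> Iterator[bytes]:
    """Yield raw PCM bytes, at most frames_per_chunk frames at a time."""
    with wave.open(path, "rb") as wav:
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:  # empty bytes means end of file
                break
            yield chunk


def write_silence(path: str, seconds: float, sample_rate: int = 24000) -> None:
    """Demo helper: write mono 16-bit silence of the given duration."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
```

For a 2.5 second file at 24 kHz, the iterator yields two full one-second chunks and one half-second chunk, so peak memory is bounded by the chunk size rather than the file length.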
Pattern: Async Generation (Non-blocking)
```python
import asyncio

# BAD - Blocks event loop
audio = engine.synthesize(text)

# GOOD - Run in executor
async def synthesize_async(engine, text):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, engine.synthesize, text)
```
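To see that the event loop really stays responsive, a small experiment helps. In this sketch `blocking_synthesize` is a hypothetical stand-in for `engine.synthesize`, and a heartbeat task counts how often the loop gets to run while synthesis is in progress.

```python
import asyncio
import time


def blocking_synthesize(text: str) -> str:
    """Stand-in for engine.synthesize: blocks the calling thread."""
    time.sleep(0.2)
    return f"{text}.wav"


async def heartbeat(ticks: list) -> None:
    """Record a tick every 10 ms whenever the loop is free to run this task."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main() -> tuple:
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks))
    loop = asyncio.get_running_loop()
    # The blocking call runs on a worker thread; the loop keeps scheduling tasks.
    path = await loop.run_in_executor(None, blocking_synthesize, "hello")
    beat.cancel()
    return path, len(ticks)
```

Running `asyncio.run(main())` returns the path together with a tick count well above one. Calling `blocking_synthesize` directly inside `main` would instead starve the heartbeat for the whole 0.2 seconds.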
Pattern: Voice Preloading (Instant Response)
```python
# BAD - Cold start on the first request
def get_engine(voice_type):
    return SecureTTSEngine(voice=VOICES[voice_type])

# GOOD - Preload at startup
def _preload_voices(self, types: list[str]):
    for t in types:
        self.engines[t] = SecureTTSEngine(voice=VOICES[t])
```
5. Core Responsibilities
5.1 Secure Audio Generation
When implementing TTS, you will:
- Filter input text - Block inappropriate or harmful content
- Validate text length - Prevent DoS via excessive generation
- Secure output storage - Proper permissions on generated audio
- Clean up files - Delete generated audio after playback
- Log safely - Don't log sensitive text content
5.2 Performance Optimization
- Optimize for real-time streaming output
- Implement audio caching for repeated phrases
- Balance quality vs. latency for voice assistant use
- Manage GPU/CPU resources efficiently
6. Technical Foundation
6.1 Core Technologies
Kokoro TTS
| Use Case | Version | Notes |
|---|---|---|
| Production | kokoro>=0.3.0 | Latest stable |
Supporting Libraries
```text
# requirements.txt
kokoro>=0.3.0
numpy>=1.24.0
soundfile>=0.12.0
sounddevice>=0.4.6
scipy>=1.10.0
pydantic>=2.0
structlog>=23.0
```
6.2 Voice Configuration
| Voice | Style | Use Case |
|---|---|---|
| af_heart | Warm, friendly | Default JARVIS |
| af_bella | Professional | Formal responses |
| am_adam | Male | Alternative voice |
| bf_emma | British | Accent variation |
7. Implementation Patterns
Pattern 1: Secure TTS Engine
```python
import os
import re
import shutil
import tempfile
import uuid
from pathlib import Path

import numpy as np
import soundfile as sf
import structlog
from kokoro import KPipeline

logger = structlog.get_logger()


# Minimal exception types (assumed; adapt to the project's own hierarchy).
class TTSError(Exception):
    """Base class for TTS failures."""


class ValidationError(TTSError):
    """Raised when input text is rejected."""


class SecureTTSEngine:
    """Secure text-to-speech with content filtering."""

    def __init__(self, voice: str = "af_heart", lang_code: str = "a"):
        # Initialize Kokoro pipeline
        self.pipeline = KPipeline(lang_code=lang_code)
        self.voice = voice
        # Content filter patterns
        self.blocked_patterns = [
            r"password\s*[:=]",
            r"api[_-]?key\s*[:=]",
            r"secret\s*[:=]",
        ]
        # Create secure temp directory
        self.temp_dir = tempfile.mkdtemp(prefix="jarvis_tts_")
        os.chmod(self.temp_dir, 0o700)
        logger.info("tts.initialized", voice=voice)

    def synthesize(self, text: str) -> str:
        """Synthesize text to audio file."""
        # Validate and filter input
        if not self._validate_text(text):
            raise ValidationError("Invalid text input")
        filtered_text = self._filter_sensitive(text)
        # Generate audio
        audio_path = Path(self.temp_dir) / f"{uuid.uuid4()}.wav"
        generator = self.pipeline(
            filtered_text,
            voice=self.voice,
            speed=1.0,
        )
        # Collect audio chunks
        audio_chunks = []
        for _, _, audio in generator:
            audio_chunks.append(audio)
        if not audio_chunks:
            raise TTSError("No audio generated")
        # Concatenate and save
        full_audio = np.concatenate(audio_chunks)
        sf.write(str(audio_path), full_audio, 24000)
        # Log only lengths and durations, never the text itself
        logger.info("tts.synthesized",
                    text_length=len(text),
                    audio_duration=len(full_audio) / 24000)
        return str(audio_path)

    def _validate_text(self, text: str) -> bool:
        """Validate text input."""
        if not text or not text.strip():
            return False
        # Length limit (prevent DoS)
        if len(text) > 5000:
            logger.warning("tts.text_too_long", length=len(text))
            return False
        return True

    def _filter_sensitive(self, text: str) -> str:
        """Filter sensitive content from text."""
        filtered = text
        for pattern in self.blocked_patterns:
            if re.search(pattern, filtered, re.IGNORECASE):
                logger.warning("tts.sensitive_content_filtered")
                # \s* lets the value follow the separator after optional
                # spaces, so "password: secret123" is redacted as well.
                filtered = re.sub(pattern + r'\s*\S+', '[FILTERED]',
                                  filtered, flags=re.IGNORECASE)
        return filtered

    def cleanup(self):
        """Clean up temp files."""
        if os.path.exists(self.temp_dir):
            shutil.rmtree(self.temp_dir)
```
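The redaction step is easy to get subtly wrong: a value pattern of `\S+` on its own does not match when a space follows the separator, as in `password: secret123`, so the secret would be spoken anyway. A minimal standalone sketch of the redaction, allowing optional whitespace before the value (the function name `redact` is illustrative):

```python
import re

BLOCKED_PATTERNS = [
    r"password\s*[:=]",
    r"api[_-]?key\s*[:=]",
    r"secret\s*[:=]",
]


def redact(text: str, placeholder: str = "[FILTERED]") -> str:
    """Replace each 'label: value' pair with a placeholder."""
    for pattern in BLOCKED_PATTERNS:
        # \s* lets the value follow the separator after optional spaces.
        text = re.sub(pattern + r"\s*\S+", placeholder, text, flags=re.IGNORECASE)
    return text
```

For example, `redact("My API_KEY=abc123 is set")` gives `"My [FILTERED] is set"`, and text without a matching label passes through unchanged. The value is assumed to be a single whitespace-free token; multi-word secrets would need a stricter policy, such as rejecting the whole utterance.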
Pattern 2: Streaming TTS
```python
import sounddevice as sd

# Stream audio chunks as generated for low latency
with sd.OutputStream(samplerate=24000, channels=1) as stream:
    for _, _, audio in pipeline(text, voice=voice):
        stream.write(audio)  # Play immediately
```
Pattern 3: Audio Caching
```python
import hashlib
from pathlib import Path

# Cache common phrases with hash key
def synthesize_cached(text: str, voice: str, cache_dir: Path) -> str:
    cache_key = hashlib.sha256(f"{text}:{voice}".encode()).hexdigest()
    cache_path = cache_dir / f"{cache_key}.wav"
    if cache_path.exists():
        return str(cache_path)  # Cache hit
    # Cache miss: generate, save to cache_path, return str(cache_path)
```
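A fuller version of the cache lookup, including the miss path, might look like the sketch below. The `synthesize` callable is a hypothetical hook that writes audio for `(text, voice)` to a given path; it stands in for whatever engine call the project uses.

```python
import hashlib
from pathlib import Path
from typing import Callable


def synthesize_cached(
    text: str,
    voice: str,
    cache_dir: Path,
    synthesize: Callable[[str, str, Path], None],
) -> str:
    """Return a cached WAV path, calling synthesize(text, voice, out_path) on a miss."""
    cache_key = hashlib.sha256(f"{text}:{voice}".encode()).hexdigest()
    cache_path = cache_dir / f"{cache_key}.wav"
    if cache_path.exists():
        return str(cache_path)  # cache hit

    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp_path = cache_path.with_suffix(".tmp")
    synthesize(text, voice, tmp_path)  # hypothetical callable that writes the audio
    tmp_path.replace(cache_path)       # rename so readers never see a partial file
    return str(cache_path)
```

Writing to a temporary name and renaming keeps a concurrent reader from picking up a half-written file. Because cached files are meant to outlive a single playback, the cache directory should probably be kept separate from the per-engine temp directory that `cleanup()` removes.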
Pattern 4: Voice Manager
```python
# Lazy-load engines per voice type
VOICES = {"default": "af_heart", "formal": "af_bella"}
engines: dict[str, SecureTTSEngine] = {}  # one engine per voice type, built on demand

def get_engine(voice_type: str) -> SecureTTSEngine:
    if voice_type not in engines:
        engines[voice_type] = SecureTTSEngine(voice=VOICES[voice_type])
    return engines[voice_type]
```
Pattern 5: Resource Limits
```python
import asyncio

# Semaphore for concurrency + timeout for protection.
# Create the semaphore once and share it: a new Semaphore(2) per call limits nothing.
tts_semaphore = asyncio.Semaphore(2)

async def synthesize_limited(engine, text):
    loop = asyncio.get_running_loop()
    async with tts_semaphore:
        return await asyncio.wait_for(
            loop.run_in_executor(None, engine.synthesize, text),
            timeout=30.0,
        )
```
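Whether the limit actually holds can be checked with a probe that counts overlapping calls. In this sketch `ConcurrencyProbe` is a hypothetical stand-in for the engine, and every task shares a single semaphore.

```python
import asyncio
import threading
import time


class ConcurrencyProbe:
    """Blocking stand-in for engine.synthesize that records peak parallelism."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def synthesize(self, text: str) -> str:
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.05)  # simulated synthesis work
        with self._lock:
            self.active -= 1
        return f"{text}.wav"


async def synthesize_all(texts: list, limit: int = 2, timeout: float = 30.0) -> tuple:
    probe = ConcurrencyProbe()
    semaphore = asyncio.Semaphore(limit)  # one semaphore shared by every task
    loop = asyncio.get_running_loop()

    async def one(text: str) -> str:
        async with semaphore:
            return await asyncio.wait_for(
                loop.run_in_executor(None, probe.synthesize, text), timeout
            )

    paths = await asyncio.gather(*(one(t) for t in texts))
    return list(paths), probe.peak
```

Calling `asyncio.run(synthesize_all(["a", "b", "c", "d", "e"]))` returns all five paths with a recorded peak of at most 2. Note that `wait_for` only stops waiting when the timeout fires; the worker thread keeps running until the blocking call returns, so the timeout protects the caller rather than the CPU.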
8. Security Standards
8.1 Content Filtering
Prevent synthesis of inappropriate content:
```python
import re


class ContentFilter:
    """Filter inappropriate content before synthesis."""

    BLOCKED_CATEGORIES = [
        "violence",
        "hate_speech",
        "explicit",
    ]

    def __init__(self, blocked_patterns: list[str] | None = None):
        # Assumption: regexes for the blocked categories are supplied by the caller.
        self.blocked_patterns = blocked_patterns or []

    def filter(self, text: str) -> tuple[str, bool]:
        """Filter text and return (filtered_text, was_modified)."""
        original = text
        # Remove potential command injection
        text = text.replace(";", "").replace("|", "").replace("&", "")
        # Check for blocked patterns
        for pattern in self.blocked_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return "[Content filtered]", True
        return text, text != original
```
8.2 Input Validation
```python
def validate_tts_input(text: str) -> bool:
    """Validate text for TTS synthesis."""
    # Length limit
    if len(text) > 5000:
        raise ValidationError("Text too long (max 5000 chars)")
    # Character validation
    if not all(c.isprintable() or c in '\n\t' for c in text):
        raise ValidationError("Invalid characters in text")
    return True
```
9. Common Mistakes
NEVER: Synthesize Untrusted Input Directly
```python
# BAD - No filtering
def speak(user_input: str):
    engine.synthesize(user_input)

# GOOD - Filter first
def speak(user_input: str):
    # ContentFilter.filter returns (filtered_text, was_modified)
    filtered, _ = content_filter.filter(user_input)
    engine.synthesize(filtered)
```
NEVER: Unlimited Generation
```python
# BAD - Can generate very long audio
engine.synthesize(long_text)  # No limit

# GOOD - Enforce limits
if len(text) > 5000:
    raise ValidationError("Text too long")
engine.synthesize(text)
```
10. Pre-Implementation Checklist
Before Writing Code
- Write failing tests for TTS synthesis output
- Define expected audio format (24kHz WAV)
- Plan content filtering patterns
- Design caching strategy for common phrases
- Review Kokoro TTS API documentation
During Implementation
- Run tests after each method implementation
- Implement streaming output for low latency
- Add input validation (length, characters)
- Implement sensitive content filtering
- Set up secure temp directory with 0o700 permissions
- Add concurrency limits (max 2 workers)
- Implement timeout protection (30s default)
Before Committing
- All TTS tests pass: `pytest tests/test_tts_engine.py -v`
- Coverage meets threshold: `pytest --cov=jarvis.tts`
- Type checking passes: `mypy src/jarvis/tts/`
- No sensitive text logged
- Generated audio cleanup verified
- Voice preloading tested
- Integration test passes: `python -m jarvis.tts --test`
11. Summary
Your goal is to create TTS systems that are:
- Fast: Real-time streaming for a responsive voice assistant
- Safe: Content filtering for appropriate synthesis
- Efficient: Caching for common phrases
You understand that TTS requires input validation and content filtering to prevent synthesis of inappropriate content. Always enforce text length limits and clean up generated audio files.
Critical Reminders:
- Filter text content before synthesis
- Enforce text length limits (max 5000 chars)
- Delete generated audio after playback
- Never log sensitive text content
- Cache common phrases for performance