audio-reply

Audio Reply Skill

Generate spoken audio responses using MLX Audio TTS (chatterbox-turbo model).

Trigger Phrases

  • "read it to me [URL]" - Fetch content from URL and read it aloud
  • "talk to me [topic/question]" - Generate a conversational response as audio
  • "speak", "say it", "voice reply" - Convert your response to audio

How to Use

Mode 1: Read URL Content

User: read it to me https://example.com/article
  1. Fetch the URL content using WebFetch
  2. Extract readable text (strip HTML, focus on main content)
  3. Generate audio using TTS
  4. Play the audio and delete the file afterward

Mode 2: Conversational Audio Response

User: talk to me about the weather today
  1. Generate a natural, conversational response
  2. Keep it concise (TTS works best with shorter segments)
  3. Convert to audio, play it, then delete the file

Implementation

TTS Command

bash
uv run mlx_audio.tts.generate \
  --model mlx-community/chatterbox-turbo-fp16 \
  --text "Your text here" \
  --play \
  --file_prefix /tmp/audio_reply

Key Parameters

  • --model mlx-community/chatterbox-turbo-fp16
    - Fast, natural voice
  • --play
    - Auto-play the generated audio
  • --file_prefix
    - Save to temp location for cleanup
  • --exaggeration 0.3
    - Optional: add expressiveness (0.0-1.0)
  • --speed 1.0
    - Adjust speech rate if needed
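
The optional flags can be appended only when a value is supplied. The following is a minimal sketch, assuming a hypothetical helper named build_tts_cmd (not part of mlx-audio) that assembles the command from the flags listed above and prints it for inspection instead of running it.

```bash
#!/usr/bin/env bash
# build_tts_cmd is a hypothetical helper, not part of mlx-audio.
# Usage: build_tts_cmd TEXT FILE_PREFIX [SPEED] [EXAGGERATION]
build_tts_cmd() {
  local text="$1" prefix="$2" speed="${3:-}" exaggeration="${4:-}"
  # Required flags, taken from the TTS command shown earlier.
  local cmd=(uv run mlx_audio.tts.generate
    --model mlx-community/chatterbox-turbo-fp16
    --text "$text"
    --play
    --file_prefix "$prefix")
  # Optional flags are added only when a value was passed.
  if [ -n "$speed" ]; then cmd+=(--speed "$speed"); fi
  if [ -n "$exaggeration" ]; then cmd+=(--exaggeration "$exaggeration"); fi
  # Print a shell-quoted command line so it can be reviewed before running.
  printf '%q ' "${cmd[@]}"
  printf '\n'
}
```

For example, build_tts_cmd "Your text here" /tmp/audio_reply 1.0 0.3 prints the full command with both optional flags set.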

Text Preparation Guidelines

For "read it to me" mode:
  1. Fetch URL with WebFetch tool
  2. Extract main content, strip navigation/ads/boilerplate
  3. Summarize if very long (>500 words) - keep key points
  4. Add natural pauses with periods and commas
For "talk to me" mode:
  1. Write conversationally, as if speaking
  2. Use contractions (I'm, you're, it's)
  3. Add filler words sparingly for naturalness ([chuckle], um, anyway)
  4. Keep responses under 200 words for best quality
  5. Avoid technical jargon unless explaining it
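
The word limits above can be checked before any audio is generated. This is a rough sketch: count_words, exceeds_limit, and the two variable names are hypothetical, and only the 500 and 200 thresholds come from the guidelines.

```bash
#!/usr/bin/env bash
# Thresholds from the guidelines; the variable names are illustrative.
READ_WORD_LIMIT=500   # "read it to me": summarize above this
TALK_WORD_LIMIT=200   # "talk to me": keep responses under this

# Count whitespace-separated words in the given text.
count_words() {
  # tr strips the padding that some wc implementations add.
  printf '%s' "$1" | wc -w | tr -d '[:space:]'
}

# Succeeds (exit 0) when the text has more words than the limit.
exceeds_limit() {
  [ "$(count_words "$1")" -gt "$2" ]
}
```

A caller could run exceeds_limit "$article_text" "$READ_WORD_LIMIT" and summarize only when it succeeds.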

Audio Generation & Cleanup (IMPORTANT)

Always delete the audio file after playing - it's already in the chat history.
bash
# Generate with unique filename and play
OUTPUT_FILE="/tmp/audio_reply_$(date +%s)"
uv run mlx_audio.tts.generate \
  --model mlx-community/chatterbox-turbo-fp16 \
  --text "Your response text" \
  --play \
  --file_prefix "$OUTPUT_FILE"

# ALWAYS clean up after playing
rm -f "${OUTPUT_FILE}"*.wav 2>/dev/null
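
Generation and cleanup can be wrapped together so the file is removed even when playback fails. The sketch below assumes two hypothetical helper names, cleanup_audio and speak_and_cleanup. Appending the shell's process ID to the prefix is an addition made here to reduce filename clashes, not something the skill requires.

```bash
#!/usr/bin/env bash
# Both helpers are hypothetical names used for illustration.

# Remove every .wav file whose name starts with the given prefix.
cleanup_audio() {
  rm -f -- "$1"*.wav 2>/dev/null
}

# Generate and play audio, then clean up regardless of the TTS result.
speak_and_cleanup() {
  local text="$1"
  local prefix
  prefix="/tmp/audio_reply_$(date +%s)_$$"   # $$ is an assumption, see lead-in
  local status=0
  uv run mlx_audio.tts.generate \
    --model mlx-community/chatterbox-turbo-fp16 \
    --text "$text" \
    --play \
    --file_prefix "$prefix" || status=$?
  cleanup_audio "$prefix"
  # Report the TTS exit status to the caller, not the cleanup status.
  return "$status"
}
```

Keeping the exit status lets the caller decide whether to fall back to a text reply.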

Error Handling

If TTS fails:
  1. Check if model is downloaded (first run downloads ~500MB)
  2. Ensure uv is installed and in PATH
  3. Fall back to text response with apology
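
The checks above can be folded into one wrapper. This is a minimal sketch, assuming a hypothetical function named speak_or_fallback; the apology wording is illustrative, and the model download check is left to the TTS command itself.

```bash
#!/usr/bin/env bash
# speak_or_fallback is a hypothetical helper; the messages are illustrative.
speak_or_fallback() {
  local text="$1"
  # Step 2: make sure uv is installed and on PATH.
  if ! command -v uv >/dev/null 2>&1; then
    printf 'Sorry, I could not generate audio (uv is not on PATH). Here is the text instead:\n%s\n' "$text"
    return 0
  fi
  local prefix
  prefix="/tmp/audio_reply_$(date +%s)"
  # Step 3: if TTS fails for any reason, fall back to text with an apology.
  if ! uv run mlx_audio.tts.generate \
      --model mlx-community/chatterbox-turbo-fp16 \
      --text "$text" \
      --play \
      --file_prefix "$prefix"; then
    printf 'Sorry, audio generation failed. Here is the text instead:\n%s\n' "$text"
  fi
  # Clean up any partial output either way.
  rm -f -- "$prefix"*.wav 2>/dev/null
}
```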

Example Workflows

Example 1: Read URL

User: read it to me https://blog.example.com/new-feature

Assistant actions:
1. WebFetch the URL
2. Extract article content
3. Generate TTS:
   uv run mlx_audio.tts.generate \
     --model mlx-community/chatterbox-turbo-fp16 \
     --text "Here's what I found... [article summary]" \
     --play --file_prefix /tmp/audio_reply_1706123456
4. Delete: rm -f /tmp/audio_reply_1706123456*.wav
5. Confirm: "Done reading the article to you."

Example 2: Talk to Me

User: talk to me about what you can help with

Assistant actions:
1. Generate conversational response text
2. Generate TTS:
   uv run mlx_audio.tts.generate \
     --model mlx-community/chatterbox-turbo-fp16 \
     --text "Hey! So I can help you with all kinds of things..." \
     --play --file_prefix /tmp/audio_reply_1706123789
3. Delete: rm -f /tmp/audio_reply_1706123789*.wav
4. (No text output needed - audio IS the response)

Notes

  • First run may take longer as the model downloads (~500MB)
  • Audio quality is best for English; other languages may vary
  • For long content, consider chunking into multiple audio segments
  • The --play flag uses system audio - ensure volume is up
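
Chunking long content could look like the sketch below, which splits text from standard input into segments of at most a given number of words, one segment per output line. The function name chunk_words is hypothetical. Splitting on word count alone is a simplification; splitting on sentence boundaries would likely sound more natural.

```bash
#!/usr/bin/env bash
# chunk_words is a hypothetical helper.
# Usage: chunk_words MAX_WORDS < input.txt
chunk_words() {
  local max_words="$1"
  awk -v max="$max_words" '
    {
      for (i = 1; i <= NF; i++) {
        # Start a new segment or append to the current one.
        line = (n == 0) ? $i : line " " $i
        n++
        # Emit the segment once it reaches the word limit.
        if (n == max) { print line; n = 0; line = "" }
      }
    }
    # Emit whatever is left as a final, shorter segment.
    END { if (n > 0) print line }
  '
}
```

Each output line can then be passed to the TTS command in turn, for example with a while IFS= read -r chunk loop.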