speech-use

Speech Use

Use this skill to perform Text-to-Speech (TTS), Speech-to-Text (STT), and Voice Cloning operations.
This skill uses portable Python scripts managed by uv.

Prerequisites

  1. Environment Variables:
    • GOOGLE_API_KEY (for TTS via Gemini)
    • GOOGLE_CLOUD_PROJECT (required for STT and Voice Cloning)
    • GOOGLE_APPLICATION_CREDENTIALS (recommended for STT/Voice Cloning)
  2. APIs Enabled:
    • Text-to-Speech API (texttospeech.googleapis.com)
    • Speech-to-Text API (speech.googleapis.com)
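The environment variables listed above can be checked before any script is run. The following is a minimal sketch, not part of the skill itself: the function name `missing_env_vars` and the feature labels (`tts`, `stt`, `voice_cloning`) are hypothetical, and the required/recommended grouping simply mirrors the notes next to each variable.

```python
import os

# Variable names come from the prerequisites list; the feature labels
# used as dictionary keys are hypothetical names for this sketch.
REQUIRED_ENV = {
    "tts": ["GOOGLE_API_KEY"],
    "stt": ["GOOGLE_CLOUD_PROJECT"],
    "voice_cloning": ["GOOGLE_CLOUD_PROJECT"],
}
RECOMMENDED_ENV = {
    "stt": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "voice_cloning": ["GOOGLE_APPLICATION_CREDENTIALS"],
}


def missing_env_vars(feature, env=None):
    """Return (missing_required, missing_recommended) for one feature."""
    env = os.environ if env is None else env
    required = [name for name in REQUIRED_ENV[feature] if not env.get(name)]
    recommended = [
        name for name in RECOMMENDED_ENV.get(feature, []) if not env.get(name)
    ]
    return required, recommended


if __name__ == "__main__":
    for feature in REQUIRED_ENV:
        required, recommended = missing_env_vars(feature)
        print(feature, "missing required:", required,
              "missing recommended:", recommended)
```

An empty first list means the required variables for that feature are set; the second list is only a reminder.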

Usage

1. Generate Speech (TTS)

Generate audio from text using Gemini-TTS.
Standard Voice:
bash
uv run skills/speech-use/scripts/generate_speech.py "Hello world, this is a test." --voice Puck --output hello.wav
Custom Voice (Cloned):
bash
uv run skills/speech-use/scripts/generate_speech.py "This is my custom voice speaking." --voice-cloning-key "YOUR_KEY_HERE" --output custom.wav
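The two invocations above can also be driven from Python. This is a rough sketch under stated assumptions: `build_tts_command` and `generate_speech` are hypothetical helper names, only the flags shown above are used, and the check that exactly one of `voice` or `voice_cloning_key` is given is an assumption of this sketch rather than documented behavior of the script.

```python
import subprocess

# Script path as it appears in the commands above.
SCRIPT = "skills/speech-use/scripts/generate_speech.py"


def build_tts_command(text, output, voice=None, voice_cloning_key=None):
    """Build the argv list for a standard-voice or cloned-voice TTS run."""
    # Assumption of this sketch: a call uses either a prebuilt voice
    # or a cloning key, never both and never neither.
    if (voice is None) == (voice_cloning_key is None):
        raise ValueError("pass exactly one of voice or voice_cloning_key")
    cmd = ["uv", "run", SCRIPT, text]
    if voice is not None:
        cmd += ["--voice", voice]
    else:
        cmd += ["--voice-cloning-key", voice_cloning_key]
    cmd += ["--output", output]
    return cmd


def generate_speech(text, output, **kwargs):
    """Run the script; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_tts_command(text, output, **kwargs), check=True)
```

Passing the command as a list avoids shell quoting problems when the text contains quotes or spaces.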

2. Create Custom Voice (Voice Cloning)

Generate a voiceCloningKey from a reference audio file and a consent file.
Requirements:
  • reference.wav: 10-30s of clear speech (the voice to clone).
  • consent.wav: The speaker saying: "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model."
bash
uv run skills/speech-use/scripts/create_custom_voice.py --reference-audio reference.wav --consent-audio consent.wav
Save the output key to use with generate_speech.py.
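The 10-30s requirement for reference.wav can be verified locally before the file is uploaded. The sketch below uses Python's standard `wave` module and assumes an uncompressed PCM WAV file; the helper names are hypothetical, and the check covers only duration, not speech clarity.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_length_ok(path, min_seconds=10.0, max_seconds=30.0):
    """True if the clip falls inside the 10-30s window stated above."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds


if __name__ == "__main__":
    import sys

    clip = sys.argv[1] if len(sys.argv) > 1 else "reference.wav"
    print(clip, "length ok:", reference_length_ok(clip))
```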

3. Transcribe Audio (STT)

Transcribe audio files using Chirp 3.
bash
uv run skills/speech-use/scripts/transcribe_audio.py audio.wav --language en-US --output transcript.txt
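When several files need transcribing, the command above can be repeated from a short Python loop. This is an illustrative sketch only: `build_stt_command` and `transcribe_all` are hypothetical names, and naming each transcript after its audio file (with a `.txt` suffix) is a convention chosen here, not something the script requires.

```python
import subprocess
from pathlib import Path

# Script path as it appears in the command above.
SCRIPT = "skills/speech-use/scripts/transcribe_audio.py"


def build_stt_command(audio, language="en-US", output=None):
    """Build the argv list for one transcription run."""
    audio = Path(audio)
    # Convention of this sketch: audio.wav -> audio.txt unless overridden.
    output = Path(output) if output is not None else audio.with_suffix(".txt")
    return ["uv", "run", SCRIPT, str(audio),
            "--language", language, "--output", str(output)]


def transcribe_all(audio_files, language="en-US"):
    """Transcribe files one by one; stops at the first failing run."""
    for audio in audio_files:
        subprocess.run(build_stt_command(audio, language), check=True)
```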

Options

generate_speech.py
  • --voice: Prebuilt voice (e.g., Kore, Puck, Fenrir, Aoede).
  • --voice-cloning-key: Key from create_custom_voice.py.
  • --model: Default gemini-2.5-flash-preview-tts.
transcribe_audio.py
  • --model: Default chirp_3.
  • --language: Default auto.
  • --location: Cloud region (default us).
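To make the defaults for transcribe_audio.py concrete, here is a hypothetical `argparse` parser that mirrors the flags and default values listed above. It is not the script's actual source; the positional argument name `audio` and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented transcribe_audio.py options."""
    parser = argparse.ArgumentParser(prog="transcribe_audio.py")
    # Positional name "audio" is an assumption; the docs only show a file path.
    parser.add_argument("audio", help="path to the audio file to transcribe")
    parser.add_argument("--model", default="chirp_3", help="STT model")
    parser.add_argument("--language", default="auto", help="language code")
    parser.add_argument("--location", default="us", help="Cloud region")
    parser.add_argument("--output", help="file to write the transcript to")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["audio.wav", "--language", "en-US"])
    print(args.model, args.language, args.location)
```

Flags left off the command line fall back to the documented defaults, so only `--language` differs from its default in the example call.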

References

Before running scripts, review the reference guides for available voices and options.
  • Voices Guide - 30+ voice options with styles (Puck, Kore, Fenrir, Aoede, etc.)