google-gemini-media
Gemini Multimodal Media (Image/Video/Speech) Skill
1. Goals and scope
This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:
- Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
- Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API)
- Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
- Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
- Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
- Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)
Convention: This Skill follows the official Google Gen AI SDK (Node.js/REST) as the main line; currently only Node.js/REST examples are provided. If your project already wraps other languages or frameworks, map this Skill's request structure, model selection, and I/O spec to your wrapper layer.
2. Quick routing (decide which capability to use)
- Do you need to produce images?
  - Need to generate images from scratch or edit based on an image -> use Nano Banana image generation (see Section 5)
- Do you need to understand images?
  - Need recognition, description, Q&A, comparison, or info extraction -> use Image understanding (see Section 6)
- Do you need to produce video?
  - Need to generate an 8-second video (optionally with native audio) -> use Veo 3.1 video generation (see Section 7)
- Do you need to understand video?
  - Need summaries/Q&A/segment extraction with timestamps -> use Video understanding (see Section 8)
- Do you need to read text aloud?
  - Need controllable narration, podcast/audiobook style, etc. -> use Speech generation (TTS) (see Section 9)
- Do you need to understand audio?
  - Need audio descriptions, transcription, time-range transcription, token counting -> use Audio understanding (see Section 10)
3. Unified engineering constraints and I/O spec (must read)
3.0 Prerequisites (dependencies and tools)
- Node.js 18+ (match your project version)
- Install SDK (example):
```bash
npm install @google/genai
```
- REST examples only need `curl`; if you need to parse image Base64, install `jq` (optional).
3.1 Authentication and environment variables
- Put your API key in the `GEMINI_API_KEY` environment variable.
- REST requests send it in the `x-goog-api-key: $GEMINI_API_KEY` header.
3.2 Two file input modes: Inline vs Files API
Inline (embedded bytes/Base64)
- Pros: shorter call chain, good for small files.
- Key constraint: total request size (text prompt + system instructions + embedded bytes) typically has a ~20MB ceiling.
Files API (upload then reference)
- Pros: good for large files, reusing the same file, or multi-turn conversations.
- Typical flow:
  - `files.upload(...)` (SDK) or `POST /upload/v1beta/files` (REST resumable)
  - Use `file_data` / `file_uri` in `generateContent`
Engineering suggestion: implement `ensure_file_uri()` so that when a file exceeds a threshold (for example, warn at 10-15MB) or is reused, you automatically route through the Files API.
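The routing rule behind `ensure_file_uri()` can be sketched as a small decision function. This is a minimal sketch, not an official helper: the name `chooseInputMode` and the 10MB threshold are hypothetical, and the check uses the Base64-encoded size because inline bytes grow by roughly 4/3 when encoded.

```javascript
// Hypothetical threshold: route to the Files API once the encoded payload
// reaches ~10MB, well below the ~20MB total-request ceiling.
const FILES_API_THRESHOLD_BYTES = 10 * 1024 * 1024;

/**
 * Decide how a media file should be sent.
 * @param {{ sizeBytes: number, reused?: boolean }} file raw size on disk and reuse flag
 * @returns {"inline" | "files"}
 */
function chooseInputMode({ sizeBytes, reused = false }) {
  // Base64 emits 4 output bytes for every 3 input bytes (padded).
  const encodedBytes = Math.ceil(sizeBytes / 3) * 4;
  if (reused || encodedBytes >= FILES_API_THRESHOLD_BYTES) return "files";
  return "inline";
}
```

A caller would upload with `ai.files.upload(...)` when the result is `"files"` and embed `inlineData` otherwise.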
3.3 Unified handling of binary media outputs
- Images: usually returned as `inline_data` (Base64) in response parts; the Python SDK offers `part.as_image()`, while in Node.js you decode `part.inlineData.data` from Base64 and save it as PNG/JPG.
- Speech (TTS): usually returns PCM bytes (Base64); save as `.pcm` or wrap into `.wav` (commonly 24kHz, 16-bit, mono).
- Video (Veo): long-running async task; poll the operation, then download the file (or use the returned URI).
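Wrapping raw PCM into a `.wav` file only requires prepending a 44-byte RIFF header. The sketch below assumes the commonly cited defaults (24kHz, 16-bit, mono); `pcmToWav` is a hypothetical helper name, and the parameters should be adjusted if the API reports a different format.

```javascript
/**
 * Prepend a canonical 44-byte WAV header to raw little-endian PCM samples.
 * @param {Buffer} pcm raw PCM bytes (for example, decoded from Base64)
 * @returns {Buffer} a complete WAV file
 */
function pcmToWav(pcm, { sampleRate = 24000, channels = 1, bitsPerSample = 16 } = {}) {
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + pcm.length, 4);          // file size minus the first 8 bytes
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16);                      // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);                       // audio format 1 = linear PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(pcm.length, 40);              // size of the sample data
  return Buffer.concat([header, pcm]);
}
```

For TTS output, pass `Buffer.from(data, "base64")` to `pcmToWav` and write the result to `out.wav`.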
4. Model selection matrix (choose by scenario)
Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.
4.1 Image generation (Nano Banana)
- gemini-2.5-flash-image: optimized for speed/throughput; good for frequent, low-latency generation/editing.
- gemini-3-pro-image-preview: stronger instruction following and high-fidelity text rendering; better for professional assets and complex edits.
4.2 General image/video/audio understanding
- Docs use `gemini-3-flash-preview` for image, video, and audio understanding (choose stronger models as needed for quality/cost).
4.3 Video generation (Veo)
- Example model: `veo-3.1-generate-preview` (generates 8-second video and can natively generate audio).
4.4 Speech generation (TTS)
- Example model: `gemini-2.5-flash-preview-tts` (native TTS, currently in preview).
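One way to keep the matrix in a single place is a lookup table, so model names can be updated without touching call sites. This is only a sketch: the task keys, the `default`/`quality` tiers, and `pickModel` are hypothetical names, and the model IDs are the ones listed in this section (verify them against the official docs before use).

```javascript
// Model IDs copied from Section 4; they may change over time.
const MODELS = {
  imageGeneration: {
    default: "gemini-2.5-flash-image",
    quality: "gemini-3-pro-image-preview",
  },
  understanding: { default: "gemini-3-flash-preview" },
  videoGeneration: { default: "veo-3.1-generate-preview" },
  tts: { default: "gemini-2.5-flash-preview-tts" },
};

/** Return the model ID for a task, falling back to the task's default tier. */
function pickModel(task, tier = "default") {
  const entry = MODELS[task];
  if (!entry) throw new Error(`Unknown task: ${task}`);
  return entry[tier] ?? entry.default;
}
```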
5. Image generation (Nano Banana)
5.1 Text-to-Image
SDK (Node.js) minimal template
```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents:
    "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});

// The response can mix text parts and Base64-encoded image parts.
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.text) console.log(part.text);
  if (part.inlineData?.data) {
    fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
```
REST (with imageConfig) minimal template
```bash
curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents":[{"parts":[{"text":"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
    "generationConfig": {"imageConfig": {"aspectRatio":"16:9"}}
  }'
```
REST image parsing (Base64 decode)
```bash
# REST responses typically use camelCase (inlineData); the filter also accepts inline_data.
curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \
  | jq -r '.candidates[0].content.parts[] | (.inlineData // .inline_data) | select(.) | .data' \
  | base64 --decode > out.png
```
macOS can use: base64 -D > out.png
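The SDK template above writes every image part to the same `out.png`, so a response with several images keeps only the last one. A possible way around that, sketched under assumptions: `extractInlineImages` and the extension table are hypothetical, and a missing MIME type is treated as PNG purely as a guess.

```javascript
// Hypothetical mapping from MIME type to file extension.
const EXT_BY_MIME = { "image/png": "png", "image/jpeg": "jpg", "image/webp": "webp" };

/**
 * Collect all inline image parts of a generateContent response.
 * Accepts camelCase (SDK) and snake_case field names.
 */
function extractInlineImages(response, baseName = "out") {
  const parts = response?.candidates?.[0]?.content?.parts ?? [];
  const images = [];
  for (const part of parts) {
    const inline = part.inlineData ?? part.inline_data;
    if (!inline?.data) continue; // skip text-only parts
    const mimeType = inline.mimeType ?? inline.mime_type ?? "image/png"; // assumption: default to PNG
    const ext = EXT_BY_MIME[mimeType] ?? "bin";
    images.push({
      fileName: `${baseName}-${images.length + 1}.${ext}`,
      mimeType,
      bytes: Buffer.from(inline.data, "base64"),
    });
  }
  return images;
}
```

Each returned entry can then be saved with `fs.writeFileSync(image.fileName, image.bytes)`.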
5.2 Text-and-Image-to-Image
Use case: given an image, add/remove/modify elements, change style, color grading, etc.
SDK (Node.js) minimal template
```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "Add a nano banana on the table, keep lighting consistent, cinematic tone.";
// Send the source image inline as Base64 next to the edit instruction.
const imageBase64 = fs.readFileSync("input.png").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: [
    { text: prompt },
    { inlineData: { mimeType: "image/png", data: imageBase64 } },
  ],
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.inlineData?.data) {
    fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
```
5.3 Multi-turn image iteration (Multi-turn editing)
Best practice: use chat for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style").
To output mixed "text + image" results, set `response_modalities` to `["TEXT", "IMAGE"]` (`responseModalities` in the Node.js SDK config).
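A chat-based iteration loop might look like the sketch below. It assumes the `ai.chats.create(...)` and `chat.sendMessage({ message })` calls of the `@google/genai` SDK; `iterateImage` is a hypothetical wrapper, and the client is passed in as a parameter so the function can be exercised without network access.

```javascript
/**
 * Send a sequence of prompts through one chat session so each edit
 * builds on the previous image.
 * @param ai a GoogleGenAI client (or a stand-in with the same shape)
 * @param {string[]} prompts instructions, applied in order
 */
async function iterateImage(ai, prompts, model = "gemini-2.5-flash-image") {
  const chat = ai.chats.create({
    model,
    config: { responseModalities: ["TEXT", "IMAGE"] },
  });
  const turns = [];
  for (const message of prompts) {
    const response = await chat.sendMessage({ message });
    const parts = response.candidates?.[0]?.content?.parts ?? [];
    turns.push({
      // Concatenate any text the model returned for this turn.
      text: parts.filter((p) => p.text).map((p) => p.text).join(""),
      // Decode every image part of this turn.
      images: parts
        .filter((p) => p.inlineData?.data)
        .map((p) => Buffer.from(p.inlineData.data, "base64")),
    });
  }
  return turns;
}
```

A typical prompt list would be a base generation, then a targeted edit, then a request for same-style variants.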
5.4 ImageConfig
You can set the following in `generationConfig.imageConfig` (REST) or the SDK config:
- `aspectRatio`: e.g. `16:9`, `1:1`.
- `imageSize`: e.g. `2K`, `4K` (higher resolution is usually slower/more expensive and model support can vary).
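To avoid hand-writing the JSON for each call, the REST body can be assembled by a small builder. This is a sketch that mirrors the request shape of the REST template in Section 5.1; `buildImageRequest` is a hypothetical name, and the builder does not validate the values because supported ratios and sizes vary by model.

```javascript
/**
 * Build a generateContent request body with an optional imageConfig.
 * Unset options are omitted so the API defaults apply.
 */
function buildImageRequest(prompt, { aspectRatio, imageSize } = {}) {
  const imageConfig = {};
  if (aspectRatio) imageConfig.aspectRatio = aspectRatio;
  if (imageSize) imageConfig.imageSize = imageSize;

  const body = { contents: [{ parts: [{ text: prompt }] }] };
  if (Object.keys(imageConfig).length > 0) {
    body.generationConfig = { imageConfig };
  }
  return body;
}
```

`JSON.stringify(buildImageRequest("A banana", { aspectRatio: "16:9" }))` produces a payload suitable for `curl -d` or a `fetch` body.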
6. Image understanding (Image Understanding)
6.1 Two ways to provide input images
- Inline image data: suitable for small files (total request size < 20MB).
- Files API upload: better for large files or reuse across multiple requests.
6.2 Inline images (Node.js) minimal template
```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Embed the image bytes directly in the request as Base64.
const imageBase64 = fs.readFileSync("image.jpg").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: [
    { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
    { text: "Caption this image, and list any visible brands." },
  ],
});
console.log(response.text);
```
6.3 Upload and reference with Files API (Node.js) minimal template
```js
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Upload once, then reference the file by URI in the prompt.
const uploaded = await ai.files.upload({ file: "image.jpg" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Caption this image.",
  ]),
});
console.log(response.text);
```
6.4 Multi-image prompts
Append multiple images as multiple `Part` entries in the same `contents`; you can mix uploaded references and inline bytes.
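Mixing the two input modes in one prompt can be sketched as a parts builder. The `fileData` and `inlineData` shapes follow the SDK part format used in the templates above; `buildMultiImageParts` and the `{ uri | base64, mimeType }` descriptor are hypothetical.

```javascript
/**
 * Turn a list of image descriptors into SDK parts, followed by the text prompt.
 * Each descriptor carries either `uri` (Files API reference) or `base64` (inline bytes).
 */
function buildMultiImageParts(images, prompt) {
  const parts = images.map((img) => {
    if (img.uri) {
      return { fileData: { fileUri: img.uri, mimeType: img.mimeType } };
    }
    if (img.base64) {
      return { inlineData: { mimeType: img.mimeType, data: img.base64 } };
    }
    throw new Error("Each image needs either `uri` or `base64`");
  });
  parts.push({ text: prompt });
  return parts;
}
```

The result can be passed as `contents: [{ role: "user", parts }]` to `generateContent`.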
7. Video generation (Veo 3.1)
7.1 Core features (must know)
- Generates 8-second high-fidelity video, optionally 720p / 1080p / 4k, and supports native audio generation (dialogue, ambience, SFX).
- Supports:
  - Aspect ratio (16:9 / 9:16)
  - Video extension (extend a generated video; typically limited to 720p)
  - First/last frame control (frame-specific)
  - Up to 3 reference images (image-based direction)
7.2 SDK (Node.js) minimal template: async polling + download
```js
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience.";

// Video generation is a long-running operation: start it, then poll.
let operation = await ai.models.generateVideos({
  model: "veo-3.1-generate-preview",
  prompt,
  config: { resolution: "1080p" },
});

while (!operation.done) {
  await new Promise((resolve) => setTimeout(resolve, 10_000)); // poll every 10 s
  operation = await ai.operations.getVideosOperation({ operation });
}

const video = operation.response?.generatedVideos?.[0]?.video;
if (!video) throw new Error("No video returned");
await ai.files.download({ file: video, downloadPath: "out.mp4" });
```
7.3 REST minimal template: predictLongRunning + poll + download
Key point: Veo REST uses `:predictLongRunning` to return an operation name, then poll `GET /v1beta/{operation_name}`; once done, download from the video URI in the response.
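The three REST steps can be sketched with `fetch` (built into Node.js 18+). Treat this as an assumption-laden outline rather than the official template: `generateVideoRest` is a hypothetical name, the `instances: [{ prompt }]` request body and the `response.generateVideoResponse.generatedSamples[0].video.uri` path reflect the REST shape as commonly documented and should be checked against the current API reference, and `fetchImpl` is injectable only so the flow can be tested offline.

```javascript
const BASE_URL = "https://generativelanguage.googleapis.com/v1beta";

/** Start a Veo job over REST, poll the operation, and return the video URI. */
async function generateVideoRest(prompt, {
  apiKey,
  model = "veo-3.1-generate-preview",
  fetchImpl = fetch,
  pollMs = 10_000,
  timeoutMs = 600_000,
} = {}) {
  const headers = { "x-goog-api-key": apiKey, "Content-Type": "application/json" };

  // Step 1: start the long-running operation.
  const start = await fetchImpl(`${BASE_URL}/models/${model}:predictLongRunning`, {
    method: "POST",
    headers,
    body: JSON.stringify({ instances: [{ prompt }] }), // assumed request shape
  });
  if (!start.ok) throw new Error(`start failed: HTTP ${start.status}`);
  const { name } = await start.json();

  // Step 2: poll GET /v1beta/{operation_name} until done or timed out.
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetchImpl(`${BASE_URL}/${name}`, { headers });
    if (!res.ok) throw new Error(`poll failed: HTTP ${res.status}`);
    const op = await res.json();
    if (op.error) throw new Error(`operation failed: ${JSON.stringify(op.error)}`);
    if (op.done) {
      // Step 3: read the download URI (assumed response path).
      const uri = op.response?.generateVideoResponse?.generatedSamples?.[0]?.video?.uri;
      if (!uri) throw new Error("operation finished without a video URI");
      return uri;
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  throw new Error("video generation timed out");
}
```

The returned URI is then downloaded with the same `x-goog-api-key` header, following redirects.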
7.4 Common controls (recommend a unified wrapper)
- `aspectRatio`: `"16:9"` or `"9:16"`
- `resolution`: `"720p" | "1080p" | "4k"` (higher resolutions are usually slower/more expensive)
- When writing prompts: put dialogue in quotes; explicitly call out SFX and ambience; use cinematography language (camera position, movement, composition, lens effects, mood).
- Negative constraints: if the API supports a negative prompt field, use it; otherwise list elements you do not want to see.
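The prompt-writing advice can be folded into a unified wrapper that assembles prompts from structured fields. The sketch below is purely illustrative: `buildVeoPrompt` and the `Camera:` / `SFX:` / `Ambience:` / `Do not show:` labels are hypothetical conventions, not an API requirement.

```javascript
/**
 * Compose a video prompt: scene first, then camera language,
 * quoted dialogue, explicit SFX and ambience, and unwanted elements.
 */
function buildVeoPrompt({ scene, camera, dialogue = [], sfx = [], ambience, avoid = [] }) {
  const lines = [scene];
  if (camera) lines.push(`Camera: ${camera}.`);
  for (const { speaker, line } of dialogue) {
    lines.push(`${speaker} says: "${line}"`); // dialogue goes in quotes
  }
  if (sfx.length) lines.push(`SFX: ${sfx.join(", ")}.`);
  if (ambience) lines.push(`Ambience: ${ambience}.`);
  if (avoid.length) lines.push(`Do not show: ${avoid.join(", ")}.`);
  return lines.join("\n");
}
```

If the API exposes a negative prompt field, the `avoid` list would go there instead of into the prompt text.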
7.5 Important limits (engineering fallback needed)
- Latency can vary from seconds to minutes; implement timeouts and retries.
- Generated videos are only retained on the server for a limited time (download promptly).
- Outputs include a SynthID watermark.
Polling fallback (with timeout/backoff) pseudocode
```js
// Continues from the 7.2 template: `ai` and `operation` are already defined there.
const deadline = Date.now() + 300_000; // 5 min
let sleepMs = 2000;
while (!operation.done && Date.now() < deadline) {
  await new Promise((resolve) => setTimeout(resolve, sleepMs));
  sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000); // back off, capped at 15 s
  operation = await ai.operations.getVideosOperation({ operation });
}
if (!operation.done) throw new Error("video generation timed out");
```
8. Video understanding (Video Understanding)
8.1 Video input options
- Files API upload: recommended when file > 100MB, video length > ~1 minute, or you need reuse.
- Inline video data: for smaller files.
- Direct YouTube URL: can analyze public videos.
8.2 Files API (Node.js) minimal template
```js
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Upload the video once, then reference it by URI.
const uploaded = await ai.files.upload({ file: "sample.mp4" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Summarize this video. Provide timestamps for key events.",
  ]),
});
console.log(response.text);
```
8.3 Timestamp prompting strategy
- Ask for segmented bullets with "(mm:ss)" timestamps.
- Require "evidence with specific time ranges" and include downstream structured extraction (JSON) in the same prompt if needed.
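When the model answers with "(mm:ss)" bullets, a small parser can turn them into structured events for downstream use. This is a sketch under assumptions: `parseTimestampedBullets` is a hypothetical helper, and it expects each event on its own line with a `(mm:ss)` or `(mm:ss-mm:ss)` marker, which the prompt has to request explicitly.

```javascript
// Matches "(mm:ss)" or "(mm:ss-mm:ss)"; minutes may run to three digits.
const TIMESTAMP = /\((\d{1,3}):([0-5]\d)(?:\s*-\s*(\d{1,3}):([0-5]\d))?\)/;

/** Extract { startSec, endSec, label } from each line carrying a timestamp. */
function parseTimestampedBullets(text) {
  const events = [];
  for (const line of text.split("\n")) {
    const m = TIMESTAMP.exec(line);
    if (!m) continue; // ignore lines without a timestamp
    const startSec = Number(m[1]) * 60 + Number(m[2]);
    const endSec = m[3] !== undefined ? Number(m[3]) * 60 + Number(m[4]) : null;
    // Drop the timestamp and any leading bullet marker from the label.
    const label = line.replace(m[0], "").replace(/^[\s\-*•]+/, "").trim();
    events.push({ startSec, endSec, label });
  }
  return events;
}
```

Asking the model for JSON directly is an alternative; the parser is useful when free-text bullets are preferred for readability.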
9. Speech generation (Text-to-Speech, TTS)
9.1 Positioning
- Native TTS: for "precise reading + controllable style" (podcasts, audiobooks, ad voiceover, etc.).
- Distinguish from the Live API: the Live API targets interactive, unstructured audio/multimodal conversation; TTS focuses on controlled narration.
9.2 Single-speaker TTS (Node.js) minimal template
```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [{ parts: [{ text: "Say cheerfully: Have a wonderful day!" }] }],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: "Kore" },
      },
    },
  },
});

// The audio arrives as Base64-encoded raw PCM (see Section 3.3 for the usual format).
const data =
  response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
if (!data) throw new Error("No audio returned");
fs.writeFileSync("out.pcm", Buffer.from(data, "base64"));
```
9.3 Multi-speaker TTS (max 2 speakers)
Requirements:
- Use `multiSpeakerVoiceConfig`.
- Each speaker name must match the dialogue labels in the prompt (e.g., Joe/Jane).
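Both requirements can be enforced before the request is sent. The sketch below builds the `config` object for `generateContent`, using the `speakerVoiceConfigs` shape of the `@google/genai` SDK; `buildMultiSpeakerConfig` is a hypothetical helper, and the `Name:` line-prefix check is an assumption about how the script labels its speakers.

```javascript
/** Escape a string for literal use inside a RegExp. */
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

/**
 * Build a multi-speaker TTS config and verify that every speaker
 * has at least one "Name:" line in the script.
 * @param {string} script dialogue text, e.g. "Joe: Hi\nJane: Hello"
 * @param {{ speaker: string, voiceName: string }[]} speakers at most two entries
 */
function buildMultiSpeakerConfig(script, speakers) {
  if (speakers.length === 0 || speakers.length > 2) {
    throw new Error("multi-speaker TTS takes at most 2 speakers");
  }
  for (const { speaker } of speakers) {
    const label = new RegExp(`^\\s*${escapeRegExp(speaker)}:`, "m");
    if (!label.test(script)) {
      throw new Error(`Speaker "${speaker}" has no "${speaker}:" line in the script`);
    }
  }
  return {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: speakers.map(({ speaker, voiceName }) => ({
          speaker,
          voiceConfig: { prebuiltVoiceConfig: { voiceName } },
        })),
      },
    },
  };
}
```

The returned object is passed as `config` alongside `contents: [{ parts: [{ text: script }] }]`.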
9.4 Voice options and language
- `voice_name` (`voiceName` in the Node.js SDK) supports 30 prebuilt voices (for example Zephyr, Puck, Charon, Kore).
- The model can auto-detect the input language and supports 24 languages (see docs for the list).
9.5 "Director notes" (strongly recommended for high-quality voice)
Provide controllable directions for style, pace, accent, etc., but avoid over-constraining.
10. Audio understanding (Audio Understanding)
10.1 Typical tasks
- Describe audio content (including non-speech like birds, alarms, etc.)
- Generate transcripts
- Transcribe specific time ranges
- Count tokens (for cost estimates/segmentation)
10.2 Files API (Node.js) minimal template
```js
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Upload the audio once, then reference it by URI.
const uploaded = await ai.files.upload({ file: "sample.mp3" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    "Describe this audio clip.",
    createPartFromUri(uploaded.uri, uploaded.mimeType),
  ]),
});
console.log(response.text);
```
10.3 Key limits and engineering tips
- Supports common formats: WAV/MP3/AIFF/AAC/OGG/FLAC.
- Audio tokenization: about 32 tokens/second (about 1920 tokens per minute; values may change).
- Total audio length per prompt is capped at 9.5 hours; multi-channel audio is downmixed; audio is resampled (see docs for exact parameters).
- If total request size exceeds 20MB, you must use the Files API.
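The token rate and length cap above translate into simple arithmetic for cost estimates and segmentation. This is a rough sketch using the figures quoted in this section (32 tokens/second, 9.5 hours per prompt), which may change; `estimateAudioTokens` and `planSegments` are hypothetical helpers, and the `countTokens` API remains the authoritative source for real counts.

```javascript
const TOKENS_PER_SECOND = 32;          // approximate rate quoted above
const MAX_AUDIO_SECONDS = 9.5 * 3600;  // 34,200 s per prompt

/** Rough token estimate for an audio clip of the given duration. */
function estimateAudioTokens(durationSec) {
  return Math.ceil(durationSec * TOKENS_PER_SECOND);
}

/** Split a long recording into consecutive time ranges of bounded length. */
function planSegments(durationSec, maxSegmentSec = MAX_AUDIO_SECONDS) {
  const segments = [];
  for (let start = 0; start < durationSec; start += maxSegmentSec) {
    segments.push({
      startSec: start,
      endSec: Math.min(start + maxSegmentSec, durationSec),
    });
  }
  return segments;
}
```

For example, a one-minute clip comes out at about 1,920 tokens, matching the per-minute figure above.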
11. End-to-end examples (composition)
Example A: Image generation -> validation via understanding
- Generate product images with Nano Banana (require negative space, consistent lighting).
- Use image understanding for self-check: verify text clarity, brand spelling, and unsafe elements.
- If not satisfied, feed the generated image into text+image editing and iterate.
Example B: Video generation -> video understanding -> narration script
- Generate an 8-second shot with Veo (include dialogue or SFX).
- Download and save (respect retention window).
- Upload video to video understanding to produce a storyboard + timestamps + narration copy (then feed to TTS).
Example C: Audio understanding -> time-range transcription -> TTS redub
- Upload meeting audio and transcribe full content.
- Transcribe or summarize specific time ranges.
- Use TTS to generate a "broadcast" version of the summary.
12. Compliance and risk (must follow)
- Ensure you have the necessary rights to upload images/video/audio; do not generate infringing, deceptive, harassing, or harmful content.
- Generated images and videos include SynthID watermarking; videos may also have regional/person-based generation constraints.
- Production systems must implement timeouts, retries, failure fallbacks, and human review/post-processing for generated content.
13. Quick reference (Checklist)
- Pick the right model: image generation (Flash Image / Pro Image Preview), video generation (Veo 3.1), TTS (Gemini 2.5 TTS), understanding (Gemini Flash/Pro).
- Pick the right input mode: inline for small files; Files API for large/reuse.
- Parse binary outputs correctly: image/audio via inline_data decode; video via operation polling + download.
- For video generation: set aspectRatio / resolution, and download promptly (avoid expiration).
- For TTS: set response_modalities=["AUDIO"]; max 2 speakers; speaker names must match prompt.
- For audio understanding: countTokens when needed; segment long audio or use Files API.