Audio Jingle Skill

Three sub-modes. The active project's `audioKind` decides which one runs:

| `audioKind` | Models we route to | Plan focus |
| --- | --- | --- |
| `music` | Suno V5 (default), Udio, Lyria 2 | genre + tempo + instrumentation |
| `speech` | MiniMax TTS (default), Fish, ElevenLabs V3 | script + voice + pacing |
| `sfx` | ElevenLabs SFX (default), AudioCraft | texture + impact + duration |
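Read as data, the table above becomes a small lookup. The dict and helper below are a sketch for illustration; the tuple layout and names are not part of the skill:

```python
# Default model and planning focus per audioKind, per the routing table above.
DEFAULTS = {
    "music": ("Suno V5", "genre + tempo + instrumentation"),
    "speech": ("MiniMax TTS", "script + voice + pacing"),
    "sfx": ("ElevenLabs SFX", "texture + impact + duration"),
}

def plan_focus(audio_kind: str) -> str:
    """Return the planning emphasis for a given audioKind."""
    return DEFAULTS[audio_kind][1]
```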

Resource map

```
audio-jingle/
├── SKILL.md
└── example.html
```

Workflow

Step 0 — Read the project metadata

Read `audioKind`, `audioModel`, `audioDuration` (seconds), and (for speech) `voice`. Branch by `audioKind` and use the values verbatim — no clarifying form unless something is marked `(unknown — ask)`.

Important: `voice` is provider-specific. For `minimax-tts`, `--voice` must be a valid MiniMax `voice_id` (for example `male-qn-qingse`), not a natural-language description. If you only have a prose voice brief ("warm female narrator", "neutral Mandarin"), keep that in your plan but omit `--voice` so the daemon's default voice id applies, or ask the user to choose a specific id.
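The voice rule above can be sketched as a small guard. This helper and its regex heuristic are assumptions for illustration only, not part of the dispatcher:

```python
import re
from typing import Optional

def voice_flag_for(audio_model: str, voice: Optional[str]) -> Optional[str]:
    """Value safe to pass as --voice, or None to omit the flag entirely."""
    if not voice:
        return None
    if audio_model == "minimax-tts":
        # MiniMax needs a voice_id like "male-qn-qingse", never prose.
        # Heuristic (assumption): ids contain only word characters and
        # hyphens; anything with spaces or punctuation is a prose brief.
        if re.fullmatch(r"[A-Za-z0-9_-]+", voice):
            return voice
        return None  # keep the brief in the plan; daemon default applies
    return voice  # other providers: pass through whatever metadata supplied
```

For example, `voice_flag_for("minimax-tts", "warm female narrator")` returns `None`, so the flag is dropped and the default voice applies.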

Step 1 — Plan

Music
  • Genre + reference artists (1-2)
  • Tempo (BPM) + key
  • Instrumentation (3-5 instruments max)
  • Vocals: yes / no / hummed / choir
  • Mood arc (intro → chorus → outro)
Speech
  • Script (final, not draft — TTS runs verbatim)
  • Voice target + pacing. For MiniMax this means a real `voice_id`, not prose in `--voice`.
  • Pronunciation hints for proper nouns / acronyms
SFX
  • Texture (impact / whoosh / ambience / foley)
  • Duration + envelope (sharp attack vs. gentle swell)
  • Layering note (single hit vs. stacked)
State the plan in 2-3 sentences before dispatching.

Step 2 — Compose the prompt

Use the format the upstream model prefers. Bind `audioDuration` to the API parameter directly; never put "make it 30 seconds" in prose.
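One way to keep that binding honest is to assemble the flag list programmatically. The helper below is a hypothetical sketch built from the flags shown in Step 3; it is not a real part of the CLI:

```python
from typing import List, Optional

def build_dispatch_args(project: str, kind: str, model: str,
                        duration_s: int, output: str, prompt: str,
                        voice: Optional[str] = None) -> List[str]:
    """Arguments for `node $OD_BIN media generate` (sketch only)."""
    args = [
        "media", "generate",
        "--project", project,
        "--surface", "audio",
        "--audio-kind", kind,
        "--model", model,
        "--duration", str(duration_s),  # audioDuration bound here, not in prose
        "--output", output,
        "--prompt", prompt,
    ]
    if voice:  # speech only, and only with a real provider voice id
        args += ["--voice", voice]
    return args
```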

Step 3 — Dispatch via the media contract

Use the unified dispatcher — do not call provider APIs by hand:

```bash
node "$OD_BIN" media generate \
  --project "$OD_PROJECT_ID" \
  --surface audio \
  --audio-kind "<music|speech|sfx>" \
  --model "<audioModel from metadata>" \
  --duration <audioDuration seconds> \
  [--voice "<provider voice id (speech only)>"] \
  --output "<short-slug>-<duration>s.mp3" \
  --prompt "<assembled prompt from Step 2 — for speech, the literal script>"
```

The command prints one line of JSON: `{"file": {"name": "...", ...}}`. The bytes land in the project; the FileViewer renders the audio transport controls automatically.
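For the Step 4 hand-off, the filename can be pulled from that single JSON line. A minimal sketch (the extra field shown in the test is illustrative; only `file.name` is documented above):

```python
import json

def saved_filename(stdout_line: str) -> str:
    """Extract file.name from the dispatcher's one-line JSON output."""
    return json.loads(stdout_line)["file"]["name"]
```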

Step 4 — Hand off

Reply with: plan summary, the filename returned by the dispatcher, and one sentence on what to try if the user wants a variation (e.g. "swap tempo from 92 to 108 BPM" rather than "make it different").

Hard rules

  • TTS runs your script literally. Proof it before dispatching — even one stray comma changes the cadence.
  • MiniMax TTS rejects free-form voice prose in `--voice`. Use a real MiniMax `voice_id` (for example `male-qn-qingse`) or omit the flag and let the daemon's default voice apply.
  • Music: under 30s = single section; 30–90s = intro + body; 90s+ = full arc. Don't try to fit a 3-act song into 15 seconds.
  • SFX: prefer one well-described layer over a paragraph of "make it cool" — generators reward specific texture words.
  • Save the file every turn. The audio viewer shows transport controls the moment the file lands.
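The music-duration rule reads as a simple threshold mapping, sketched here for reference (boundary handling at exactly 30s and 90s is an assumption):

```python
def music_structure(duration_s: float) -> str:
    """Map a jingle length to the structure the rules above allow."""
    if duration_s < 30:
        return "single section"
    if duration_s < 90:
        return "intro + body"
    return "full arc"
```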