Loading...
Loading...
Found 196 Skills
Convert written documents to narrated video scripts with TTS audio and word-level timing. Use when preparing essays, blog posts, or articles for video narration. Outputs scene files, audio, and VTT with precise word timestamps. Keywords: narration, voiceover, TTS, scenes, audio, timing, video script, spoken.
Speak text aloud using system TTS (say command on macOS/Linux) or browser TTS via Chrome DevTools Protocol. Use when: (1) job completes and you want to announce results, (2) user asks to hear something spoken, (3) notifications that need audio alerts, (4) accessibility - reading content aloud.
Use when the user gives a topic and wants an automated topic-driven narrated explainer, podcast, or knowledge-summary video (Bilibili / YouTube / Xiaohongshu / Douyin / WeChat Channels), or asks to learn visual design patterns from a reference video/image. Trigger when the user mentions creating a knowledge video, narrated explainer, video podcast, or talking-head topic video from a topic — even if they don't say "video podcast" explicitly. Do NOT trigger for generic video editing, trimming, format conversion, color grading, or non-narrative video tasks. Produces 4K video via research → script → TTS → Remotion → MP4 + BGM.
Build with OpenAI's stateless APIs - Chat Completions (GPT-5, GPT-4o), Embeddings, Images (DALL-E 3), Audio (Whisper + TTS), and Moderation. Includes Node.js SDK and fetch-based approaches for Cloudflare Workers. Use when: implementing chat completions with GPT-5/GPT-4o, streaming responses with SSE, using function calling/tools, creating structured outputs with JSON schemas, generating embeddings for RAG (text-embedding-3-small/large), generating images with DALL-E 3, editing images with GPT-Image-1, transcribing audio with Whisper, synthesizing speech with TTS (11 voices), moderating content (11 safety categories), or troubleshooting rate limits (429), invalid API keys (401), function calling failures, streaming parse errors, embeddings dimension mismatches, or token limit exceeded.
Generate (TTS), Transcribe (STT), and Clone voices using Google's GenAI and Cloud Speech SDKs. Supports Gemini-TTS, Chirp 3, and Instant Custom Voice.
Local text-to-speech on Ubuntu using Kokoro TTS with fallbacks. Use when the user asks to speak text out loud, test audio output, switch Kokoro voices, or debug TTS playback issues. Triggers on "say this", "read aloud", "speak", "TTS", "voice test".
Text-to-speech (TTS) and speech-to-text (STT) via Together AI. TTS models include Orpheus, Kokoro, Cartesia Sonic, Rime, MiniMax with REST, streaming, and WebSocket support. STT models include Whisper and Voxtral. Use when users need voice synthesis, audio generation, speech recognition, transcription, TTS, STT, or real-time voice applications.
Use when the user has a video + a target-language SRT and wants the video to actually speak that language — generates a time-aligned TTS voice dub. Routes by voice ID — Volcano (豆包) TTS for Chinese, edge-tts neural for any language. Defaults to one voice (single-speaker); opt-in multi-speaker via visual diarization. Outputs `*_<lang>_dub.mp4` with the dub audio in place of the original. Final mixing (audio bed + burn-in) is handed off to `/wjs-burning-subtitles`. Triggers — "配音", "中文配音", "Chinese dub", "voice over this", "dub the video", "TTS this SRT", "different voice for each speaker".
AI media generation CLI tool using Google's Imagen 4, Veo 3.1, and Gemini TTS. Use when the user wants to (1) generate images from text prompts, (2) edit existing images with AI, (3) explain image contents, (4) generate videos from text or images, (5) create narration/voice audio with character settings. Triggers on requests like "generate an image of...", "create a video...", "make a voice that says...", "edit this image to...", "describe this image".
Generate speech audio from text using HeyGen's Starfish TTS model. Use when: (1) Generating standalone speech audio files from text, (2) Converting text to speech with voice selection, speed, and pitch control, (3) Creating audio for voiceovers, narration, or podcasts, (4) Working with HeyGen's /v1/audio endpoints, (5) Listing available TTS voices by language or gender.
MiniMax multimodal model skill — use MiniMax Multi-Modal models for speech, music, video, and image. Create voice, music, video, and images with MiniMax AI: TTS (text-to-speech, voice cloning, voice design, multi-segment), music (songs, instrumentals), video (text-to-video, image-to-video, start-end frame, subject reference, templates, long-form multi-scene), image (text-to-image, image-to-image with character reference), and media processing (convert, concat, trim, extract). Use when the user mentions MiniMax, multimodal generation, or wants speech/music/video/image AI, MiniMax APIs, or FFmpeg workflows alongside MiniMax outputs.
Select and create the perfect AI voice for your content using ElevenLabs, Qwen3-TTS, and other platforms—matching voice characteristics to brand personality and audience. Use when: Choosing an AI voice for video narration; Creating a consistent brand voice across content; Cloning a voice for scalable production; Comparing voice synthesis platforms; Designing voice characteristics by description