Found 48 Skills
An FFmpeg-based audio processing skill that provides the most practical commands for everyday audio processing
Use when processing audio for Xiaohongshu content, editing voiceovers, improving sound quality, creating podcasts, or producing audio-based posts
Text-to-speech, speech-to-text, voice conversion, and audio processing using EachLabs AI models. Supports ElevenLabs TTS, Whisper transcription with diarization, and RVC voice conversion. Use when the user needs TTS, transcription, or voice conversion.
Use this skill to create single-voice audio content like audiobooks, voiceovers, narrations, jingles, and audio ads. Triggers: "create audiobook", "generate voiceover", "narration", "audio ad", "radio ad", "jingle", "brand audio", "sonic logo", "text to audio", "read this aloud", "audio guide", "meditation audio", "soundscape". Orchestrates: narration/TTS, background music, and audio assembly. NOTE: For conversations/dialogues, use podcast-producer instead.
ASR (Automatic Speech Recognition): enhanced speech-to-text built on the Doubao large model, with audio preprocessing, denoising, and extended analysis capabilities. Async API. Choose this skill when: the input is a video file (mp4/mov/mkv), in which case the audio track is extracted automatically; the audio needs denoising before recognition; the file exceeds 512MB or 5 hours (no size limit); the audio source is a TOS internal path (tos://bucket/key); you need structured JSON output with timestamped utterances and metadata; or you need speaker diarization, emotion/gender detection, speech rate, or sensitive word filtering. Supports 99 languages, multiple formats (wav/mp3/m4a/aac/flac/ogg/mp4/mov/mkv), and auto language detection.
Transcribe audio files to text with optional diarization and known-speaker hints. Use when a user asks to transcribe speech from audio/video, extract text from recordings, or label speakers in interviews or meetings.
Local speech-to-text with the Whisper CLI (no API key).
Internal utility skill for media assembly operations. NOT called directly by users. Used by producer skills (video-producer, podcast-producer, audio-producer, social-producer) to stitch, mix, and assemble final media outputs.
Vision, audio, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, or building multimodal AI pipelines.
Help users integrate Runway audio APIs (TTS, sound effects, voice isolation, dubbing)
Expert skill for implementing speech-to-text with Faster Whisper. Covers audio processing, transcription optimization, privacy protection, and secure handling of voice data for the JARVIS voice assistant.
Convert audio/video to text using Whisper, with support for word-level timestamps. Use this when users need speech-to-text conversion, audio-to-text transcription, video-to-text extraction, subtitle generation, or speech recognition, or when they ask to "transcribe audio", "speech to text", or "generate subtitles".