Loading...
Loading...
Found 18 Skills
Convert audio/video to text using Whisper, with support for word-level timestamps. Use this when users need speech-to-text conversion, audio-to-text transcription, video-to-text extraction, subtitle generation, transcribe audio, speech to text, generate subtitles, or speech recognition.
Text-to-speech, speech-to-text, voice conversion, and audio processing using EachLabs AI models. Supports ElevenLabs TTS, Whisper transcription with diarization, and RVC voice conversion. Use when the user needs TTS, transcription, or voice conversion.
Transcribe audio files to text with optional diarization and known-speaker hints. Use when a user asks to transcribe speech from audio/video, extract text from recordings, or label speakers in interviews or meetings.
Video and audio processing with FFmpeg. Use for format conversion, resizing, compression, audio extraction, and preparing assets for Remotion. Triggers include converting GIF to MP4, resizing video, extracting audio, compressing files, or any media transformation task.
Expert skill for implementing speech-to-text with Faster Whisper. Covers audio processing, transcription optimization, privacy protection, and secure handling of voice data for JARVIS voice assistant.
Local speech-to-text with the Whisper CLI (no API key).
FFmpeg video and audio processing patterns. Use when transcoding video/audio, extracting clips, adding filters, merging media, creating thumbnails, or batch processing media files.
Text-to-Speech Tool - Supports script parsing, emotion tagging, and post-processing, based on Edge TTS
Internal utility skill for media assembly operations. NOT called directly by users. Used by producer skills (video-producer, podcast-producer, audio-producer, social-producer) to stitch, mix, and assemble final media outputs.
Vision, audio, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, or building multimodal AI pipelines.
Read, watch, and listen to video/audio files. Use Gemini for native video understanding, or extract key frames + Whisper transcription as fallback. Use when a user sends a video/audio and asks about its content, what's in it, what someone said, etc.
Use Chanjing TTS API to synthesize speech from text, using user-provided voice