Search Results: multimodal-ai

Found 33 Skills

AI & Machine Learning | agentspace-so/runcomfy-ag...

seedance-v2

Generate cinematic short-form video with ByteDance Seedance 2.0 Pro on RunComfy. Documents Seedance 2.0 Pro's strengths (multi-modal references — up to 9 images, 3 videos, 3 audio — synchronized in-pass audio with natural lip-sync, cinematic motion refinement), the 4–15s duration schema, and when to route to HappyHorse 1.0 / Wan 2.7 / Kling instead. Calls `runcomfy run bytedance/seedance-v2/pro` through the local RunComfy CLI. Triggers on "seedance", "seedance 2", "seedance v2", "seedance pro", "bytedance video", or any explicit ask to generate video with this model.

🇺🇸 | English | Translated

276.5k
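
The limits quoted in the seedance-v2 entry above (up to 9 images, 3 videos, 3 audio references, and a 4–15s duration) can be checked before a job is submitted. The following is a minimal sketch only: `SeedanceRequest` and `validate` are hypothetical names, the field layout is an assumption rather than the RunComfy CLI's actual schema, and the duration bounds are assumed to be inclusive whole seconds.

```python
from dataclasses import dataclass, field

# Limits as quoted in the seedance-v2 description; treated as inclusive (assumption).
MAX_IMAGES, MAX_VIDEOS, MAX_AUDIO = 9, 3, 3
MIN_SECONDS, MAX_SECONDS = 4, 15


@dataclass
class SeedanceRequest:
    """Hypothetical request shape; not the RunComfy CLI's real schema."""
    prompt: str
    duration_s: int
    images: list = field(default_factory=list)
    videos: list = field(default_factory=list)
    audio: list = field(default_factory=list)


def validate(req: SeedanceRequest) -> list:
    """Return a list of problems; an empty list means the request fits the quoted limits."""
    problems = []
    if not MIN_SECONDS <= req.duration_s <= MAX_SECONDS:
        problems.append(
            f"duration {req.duration_s}s is outside {MIN_SECONDS}-{MAX_SECONDS}s"
        )
    for name, items, cap in (
        ("image", req.images, MAX_IMAGES),
        ("video", req.videos, MAX_VIDEOS),
        ("audio", req.audio, MAX_AUDIO),
    ):
        if len(items) > cap:
            problems.append(f"{len(items)} {name} references exceed the limit of {cap}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every violation at once.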

AI & Machine Learning | google-gemini/gemini-skil...

gemini-api-dev

Use this skill when building applications with Gemini models or the Gemini API, working with multimodal content (text, images, audio, video), implementing function calling, using structured outputs, or looking up current model specifications. Covers SDK usage (google-genai for Python, @google/genai for JavaScript/TypeScript), model selection, and API capabilities.

🇺🇸 | English | Translated

AI & Machine Learning | mindrally/skills

transformers-huggingface

Expert guidance for working with the Hugging Face Transformers library on NLP, computer vision, and multimodal AI tasks.

🇺🇸 | English | Translated

AI & Machine Learning | sickn33/antigravity-aweso...

ai-engineer

Build production-ready LLM applications, advanced RAG systems, and intelligent agents. Implements vector search, multimodal AI, agent orchestration, and enterprise AI integrations. Use PROACTIVELY for LLM features, chatbots, AI agents, or AI-powered applications.

🇺🇸 | English | Translated

AI & Machine Learning | modelstudioai/skills

bailian-docs-llm-wiki

Technical documentation knowledge base (LLM Wiki) for the Alibaba Cloud Bailian platform. Activated when users ask about Bailian topics such as model lists, API parameters, error codes, application development (Agent/RAG/Knowledge Base/Memory/Plugins), model comparison and pricing, SDK and OpenAI-compatible interfaces, multimodal capabilities (speech/image/video), and token billing. It includes structured model catalog data under models (including contextWindow, QPM, pricing, and sample code), a wiki synthesis layer (topic, concept, and comparison pages), and a raw layer of original documents. For model specification questions, check models/index.md first; for documentation questions, check wiki/index.md first.

🇨🇳 | Chinese | Translated
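
The lookup order stated in the bailian-docs-llm-wiki entry above (models/index.md first for model specification questions, wiki/index.md first for documentation questions) could be sketched roughly as below. Only the two file paths come from the entry; the function name `first_lookup` and the keyword list are hypothetical, and a real skill would more likely leave this classification to the model than to substring matching.

```python
# Entry points named in the skill description.
MODELS_INDEX = "models/index.md"
WIKI_INDEX = "wiki/index.md"

# Hypothetical hints that a question is about model specifications.
MODEL_SPEC_HINTS = ("context window", "contextwindow", "qpm", "pricing", "model list")


def first_lookup(question: str) -> str:
    """Pick which index file to consult first for a given question."""
    q = question.lower()
    if any(hint in q for hint in MODEL_SPEC_HINTS):
        return MODELS_INDEX
    # Everything else is treated as a documentation question.
    return WIKI_INDEX
```

For example, `first_lookup("What is the QPM limit?")` selects the models index, while a question about setting up a knowledge base falls through to the wiki index.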

AI & Machine Learning | jezweb/claude-skills

google-gemini-api

Integrate Gemini API with @google/genai SDK (NOT deprecated @google/generative-ai). Text generation, multimodal (images/video/audio/PDFs), function calling, thinking mode, streaming. 1M input tokens. Prevents 14 documented errors. Use when: Gemini integration, multimodal AI, reasoning with thinking mode. Troubleshoot: SDK deprecation, model not found, context window, function calling errors, streaming corruption, safety settings, rate limits.

🇺🇸 | English | Translated

15 scripts/Attention

AI & Machine Learning | cclank/openclaw_provider_...

bailian-multimodal-skills

Generate images, video, and speech, and transcribe audio, using Aliyun Bailian models.

🇺🇸 | English | Translated

1 scripts/Checked

AI & Machine Learning | cnemri/google-genai-skill...

google-genai-sdk-python

Expert guidance for writing Python code using the official Google GenAI SDK (google-genai) for Gemini API and Vertex AI. Use for text generation, multimodal inputs, reasoning, tools, and media generation.

🇺🇸 | English | Translated

AI & Machine Learning | songguoxs/seedance-prompt...

seedance

This skill should be used when the user asks to "generate video prompts", "create Seedance prompts", "write video descriptions", mentions "Seedance", "seedance", "Jimeng", "Jimeng Platform", "video prompts", "video generation", "AI video", "short drama", "advertising video", "video extension", or discusses video prompt engineering, AI video generation, or Seedance 2.0 workflows.

🇨🇳 | Chinese | Translated

AI & Machine Learning | samhvw8/dot-claude

ai-multimodal

Multimodal AI processing via Google Gemini API (2M tokens context). Capabilities: audio (transcription, 9.5hr max, summarization, music analysis), images (captioning, OCR, object detection, segmentation, visual Q&A), video (scene detection, 6hr max, YouTube URLs, temporal analysis), documents (PDF extraction, tables, forms, charts), image generation (text-to-image, editing). Actions: transcribe, analyze, extract, caption, detect, segment, generate from media. Keywords: Gemini API, audio transcription, image captioning, OCR, object detection, video analysis, PDF extraction, text-to-image, multimodal, speech recognition, visual Q&A, scene detection, YouTube transcription, table extraction, form processing, image generation, Imagen. Use when: transcribing audio/video, analyzing images/screenshots, extracting data from PDFs, processing YouTube videos, generating images from text, implementing multimodal AI features.

🇺🇸 | English | Translated

6 scripts/Attention
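
The ai-multimodal entry above quotes maximum media lengths of 9.5 hours for audio and 6 hours for video. As a minimal sketch, assuming those figures are hard per-request caps, a pre-flight check could decide whether a file fits and how many chunks it would need otherwise. `fits_limit` and `chunks_needed` are hypothetical helpers and do not call the Gemini API.

```python
from datetime import timedelta

# Maximum durations as quoted in the skill description.
LIMITS = {
    "audio": timedelta(hours=9, minutes=30),
    "video": timedelta(hours=6),
}


def fits_limit(kind: str, duration: timedelta) -> bool:
    """True if a file of this kind and duration is within the quoted limit."""
    if kind not in LIMITS:
        raise ValueError(f"unknown media kind: {kind!r}")
    return duration <= LIMITS[kind]


def chunks_needed(kind: str, duration: timedelta) -> int:
    """Smallest number of equal-or-shorter chunks that each fit the limit."""
    if kind not in LIMITS:
        raise ValueError(f"unknown media kind: {kind!r}")
    # Ceiling division on timedeltas: floor-divide the negated duration, then negate.
    return max(1, -((-duration) // LIMITS[kind]))
```

Under these assumptions, a 10-hour audio file would need two chunks, while a 6-hour video fits in a single request.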

AI & Machine Learning | linkfox-ai/linkfox-skills

linkfox-multimodal-recognize-image

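Image recognition and analysis based on multimodal AI. Trigger this skill when the user wants to analyze or describe an image, or extract information from an image URL (image recognition, image analysis, image description, image content understanding, OCR text recognition, visual Q&A). Trigger when the user mentions image recognition, image analysis, image description, identifying image content, analyzing product images, reading text from images, describing images, extracting visual content, or understanding photo content. Also trigger when the user provides an image URL and asks about its visual content, even without explicitly saying "image recognition".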

🇺🇸 | English | Translated

2 scripts/Checked

AI & Machine Learning | binhmuc/autobot-review

ai-multimodal

Process and generate multimedia content using the Google Gemini API for stronger vision capabilities. Capabilities: analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis, up to 9.5 hours); understanding images (better image analysis than Claude models, captioning, reasoning, object detection, design extraction, OCR, visual Q&A, segmentation, multiple images); processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours); extracting from documents (PDF tables, forms, charts, diagrams, multi-page); generating images (text-to-image with Imagen 4, editing, composition, refinement); and generating videos (text-to-video with Veo 3, 8-second clips with native audio). Use when working with audio/video files, analyzing images or screenshots (in place of Claude's default vision capabilities, falling back to them only if needed), processing PDF documents, extracting structured data from media, creating images or videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.

🇺🇸 | English | Translated

7 scripts/Attention