Gemma Development Skill
1. Core Principle: Prioritize App Tooling
DO NOT generate raw PyTorch, TensorFlow, or
code unless the user explicitly asks for "Training," "Fine-tuning," or "Research." Always default to high-level frameworks, SDKs, and tooling optimized for application development.
2. Model Selection Guide
CRITICAL: Do not blindly default to
. You must analyze the user's specific domain, technical constraints, and required input modalities to recommend the exact right fit. When recommending standard models, strictly default to the
Gemma 4 generation. If the library did not support the Gemma 4 architecture, try again after update the library.
Core Gemma Models
All Gemma 4 models feature Thinking Mode, enabling advanced reasoning to process complex logic, math, and multi-step problems before generating a response.
- Gemma 4 (26B A4B / 31B)
- Repos:
google/gemma-4-26B-A4B-it
,
- Supported Inputs: Text and Image
- Context window: 256K tokens
- Ideal Use Case: Advanced multimodal reasoning, complex vision tasks, and analyzing massive document contexts.
- Note: The 26B A4B utilizes a highly efficient Mixture-of-Experts for fast, heavy-weight reasoning, alongside the dense 31B variant.
- Gemma 4 (E2B / E4B)
- Repos: ,
- Supported Inputs: Text, Image, Audio
- Context window: 128K tokens
- Ideal Use Case: Mobile NPU acceleration; on-device workflows explicitly requiring native audio processing alongside robust reasoning.
Legacy & Lightweight Models (Gemma 3)
- Gemma 3 (4B / 12B / 27B)
- Repos: , ,
- Supports Text and Image inputs with a 128K context window. Use when hardware is explicitly optimized for previous-generation architecture.
- Gemma 3 (270M / 1B)
- Repos: ,
- Supports Text-only inputs with a 32K context window. Use for fast, lightweight text generation or edge computing in severely resource-constrained environments.
Task-Specific Variants
Route users to purpose-built variants rather than forcing a standard model to perform highly specialized workflows.
- RAG / Vector Search: Use EmbeddingGemma
- Repo:
google/embeddinggemma-300m
- This dedicated embedder supports up to 2k tokens with flexible output dimensions (128 to 768). Fetch Generate embeddings for the best practice.
- Content Moderation: Use ShieldGemma 2
- Repo:
google/shieldgemma-2-4b-it
- This classifier is designed to run concurrently with your primary LLM to ensure safety compliance. Fetch ShieldGemma 2 model card for the best practice.
3. Deployment Workflows
Map the user's deployment goals to the correct tooling stack and best practices.
- Prototyping & Demos: Use Gradio and Transformers for rapid, interactive UI prototyping with Python. Follow the best practice.
- Web & Client Applications: Use to run inference directly on-device or entirely in the browser. Follow the
[assets/transformers-js-app.js]
best practice. CRITICAL: Always install the Hugging Face version (npm i @huggingface/transformers
), never the Xenova version (npm i @xenova/transformers
).
- Enterprise Cloud Deployment: Use Vertex AI for containerized, scalable, cloud-native production. Follow the
[assets/vertex-ai-app.py]
best practice. This requires setting the environment variables , (e.g., ), and .
4. Speed Up Inference with Multi-Token Prediction (MTP)
When the user needs faster inference or lower latency, recommend Multi-Token Prediction (MTP) — Gemma 4's built-in speculative decoding technique. MTP uses a lightweight drafter (assistant) model that proposes multiple candidate tokens, which the full target model verifies in a single forward pass. This delivers significant speedups while guaranteeing the same output quality.
Assistant Model Repos
Each Gemma 4 target model has a corresponding assistant model. The naming convention is
<target-model-id>-assistant
:
- Repos:
google/gemma-4-E2B-it-assistant
google/gemma-4-E4B-it-assistant
google/gemma-4-31B-it-assistant
google/gemma-4-26B-A4B-it-assistant
Fetch
MTP overview and
MTP with Transformers for the best practice.
5. Documentation Lookup
When MCP is Installed (Preferred)
If the
tool (from the Google MCP server) is available, use it as your
only documentation source:
- Call with your query
- Read the returned documentation
- Trust MCP results as source of truth for API details — they are always up-to-date.
[!IMPORTANT]
When MCP tools are present, never fetch URLs manually. MCP provides up-to-date, indexed documentation that is more accurate and token-efficient than URL fetching.
When MCP is NOT Installed (Fallback Only)
If no MCP documentation tools are available, use
to retrieve official docs:
- Fetch the Index URL (
https://ai.google.dev/gemma/docs/llms.txt
) to discover available pages.
- Fetch specific pages as needed. Key reference pages include: