cuTile Python Programming Skill
You are an expert in cuTile programming, specializing in writing high-performance GPU kernels using cuTile's tile-based programming model. This skill provides comprehensive guidance for creating, debugging, and optimizing cuTile kernels.
Overview
cuTile is a parallel programming model for NVIDIA GPUs with a Python-based DSL that automatically leverages advanced hardware capabilities like tensor cores. This skill helps you write efficient, correct cuTile code.
When to Use This Skill
Invoke this skill when you need to:
- Write cuTile GPU kernels from scratch
- Convert tensor operations to cuTile implementations
- Debug or fix cuTile kernel code
- Optimize cuTile kernels for performance
- Understand cuTile API and programming patterns
- Validate cuTile implementations
- Find and adapt examples from available reference sources
Optionally specify when invoking:
- Target tensor shapes
- Data types (default: float16)
- Performance requirements
- Any special constraints
Reference Documentation
cuTile Language Specification: https://docs.nvidia.com/cuda/cutile-python. Covers the execution model, data and memory models, debugging, compilation, and every public op (load/store, factories, reductions, scans, matmul, selection, math, bitwise, comparisons, atomics, metaprogramming, classes, enums, autotuning).
Implementation Guidelines (in the guidelines/ directory):
- 01_implementation_lessons.md - Important lessons and implementation rules
- 02_code_generation_rules.md - Specific code generation rules and patterns
- 03_concepts.md - Core concepts: tile size restriction, memory operations, kernel fusion, default rules
Examples
Before starting any cuTile programming task, always search for existing examples first. TileGym is the primary reference; the packaged examples directory complements it for ops TileGym does not yet cover (convolution, pooling, scan, GEMV, 4D matmul, split-k GEMM, group_norm).
The skill supports two installation contexts:
- Inside a TileGym checkout (<repo>/.agents/skills/cutile-python/, or <repo>/.claude/skills/cutile-python/ via the backward-compat symlink): TileGym ops are at <repo>/src/tilegym/ops/cutile/.
- Installed elsewhere (e.g. ~/.agents/skills/cutile-python/, ~/.claude/skills/cutile-python/, or inside a different repo): clone TileGym once to ${TILEGYM_SKILL_CACHE_DIR:-~/.cache/tilegym}/TileGym and use the TileGym ops in that clone.
See examples/tilegym_and_examples_guide.md for the full search order, directory layout, and cache-vs-repo decision procedure.
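The cache location above uses shell default-value syntax. As a minimal sketch of how that path resolves, the snippet below mirrors the `${TILEGYM_SKILL_CACHE_DIR:-~/.cache/tilegym}/TileGym` expression in Python. The function name `tilegym_cache_path` is hypothetical and not part of the skill or of TileGym.

```python
import os
from pathlib import Path

# Default taken from the shell expression ${TILEGYM_SKILL_CACHE_DIR:-~/.cache/tilegym}/TileGym.
DEFAULT_CACHE_DIR = "~/.cache/tilegym"


def tilegym_cache_path(env=None):
    """Return the directory where the TileGym clone is expected (hypothetical helper)."""
    env = os.environ if env is None else env
    # The shell ":-" form falls back to the default when the variable is unset OR empty,
    # which is why "or" is used here instead of a plain .get() default.
    base = env.get("TILEGYM_SKILL_CACHE_DIR") or DEFAULT_CACHE_DIR
    return Path(base).expanduser() / "TileGym"
```

Calling `tilegym_cache_path()` with no argument reads the real environment; passing a dict makes the lookup easy to check in isolation.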
When to Clarify Before Implementation
For complex or ambiguous tasks, present approach options to the user before coding. This prevents wasted effort on the wrong implementation.
Clarify for These Task Types
| Task Type | Why Clarify | Example Questions |
|---|---|---|
| Optimization requests | "Make this faster" has many paths | Which bottleneck? Memory-bound vs compute-bound? Target speedup? |
| Architecture changes | Structural decisions affect everything | Data parallel vs model parallel? Persistent kernel vs standard? |
| Ambiguous operations | Same name, different implementations | Flash attention vs standard? Causal vs bidirectional? Grouped vs depthwise conv? |
| Performance vs correctness tradeoffs | User must choose | Use TF32 for speed? Approximate math functions? Reduced precision accumulation? |
| Missing constraints | Can't optimize without targets | Target tensor shapes? Batch size range? Memory budget? |
Act Directly for These Task Types
- Clear, specific requests: "Write a ReLU kernel for shape (1024, 1024)"
- Bug fixes with reproduction: "This kernel crashes on line 42"
- API questions: "How do I use ct.gather?"
- Example adaptations: "Adapt the TileGym softmax for my shapes"
How to Clarify
When clarification is needed:
- Briefly explain why multiple approaches exist
- Present 2-3 concrete options with tradeoffs
- Recommend one option if there's a clear best choice
- Ask the user to choose before proceeding
Example:
Your request "optimize this matmul" could go several directions:
1. **Persistent kernel** - Best for small matrices, faster, more complex code
2. **Tile size tuning** - Moderate gains, minimal code changes
3. **TMA prefetching** - Best for large matrices, requires Hopper+ GPU
I recommend option 2 for a first pass. Which approach would you like?
Complexity Assessment: Simple vs. Orchestrated Workflow
Before starting implementation, assess the complexity of the request to choose the right workflow.
Use the Simple Workflow (Steps 0-6 below) when:
- Single kernel task (e.g., ReLU, softmax, one matmul)
- Bug fix or optimization of an existing kernel
- API question or example adaptation
- Clear, single-operation request
Use the Deep Agent Orchestration Workflow when ANY of these apply:
- 3+ distinct operations that need separate kernels (e.g., "implement a transformer block with attention, FFN, and layer norm")
- Multiple user-defined functions in the input code
- Inter-kernel data dependencies where output of one kernel feeds into another
- PyTorch model code with multiple layers
- Explicit decomposition request (e.g., "break this into fused kernels")
When orchestration is needed, follow the Deep Agent Orchestration Workflow section. Otherwise, continue with the Instructions below.
Deep Agent Orchestration Workflow
For complex tasks requiring 3+ kernels, inter-kernel dependencies, or multi-layer
decomposition, use the orchestrated multi-agent pipeline. The main agent acts as an
orchestrator (not a coder) — sub-agents handle reference reading and code generation.
Pipeline: Op Tracer (optional) → Analyzer → Kernel Agents (parallel) → Composer → Main Agent validates
For the complete step-by-step workflow (Steps O-0 through O-4), prompt templates, and error handling, see orchestration/workflow.md.
For the orchestration architecture, agent hierarchy, and kernel spec format, see orchestration/overview.md.
Instructions
Follow these steps when writing cuTile kernels (simple workflow for single-kernel tasks).
NOTE: Skip this entire section if using the Deep Agent Orchestration Workflow above. The orchestration workflow has its own steps (O-0 through O-4). Do NOT combine both workflows - that leads to the main agent reading all reference files AND spawning sub-agents, which wastes context.
Step 0: Search Examples and Consult References (MANDATORY)
Objective: Find existing examples and review relevant documentation
Example Search (Two-Step Strategy):
- Search TileGym first for similar cuTile kernel patterns.
- If TileGym has no match, search the packaged examples directory (part of this skill).
- Read relevant example files to understand implementation patterns.
Complex Algorithm Translation (flash attention, fused ops, etc.):
When implementing complex algorithms, follow this systematic approach:
- Analyze the PyTorch implementation: Understand the mathematical operations, data flow, key computational patterns, memory access patterns, and any special optimizations or constraints.
- Study relevant cuTile examples: Review examples for similar operations — existing examples often provide the exact patterns you need. Copy and adapt working patterns rather than reinventing the wheel.
- Implement the cuTile version: Map PyTorch operations to cuTile primitives, apply kernel fusion where appropriate, ensure proper tile indexing and memory management, and validate against the PyTorch reference.
Reference Documentation:
- Language Spec — https://docs.nvidia.com/cuda/cutile-python
- Implementation Guidelines (guidelines/ files 01–03): lessons, rules, and concepts
Step 1: Understand the Problem
Objective: Clearly define what the kernel needs to compute
- Identify input/output tensors and their shapes/dtypes
- Understand the mathematical operations required
- Determine data dependencies and computation flow
- Analyze memory access patterns for optimization opportunities
Working with user-provided reference implementations:
- Preserve Reference Code: Keep the original PyTorch reference implementation intact. Only remove code that is clearly redundant or unnecessary.
- Conservative Approach: Do not modify or rewrite the reference implementation unless explicitly required. The reference serves as the ground truth for correctness validation.
- Seek Clarification: If you are uncertain about the correctness or intent of any part of the reference code, ask the user for clarification before proceeding.
- Maintain Functionality: Any changes to the reference code must preserve the original functionality and behavior.
Step 2: Design Kernel Architecture
Objective: Plan the kernel structure
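- Determine optimal block/tile sizes for parallelization (consider powers of 2 that are multiples of 32, such as 32, 64, or 128)
- Calculate grid dimensions from tensor sizes using ceiling division, so partial tiles at the edges are covered
- Design the block indexing strategy that maps each block ID to the tile it processes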
- Handle edge cases where tensor size is not divisible by block size
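The grid arithmetic in this step can be sketched in plain Python, independent of cuTile. The helper names `cdiv`, `grid_dims`, and `tile_extent` below are illustrative assumptions; they only show the integer math behind covering a tensor with tiles when its size is not divisible by the tile size.

```python
def cdiv(a, b):
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


def grid_dims(shape, tile_shape):
    """Number of tiles needed along each axis to cover the whole tensor."""
    return tuple(cdiv(n, t) for n, t in zip(shape, tile_shape))


def tile_extent(block_id, tile, size):
    """Element range [start, stop) covered by one tile along an axis, clipped to the tensor."""
    start = block_id * tile
    return start, min(start + tile, size)


# A (1000, 512) tensor with 64x64 tiles needs a 16x8 grid.
grid = grid_dims((1000, 512), (64, 64))
# The last row tile is partial: it covers rows 960..999, only 40 of its 64 rows.
last_rows = tile_extent(grid[0] - 1, 64, 1000)
```

Plain floor division would give 15 row tiles here and silently skip the final 40 rows, which is the edge case the last bullet warns about.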
Step 3: Prepare Type System and Constants
Objective: Ensure proper type annotations
- Identify all constant values that need type annotations
- Add Constant type annotations for all constants
- Choose appropriate cuTile dtypes (ct.float32, ct.float16, ct.int32, etc.)
- Ensure block sizes and other parameters are properly typed
Step 4: Implement the Kernel
Objective: Write the cuTile kernel function
- Create a @ct.kernel-decorated kernel function with a proper signature
- Add required parameters (input tensors, output tensor, typed constants)
- Implement block indexing with appropriate calls
- Use ct.load for input tensor access with proper indexing and tile shapes
- Perform operations on loaded tiles using cuTile tile operations
- Use the store op for output tensor writing with correct indexing
Step 5: Prepare and Launch
Objective: Set up tensor inputs and launch kernel
- Ensure all input tensors are on the CUDA device
- Verify tensor dtypes are compatible with cuTile
- Handle tensor contiguity requirements if needed
- Launch kernel with proper grid dimensions
Step 6: Validate and Test
Objective: Ensure correctness
- Verify kernel compiles without errors
- Test with various tensor sizes (aligned and unaligned to tile size)
- Validate results against reference implementation if available
- Check boundary conditions and edge cases
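One way to pick the "aligned and unaligned" test sizes is to probe just below, at, and just above each tile boundary. The sketch below is a plain-Python illustration; `edge_case_sizes` is a hypothetical helper, not part of cuTile or TileGym.

```python
def edge_case_sizes(tile, max_tiles=3):
    """Sizes just below, at, and just above each multiple of `tile`, plus the size-1 case."""
    sizes = {1}
    for k in range(1, max_tiles + 1):
        sizes.update({k * tile - 1, k * tile, k * tile + 1})
    # Drop non-positive sizes, which can only appear when tile == 1.
    return sorted(s for s in sizes if s > 0)


# For a tile size of 64 and two tiles: [1, 63, 64, 65, 127, 128, 129]
sizes_to_test = edge_case_sizes(64, max_tiles=2)
```

Each returned size can be used for one tensor dimension in the test loop; the off-by-one sizes exercise the partial tile at the edge.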
Validation Loop (MANDATORY)
IMPORTANT: After generating cuTile code, you MUST execute it to verify correctness. Do not just write the file - run it and fix any issues.
Validation Workflow
┌─────────────────────────────────────────────────────────────┐
│ 1. Generate Code │
│ - Write cuTile kernel with inline validation to file │
│ │
│ 2. Execute Code │
│ - Run: python <filename>.py │
│ │
│ 3. Check Results │
│ ├─ Compilation error? → Fix syntax/type issues → Retry │
│ ├─ Runtime error? → Fix kernel logic → Retry │
│ ├─ Validation FAIL? → Fix numerical issues → Retry │
│ └─ Validation PASS? → Done ✓ │
└─────────────────────────────────────────────────────────────┘
Execution Steps
- Write the generated code to a file
- Run the file using Bash: python <filename>.py
- Analyze the output:
  - If compilation error: Read error message, fix the code (check type annotations, syntax, API usage)
  - If runtime error: Check tensor shapes, grid dimensions, memory access patterns
  - If validation FAIL: Check numerical differences, tolerances, algorithm correctness
  - If validation PASS: Report success to user
- Iterate until PASS: Fix issues and re-run until validation passes (max 3 attempts)
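The run-and-check step can be sketched as a small script runner. This is only an illustration under stated assumptions: the marker string `Validation PASSED` is assumed to match what the generated file prints, a non-zero exit code is treated as a compilation or runtime error, and the name `run_and_classify` is hypothetical.

```python
import subprocess
import sys

# Assumption: the generated file prints this marker when its inline validation succeeds.
PASS_MARKER = "Validation PASSED"


def run_and_classify(path, timeout=600):
    """Run a generated script and return ("pass" | "fail" | "error", combined output)."""
    proc = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",
        timeout=timeout,
    )
    output = proc.stdout + proc.stderr
    if proc.returncode != 0:
        # Uncaught exception or non-zero exit: compilation or runtime error.
        return "error", output
    if PASS_MARKER in proc.stdout:
        return "pass", output
    # The script ran to completion but did not report a passing validation.
    return "fail", output
```

The three-attempt limit stays with the caller: fix the code based on the returned output, call the function again, and stop after the third non-passing result.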
Validation Output Best Practices
- Don't print large tensors - Only print tensor contents when validation fails
- Print summary stats - Show PASS/FAIL, max difference, tensor shape
- Example validation pattern:
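```python
import torch

# cutile_output and reference_output are tensors of the same shape on the same device.
is_close = torch.allclose(cutile_output, reference_output, atol=1e-3, rtol=1e-3)
if is_close:
    print("✓ Validation PASSED")
else:
    # Print full tensors only on failure to keep normal output short.
    max_diff = (cutile_output - reference_output).abs().max().item()
    print(f"✗ Validation FAILED - max diff: {max_diff}")
    print(f"  Expected: {reference_output}")
    print(f"  Got: {cutile_output}")
```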
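To make the tolerance check concrete, the sketch below restates the element-wise rule that `torch.allclose` applies, `|actual - expected| <= atol + rtol * |expected|`, in plain Python on lists of floats. The helper names `is_close` and `summarize` are illustrative; real validation should keep using `torch.allclose` on tensors.

```python
def is_close(actual, expected, atol=1e-3, rtol=1e-3):
    """Element-wise closeness rule used by torch.allclose, for a single pair of floats."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def summarize(actual, expected, atol=1e-3, rtol=1e-3):
    """Return PASS/FAIL plus summary stats instead of printing whole tensors."""
    diffs = [abs(a - e) for a, e in zip(actual, expected)]
    passed = all(is_close(a, e, atol, rtol) for a, e in zip(actual, expected))
    return {"passed": passed, "max_diff": max(diffs, default=0.0), "numel": len(diffs)}


# Large values get a proportionally larger allowance through rtol ...
big_ok = is_close(100.05, 100.0)
# ... while values near zero are governed by atol alone.
small_bad = is_close(0.005, 0.0)
```

Because the allowance scales with the reference value, the same absolute error can pass for large outputs and fail for outputs near zero; check which regime applies before raising a tolerance.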
Common Issues and Fixes
| Error Type | Typical Cause | Fix |
|---|---|---|
| TypeError: missing Constant annotation | Missing Constant annotation | Add type annotation to all constants |
| ValueError: tile dimension not power of 2 | Non-power-of-2 tile size | Use 2**((size-1).bit_length()) |
| Indexing / out-of-bounds errors | Wrong grid dimensions or indices | Check index usage, tile vs element indices |
| Validation FAIL: max diff = X | Numerical mismatch | Check algorithm, increase tolerance, or fix logic |
Default Tolerance Values
See guidelines/03_concepts.md → "Default Rules When User Does Not Specify" for tolerance values, default dtypes, and default tensor shapes.
Testing Checklist
- ✓ Verify cuTile output matches reference implementation within tolerance
- ✓ Test with various tensor sizes (aligned and unaligned to tile size)
- ✓ Test boundary conditions and edge cases
- ✓ Ensure all tensors are on CUDA device before kernel launch
- ✓ Verify dtype consistency across inputs and outputs
Critical Requirements
Four essential requirements for all cuTile kernels:
- Pure cuTile forward path: Every compute op in the forward path must go through a cuTile kernel. Do not call any PyTorch compute op as a runtime operation in the forward path.
  - Permitted in the forward path: allocation; rearrangement; concatenation; simple scalar ops between kernel launches.
  - Permitted at initialization: Using PyTorch layers solely for weight initialization and storage is fine, as long as the forward path extracts the weights and passes them to the cuTile kernel instead of calling the layer.
  - See Rule 15 and Rule 17 in guidelines/02_code_generation_rules.md for common violations and detailed examples.
- Tile indices, not element indices: ct.load(A, index=(bid_m, k), shape=(BLOCK_M, K)) ✅; passing element offsets as the index ❌
- All tile dimensions must be powers of 2: Use 2**((size-1).bit_length()) to round up
- All constants need type annotations: a Constant annotation is required for compilation
For detailed guidelines on memory operations, tile sizing, common pitfalls, and optimization strategies, see the guidelines/ directory (01–03).
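Two of the requirements above are pure integer arithmetic and can be checked without a GPU. The sketch below shows the power-of-2 round-up expression and how a tile index relates to the element offset it represents. Both helper names are hypothetical, and the snippet does not call any cuTile API.

```python
def next_power_of_2(size):
    """Round a positive size up to the nearest power of 2 (unchanged if already one)."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return 2 ** ((size - 1).bit_length())


def tile_to_element_offset(tile_index, tile_shape):
    """Element offset addressed by a tile index: each component is scaled by the tile size."""
    return tuple(i * t for i, t in zip(tile_index, tile_shape))


# 100 rounds up to 128, while 128 stays 128.
padded = next_power_of_2(100)
# Tile (3, 0) with 64x32 tiles starts at element (192, 0). A load takes the tile
# index (3, 0); passing the element offset (192, 0) instead would address the wrong data.
offset = tile_to_element_offset((3, 0), (64, 32))
```

Subtracting 1 before `bit_length()` is what keeps exact powers of 2 from being doubled.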
Performance Optimization
Key principle: Think in blocks of data rather than individual elements. Choose tile sizes that match hardware characteristics and maximize data reuse within tiles.
File Management Guidelines
IMPORTANT: Follow these rules for file creation:
- Single file by default: Generate a single file containing the kernel, validation, and test code unless the user explicitly requests multiple files
- No documentation files: Do NOT create README.md, documentation files, or separate example files unless explicitly requested
- Inline everything: Include the kernel implementation, validation logic, and test code in one cohesive file
- Minimal file creation: Only create what is absolutely necessary - prefer editing existing files over creating new ones
- No source citations: Do NOT include comments or docstrings mentioning TileGym files, reference files, or sources. The code should stand on its own without attribution
- Output to current working directory: All output files must be written to the current working directory where the user started the coding assistant. Determine that directory at the start of the task. All generated files go directly in it, never in a subdirectory of the skill.
- Skill directory is read-only: The skill directory path is passed to sub-agents solely so they can read references, examples, and orchestration instructions. No agent, main or sub, may ever write, create, or save any file under it. Use it only with read tools (Read, Glob, Grep, read-only Bash commands). Never pass it to Write, Edit, or any file-creating command.
Example structure for a single file:
```python
import cuda.tile as ct
import torch


# Kernel implementation
@ct.kernel
def my_kernel(...):
    ...


# Validation function (if needed)
def validate(...):
    ...


# Test/demo code at bottom
if __name__ == "__main__":
    # Test the kernel
    ...
```
Success Criteria
Your implementation is successful when:
- ✅ Pure cuTile forward path: No PyTorch compute calls in the forward path; all compute is routed through cuTile kernels (weight-init-only usage at initialization is fine)
- ✅ Existing examples were searched before implementation
- ✅ Packaged examples were searched if TileGym had no match
- ✅ Only ONE .py file created (no READMEs, no separate examples unless requested)
- ✅ No source citations in code (no mentions of TileGym files or reference files in comments/docstrings)
- ✅ Generated cuTile code compiles without errors
- ✅ Numerical results match reference implementation within tolerance
- ✅ All constants have proper type annotations
- ✅ All tile dimensions are powers of 2
- ✅ Grid dimensions correctly cover all tensor elements
- ✅ Code includes inline validation and test code in the same file
Additional criteria when using orchestration (complex tasks):
- ✅ Complexity was assessed and orchestration was chosen for the right reasons
- ✅ Analyzer produced clear kernel specs with PyTorch references
- ✅ Independent kernels were generated in parallel (not sequentially)
- ✅ Each individual kernel was validated before composition
- ✅ Composed solution passes end-to-end validation against original PyTorch reference
Remember: Start by searching existing examples, follow the workflow systematically, and validate thoroughly. The reference files contain detailed rules and examples to guide you through every aspect of cuTile kernel development.