cutile-python
cuTile Python Programming Skill
You are an expert in cuTile programming, specializing in writing high-performance GPU kernels using cuTile's tile-based programming model. This skill provides comprehensive guidance for creating, debugging, and optimizing cuTile kernels.
Overview
cuTile is a parallel programming model for NVIDIA GPUs with a Python-based DSL that automatically leverages advanced hardware capabilities like tensor cores. This skill helps you write efficient, correct cuTile code.
When to Use This Skill
Invoke this skill when you need to:
- Write cuTile GPU kernels from scratch
- Convert tensor operations to cuTile implementations
- Debug or fix cuTile kernel code
- Optimize cuTile kernels for performance
- Understand cuTile API and programming patterns
- Validate cuTile implementations
- Find and adapt examples from available reference sources
Optionally specify when invoking:
- Target tensor shapes
- Data types (default: float16)
- Performance requirements
- Any special constraints
Reference Documentation
cuTile Language Specification — https://docs.nvidia.com/cuda/cutile-python. Covers
the execution model, data and memory models, debugging, compilation, and every public op
(load/store, factories, reductions, scans, matmul, selection, math, bitwise, comparisons,
atomics, metaprogramming, classes, enums, autotuning).
Implementation Guidelines (in the `guidelines/` directory):
- 01_implementation_lessons.md - Important lessons and implementation rules
- 02_code_generation_rules.md - Specific code generation rules and patterns
- 03_concepts.md - Core concepts: tile size restriction, memory operations, kernel fusion, default rules
Examples
Before starting any cuTile programming task, always search for existing examples first. TileGym is the primary reference; the packaged `examples/` directory complements it for ops TileGym does not yet cover (convolution, pooling, scan, GEMV, 4D matmul, split-k GEMM, group_norm).
The skill supports two installation contexts:
- Inside a TileGym checkout (`<repo>/.agents/skills/cutile-python/`, or `<repo>/.claude/skills/cutile-python/` via the backward-compat symlink): TileGym ops are at `<repo>/src/tilegym/ops/cutile/`.
- Installed elsewhere (e.g. `~/.agents/skills/cutile-python/`, `~/.claude/skills/cutile-python/`, or inside a different repo): clone TileGym once to `${TILEGYM_SKILL_CACHE_DIR:-~/.cache/tilegym}/TileGym` and use its `src/tilegym/ops/cutile/`.
See examples/tilegym_and_examples_guide.md for the full search order, directory layout, and cache-vs-repo decision procedure.
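The two installation contexts can be told apart with a small path check. The sketch below is one possible reading of the rule, not the procedure from the guide: the helper name `find_tilegym_ops` and the walk-up-the-parents heuristic are assumptions made for illustration. Only the directory layout and the `TILEGYM_SKILL_CACHE_DIR` variable come from the text.

```python
import os
from pathlib import Path

OPS_SUBDIR = Path("src") / "tilegym" / "ops" / "cutile"

def find_tilegym_ops(skill_dir, env=None):
    """Return where TileGym cuTile ops are expected (hypothetical helper)."""
    env = os.environ if env is None else env
    skill_dir = Path(skill_dir).expanduser().resolve()
    # Context 1: the skill sits inside a TileGym checkout, so some parent
    # directory contains src/tilegym/ops/cutile/.
    for parent in skill_dir.parents:
        if (parent / OPS_SUBDIR).is_dir():
            return parent / OPS_SUBDIR
    # Context 2: installed elsewhere. Mirror the shell default
    # ${TILEGYM_SKILL_CACHE_DIR:-~/.cache/tilegym}/TileGym (unset or empty
    # falls back to ~/.cache/tilegym).
    cache_root = env.get("TILEGYM_SKILL_CACHE_DIR") or "~/.cache/tilegym"
    return Path(cache_root).expanduser() / "TileGym" / OPS_SUBDIR
```

The function only computes a path; cloning TileGym into the cache location is left to the caller.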
When to Clarify Before Implementation
For complex or ambiguous tasks, present approach options to the user before coding. This prevents wasted effort on the wrong implementation.
Clarify for These Task Types
| Task Type | Why Clarify | Example Questions |
|---|---|---|
| Optimization requests | "Make this faster" has many paths | Which bottleneck? Memory-bound vs compute-bound? Target speedup? |
| Architecture changes | Structural decisions affect everything | Data parallel vs model parallel? Persistent kernel vs standard? |
| Ambiguous operations | Same name, different implementations | Flash attention vs standard? Causal vs bidirectional? Grouped vs depthwise conv? |
| Performance vs correctness tradeoffs | User must choose | Use TF32 for speed? Approximate math functions? Reduced precision accumulation? |
| Missing constraints | Can't optimize without targets | Target tensor shapes? Batch size range? Memory budget? |
Act Directly for These Task Types
- Clear, specific requests: "Write a ReLU kernel for shape (1024, 1024)"
- Bug fixes with reproduction: "This kernel crashes on line 42"
- API questions: "How do I use ct.gather?"
- Example adaptations: "Adapt the TileGym softmax for my shapes"
How to Clarify
When clarification is needed:
- Briefly explain why multiple approaches exist
- Present 2-3 concrete options with tradeoffs
- Recommend one option if there's a clear best choice
- Ask the user to choose before proceeding
Example:
Your request "optimize this matmul" could go several directions:
1. **Persistent kernel** - Best for small matrices, faster, more complex code
2. **Tile size tuning** - Moderate gains, minimal code changes
3. **TMA prefetching** - Best for large matrices, requires Hopper+ GPU
I recommend option 2 for a first pass. Which approach would you like?
Complexity Assessment: Simple vs. Orchestrated Workflow
Before starting implementation, assess the complexity of the request to choose the right workflow.
Use the Simple Workflow (Steps 0-6 below) when:
- Single kernel task (e.g., ReLU, softmax, one matmul)
- Bug fix or optimization of an existing kernel
- API question or example adaptation
- Clear, single-operation request
Use the Deep Agent Orchestration Workflow when ANY of these apply:
- 3+ distinct operations that need separate kernels (e.g., "implement a transformer block with attention, FFN, and layer norm")
- Multiple user-defined functions in the input code (e.g., `custom_activation()`, `custom_norm()`)
- Inter-kernel data dependencies where output of one kernel feeds into another
- PyTorch `nn.Module` with multiple layers in `forward()`
- Explicit decomposition request (e.g., "break this into fused kernels")
When orchestration is needed, follow the Deep Agent Orchestration Workflow section. Otherwise, continue with the Instructions below.
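The any-of rule for picking a workflow can be written down as a small decision function. This is a minimal sketch: the `TaskProfile` fields and the `choose_workflow` name are hypothetical, and reading "multiple user-defined functions" as two or more is an assumption.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """Facts gathered about a request (field names are hypothetical)."""
    distinct_ops: int = 1
    user_defined_functions: int = 0
    inter_kernel_dependencies: bool = False
    multi_layer_module: bool = False
    explicit_decomposition: bool = False

def choose_workflow(task: TaskProfile) -> str:
    """Return "orchestrated" if ANY trigger applies, otherwise "simple"."""
    triggers = (
        task.distinct_ops >= 3,              # 3+ operations needing separate kernels
        task.user_defined_functions >= 2,    # "multiple" read as two or more (assumption)
        task.inter_kernel_dependencies,      # one kernel's output feeds another
        task.multi_layer_module,             # nn.Module with several layers in forward()
        task.explicit_decomposition,         # user asked for a decomposition
    )
    return "orchestrated" if any(triggers) else "simple"

print(choose_workflow(TaskProfile()))                # simple
print(choose_workflow(TaskProfile(distinct_ops=3)))  # orchestrated
```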
Deep Agent Orchestration Workflow
For complex tasks requiring 3+ kernels, inter-kernel dependencies, or multi-layer `nn.Module` decomposition, use the orchestrated multi-agent pipeline. The main agent acts as an orchestrator (not a coder); sub-agents handle reference reading and code generation.
Pipeline: Op Tracer (optional) → Analyzer → Kernel Agents (parallel) → Composer → Main Agent validates
For the complete step-by-step workflow (Steps O-0 through O-4), prompt templates, and error handling, see orchestration/workflow.md.
For the orchestration architecture, agent hierarchy, and kernel spec format, see orchestration/overview.md.
Instructions
Follow these steps when writing cuTile kernels (simple workflow for single-kernel tasks).
NOTE: Skip this entire section if using the Deep Agent Orchestration Workflow above. The orchestration workflow has its own steps (O-0 through O-4). Do NOT combine both workflows - that leads to the main agent reading all reference files AND spawning sub-agents, which wastes context.
Step 0: Search Examples and Consult References (MANDATORY)
Objective: Find existing examples and review relevant documentation
Example Search (Two-Step Strategy):
- Search TileGym (`src/tilegym/ops/cutile/`) first for similar cuTile kernel patterns.
- If TileGym has no match, search the packaged `examples/` directory (part of this skill).
- Read relevant example files to understand implementation patterns.
Complex Algorithm Translation (flash attention, fused ops, etc.):
When implementing complex algorithms, follow this systematic approach:
- Analyze the PyTorch implementation: Understand the mathematical operations, data flow, key computational patterns, memory access patterns, and any special optimizations or constraints.
- Study relevant cuTile examples: Review examples for similar operations — existing examples often provide the exact patterns you need. Copy and adapt working patterns rather than reinventing the wheel.
- Implement the cuTile version: Map PyTorch operations to cuTile primitives, apply kernel fusion where appropriate, ensure proper tile indexing and memory management, and validate against the PyTorch reference.
Reference Documentation:
- Language Spec — https://docs.nvidia.com/cuda/cutile-python
- Implementation Guidelines (`guidelines/` 01–03): lessons, rules, and concepts
Step 1: Understand the Problem
Objective: Clearly define what the kernel needs to compute
- Identify input/output tensors and their shapes/dtypes
- Understand the mathematical operations required
- Determine data dependencies and computation flow
- Analyze memory access patterns for optimization opportunities
Working with user-provided reference implementations:
- Preserve Reference Code: Keep the original PyTorch reference implementation intact. Only remove code that is clearly redundant or unnecessary.
- Conservative Approach: Do not modify or rewrite the reference implementation unless explicitly required. The reference serves as the ground truth for correctness validation.
- Seek Clarification: If you are uncertain about the correctness or intent of any part of the reference code, ask the user for clarification before proceeding.
- Maintain Functionality: Any changes to the reference code must preserve the original functionality and behavior.
Step 2: Design Kernel Architecture
Objective: Plan the kernel structure
- Determine optimal block/tile sizes for parallelization (tile dimensions must be powers of 2, e.g. 32, 64, or 128)
- Calculate grid dimensions based on tensor sizes using `ct.cdiv(size, block)`
- Design block indexing strategy using `ct.bid()`
- Handle edge cases where tensor size is not divisible by block size
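The grid arithmetic can be checked on the host before any kernel is written. The sketch below is plain Python, not cuTile: `cdiv` is a stand-in that assumes `ct.cdiv` performs ceiling division, and `grid_for` and `tile_extent` are illustrative helper names.

```python
def cdiv(size: int, block: int) -> int:
    """Ceiling division: number of blocks needed to cover `size` elements."""
    return (size + block - 1) // block

def grid_for(shape, block_shape):
    """One grid dimension per tensor dimension."""
    return tuple(cdiv(s, b) for s, b in zip(shape, block_shape))

def tile_extent(bid: int, block: int, size: int):
    """Element range [start, stop) covered by tile `bid`, clipped at the tensor edge."""
    start = bid * block
    return start, min(start + block, size)

# 1000 rows is not divisible by 64, so the last row-tile is only partly filled.
print(grid_for((1000, 512), (64, 64)))   # (16, 8)
print(tile_extent(15, 64, 1000))         # (960, 1000)
```

The last tile along the first dimension covers only 40 valid rows, which is the edge case the kernel has to handle.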
Step 3: Prepare Type System and Constants
Objective: Ensure proper type annotations
- Identify all constant values that need type annotations
- Add proper type annotations using `ct.Constant[type]` for all constants
- Choose appropriate cuTile dtypes (ct.float32, ct.float16, ct.int32, etc.)
- Ensure block sizes and other parameters are properly typed
Step 4: Implement the Kernel
Objective: Write the cuTile kernel function
- Create `@ct.kernel` decorated kernel function with proper signature
- Add required parameters (input tensors, output tensor, typed constants)
- Implement block indexing with appropriate `ct.bid()` calls
- Use `ct.load()` for input tensor access with proper indexing and tile shapes
- Perform operations on loaded tiles using cuTile tile operations
- Use `ct.store()` for output tensor writing with correct indexing
Step 5: Prepare and Launch
Objective: Set up tensor inputs and launch kernel
- Ensure all input tensors are on CUDA device using `.cuda()` or `.to("cuda")`
- Verify tensor dtypes are compatible with cuTile
- Handle tensor contiguity requirements using `.contiguous()` if needed
- Launch kernel with proper grid dimensions
Step 6: Validate and Test
Objective: Ensure correctness
- Verify kernel compiles without errors
- Test with various tensor sizes (aligned and unaligned to tile size)
- Validate results against reference implementation if available
- Check boundary conditions and edge cases
Validation Loop (MANDATORY)
IMPORTANT: After generating cuTile code, you MUST execute it to verify correctness. Do not just write the file - run it and fix any issues.
Validation Workflow
┌─────────────────────────────────────────────────────────────┐
│ 1. Generate Code │
│ - Write cuTile kernel with inline validation to file │
│ │
│ 2. Execute Code │
│ - Run: python <filename>.py │
│ │
│ 3. Check Results │
│ ├─ Compilation error? → Fix syntax/type issues → Retry │
│ ├─ Runtime error? → Fix kernel logic → Retry │
│ ├─ Validation FAIL? → Fix numerical issues → Retry │
│ └─ Validation PASS? → Done ✓ │
└─────────────────────────────────────────────────────────────┘
Execution Steps
- Write the generated code to a `.py` file
- Run the file using Bash: `python <filename>.py`
- Analyze the output:
  - If compilation error: Read error message, fix the code (check type annotations, syntax, API usage)
  - If runtime error: Check tensor shapes, grid dimensions, memory access patterns
  - If validation FAIL: Check numerical differences, tolerances, algorithm correctness
  - If validation PASS: Report success to user
- Iterate until PASS: Fix issues and re-run until validation passes (max 3 attempts)
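The run, check, fix, retry cycle can be sketched as a small driver. This is an illustration under assumptions, not part of the skill: `run_once` and `validation_loop` are hypothetical names, success is detected by the string "PASSED" in stdout (matching the validation message used in this document), and compilation and runtime errors are lumped together as any non-zero exit code.

```python
import subprocess
import sys

MAX_ATTEMPTS = 3  # the limit stated in the execution steps

def run_once(path):
    """Run the generated file and classify the outcome from its output."""
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    if proc.returncode != 0:
        return "error", output   # compilation or runtime error
    if "PASSED" in proc.stdout:
        return "pass", output
    return "fail", output        # ran cleanly but validation did not pass

def validation_loop(path, fix):
    """Re-run `path` until it passes; `fix(status, output)` edits the file in between.

    Returns the attempt number that passed, or None after MAX_ATTEMPTS.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, output = run_once(path)
        if status == "pass":
            return attempt
        if attempt < MAX_ATTEMPTS:
            fix(status, output)
    return None
```

In practice `fix` is where the error message is read and the kernel source is edited; here it is just a callback.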
Validation Output Best Practices
- Don't print large tensors - Only print tensor contents when validation fails
- Print summary stats - Show PASS/FAIL, max difference, tensor shape
- Example validation pattern:
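```python
import torch

# cutile_output: kernel result; reference_output: PyTorch reference result
is_close = torch.allclose(cutile_output, reference_output, atol=1e-3, rtol=1e-3)
if is_close:
    print("✓ Validation PASSED")
else:
    # Print tensor contents only on failure, along with the max difference
    max_diff = (cutile_output - reference_output).abs().max().item()
    print(f"✗ Validation FAILED - max diff: {max_diff}")
    print(f"  Expected: {reference_output}")
    print(f"  Got: {cutile_output}")
```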
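The `atol` and `rtol` arguments combine into one per-element bound: `torch.allclose` is documented to accept an element when |actual − expected| ≤ atol + rtol · |expected|. A scalar sketch in plain Python shows how the bound scales with the reference value (the helper name `is_close` and the sample numbers are illustrative):

```python
def is_close(actual: float, expected: float, atol: float = 1e-3, rtol: float = 1e-3) -> bool:
    """Scalar version of the allclose bound: |a - e| <= atol + rtol * |e|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

# The allowed error grows with the magnitude of the reference value.
print(is_close(100.09, 100.0))   # True:  0.09 <= 0.001 + 0.1
print(is_close(1.003, 1.0))      # False: 0.003 > 0.001 + 0.001
print(is_close(0.0005, 0.0))     # True:  only atol applies near zero
```

This is why a fixed tolerance can pass large-magnitude outputs while failing small ones with the same absolute error.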
Common Issues and Fixes
| Typical Cause | Fix |
|---|---|
| Missing type annotation on a constant | Add type annotation to all constants |
| Non-power-of-2 tile size | Use `2**((size-1).bit_length())` to round up |
| Wrong grid dimensions or indices | Check the `ct.cdiv(size, block)` grid calculation and the tile indices |
| Numerical mismatch | Check algorithm, increase tolerance, or fix logic |
Default Tolerance Values
See `guidelines/03_concepts.md` → "Default Rules When User Does Not Specify" for tolerance values, default dtypes, and default tensor shapes.
Testing Checklist
- ✓ Verify cuTile output matches reference implementation within tolerance
- ✓ Test with various tensor sizes (aligned and unaligned to tile size)
- ✓ Test boundary conditions and edge cases
- ✓ Ensure all tensors are on CUDA device before kernel launch
- ✓ Verify dtype consistency across inputs and outputs
Critical Requirements
Four essential requirements for all cuTile kernels:
- Pure cuTile forward path: Every compute op in `forward()` / `composed_function()` must go through `@ct.kernel` + `ct.launch`. Do not call `nn.Conv2d()(x)`, `F.conv2d(x, w)`, `F.linear(x, w)`, or any other `nn.*` / `F.*` compute op as a runtime operation in the forward path.
  - Permitted in `forward()`: `torch.empty`, `torch.zeros`, `torch.ones` (allocation); `tensor.reshape`, `tensor.view`, `tensor.permute`, `tensor.contiguous` (rearrangement); `torch.cat`, `torch.stack` (concatenation); `torch.sqrt`, `.sum()`, `.mean()` (simple scalar ops between kernel launches).
  - Permitted in `__init__()`: Using `nn.Conv2d`, `nn.Linear`, etc. solely for weight initialization and storage is fine, as long as `forward()` extracts the weights (e.g., `self.conv.weight.data`) and passes them to `ct.launch` instead of calling `self.conv(x)`.
  - See Rule 15 and Rule 17 in `guidelines/02_code_generation_rules.md` for common violations and detailed examples.
- Tile indices, not element indices: ✅ `ct.load(A, index=(bid_m, k), shape=(BLOCK_M, K))` not ❌ `(bid_m * BLOCK_M, k)`
- All tile dimensions must be powers of 2: Use `2**((size-1).bit_length())` to round up
- All constants need type annotations: `BLOCK: ct.Constant[int]` is required for compilation
For detailed guidelines on memory operations, tile sizing, common pitfalls, and optimization strategies, see the `guidelines/` directory (01–03).
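Two of these requirements, power-of-2 tile sizes and tile-based indexing, can be tried out in plain Python. The sketch below is an emulation, not the cuTile API: `load_tile` is a hypothetical stand-in for a 1D tile load, and zero-padding the out-of-range tail is an assumption made for the emulation only.

```python
def next_power_of_2(size: int) -> int:
    """Smallest power of two >= size (for size >= 1), using the rounding rule above."""
    return 2 ** ((size - 1).bit_length())

def load_tile(row, index, shape):
    """Emulated 1D tile load: `index` counts tiles, not elements.

    Positions past the end of `row` are zero-padded (assumption of this emulation).
    """
    start = index * shape                 # the tile index is scaled internally
    chunk = row[start:start + shape]
    return chunk + [0] * (shape - len(chunk))

data = list(range(10))
BLOCK = next_power_of_2(3)                # 3 rounds up to 4
num_tiles = -(-len(data) // BLOCK)        # ceiling division: 3 tiles
tiles = [load_tile(data, i, BLOCK) for i in range(num_tiles)]
print(BLOCK, tiles)   # 4 [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 0]]
```

Passing `i * BLOCK` as the index would scale the offset twice and skip most of the data, which is the mistake the tile-index rule guards against.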
Performance Optimization
Key principle: Think in blocks of data rather than individual elements. Choose tile sizes that match hardware characteristics and maximize data reuse within tiles.
File Management Guidelines
IMPORTANT: Follow these rules for file creation:
- Single file by default: Generate a single `.py` file containing the kernel, validation, and test code unless the user explicitly requests multiple files
- No documentation files: Do NOT create README.md, documentation files, or separate example files unless explicitly requested
- Inline everything: Include the kernel implementation, validation logic, and test code in one cohesive file
- Minimal file creation: Only create what is absolutely necessary - prefer editing existing files over creating new ones
- No source citations: Do NOT include comments or docstrings mentioning TileGym files, reference files, or sources. The code should stand on its own without attribution
- Output to current working directory: All output `.py` files must be written to the current working directory where the user started the coding assistant. Run `pwd` at the start of the task. All generated `.py` files go directly in that directory (e.g. `./composed_foo.py`), never in a subdirectory of the skill.
- Skill directory is read-only: `<skill_dir>` is passed to sub-agents solely so they can read references, examples, and orchestration instructions. No agent, main or sub, may ever write, create, or save any file under `<skill_dir>`. Use it only with read tools (Read, Glob, Grep, Bash `cat`/`grep`). Never pass it to Write, Edit, or any file-creating command.
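The last two rules amount to a path check before any write. A minimal sketch of such a guard, assuming a hypothetical helper `output_path` that is not part of the skill:

```python
from pathlib import Path

def output_path(filename, cwd, skill_dir):
    """Resolve where a generated file goes; refuse anything under the skill directory."""
    target = (Path(cwd) / filename).resolve()
    skill_root = Path(skill_dir).resolve()
    # The skill directory is read-only: reject the directory itself and any descendant.
    if target == skill_root or skill_root in target.parents:
        raise ValueError(f"refusing to write under skill directory: {target}")
    return target
```

Resolving both paths first means a relative filename such as `../skill/x.py` is caught as well.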
Example structure for a single file:
```python
import cuda.tile as ct
import torch

# Kernel implementation
@ct.kernel
def my_kernel(...):
    ...

# Validation function (if needed)
def validate(...):
    ...

# Test/demo code at bottom
if __name__ == "__main__":
    # Test the kernel
    ...
```
Success Criteria
成功标准
Your implementation is successful when:
- ✅ Pure cuTile forward path: No `nn.*` / `F.*` compute calls in `forward()` / `composed_function()`; all compute routed through `ct.launch` (weight-init-only usage in `__init__` is fine)
- ✅ Existing examples were searched before implementation
- ✅ Packaged `examples/` were searched if TileGym had no match
- ✅ Only ONE .py file created (no READMEs, no separate examples unless requested)
- ✅ No source citations in code (no mentions of TileGym files or reference files in comments/docstrings)
- ✅ Generated cuTile code compiles without errors
- ✅ Numerical results match reference implementation within tolerance
- ✅ All constants have proper type annotations
- ✅ All tile dimensions are powers of 2
- ✅ Grid dimensions correctly cover all tensor elements
- ✅ Code includes inline validation and test code in the same file
Additional criteria when using orchestration (complex tasks):
- ✅ Complexity was assessed and orchestration was chosen for the right reasons
- ✅ Analyzer produced clear kernel specs with PyTorch references
- ✅ Independent kernels were generated in parallel (not sequentially)
- ✅ Each individual kernel was validated before composition
- ✅ Composed solution passes end-to-end validation against original PyTorch reference
Remember: Start by searching existing examples, follow the workflow systematically, and validate thoroughly. The reference files contain detailed rules and examples to guide you through every aspect of cuTile kernel development.