cutile-python

cuTile Python Programming Skill

You are an expert in cuTile programming, specializing in writing high-performance GPU kernels using cuTile's tile-based programming model. This skill provides comprehensive guidance for creating, debugging, and optimizing cuTile kernels.

Overview

cuTile is a parallel programming model for NVIDIA GPUs with a Python-based DSL that automatically leverages advanced hardware capabilities like tensor cores. This skill helps you write efficient, correct cuTile code.

When to Use This Skill

Invoke this skill when you need to:

Write cuTile GPU kernels from scratch
Convert tensor operations to cuTile implementations
Debug or fix cuTile kernel code
Optimize cuTile kernels for performance
Understand cuTile API and programming patterns
Validate cuTile implementations
Find and adapt examples from available reference sources

Optionally specify when invoking:

Target tensor shapes
Data types (default: float16)
Performance requirements
Any special constraints

Reference Documentation

cuTile Language Specification — https://docs.nvidia.com/cuda/cutile-python. Covers the execution model, data and memory models, debugging, compilation, and every public op (load/store, factories, reductions, scans, matmul, selection, math, bitwise, comparisons, atomics, metaprogramming, classes, enums, autotuning).

Implementation Guidelines (in the `guidelines/` directory):

01_implementation_lessons.md - Important lessons and implementation rules
02_code_generation_rules.md - Specific code generation rules and patterns
03_concepts.md - Core concepts: tile size restriction, memory operations, kernel fusion, default rules

Examples

Before starting any cuTile programming task, always search for existing examples first. TileGym is the primary reference; the packaged `examples/` directory complements it for ops TileGym does not yet cover (convolution, pooling, scan, GEMV, 4D matmul, split-k GEMM, group_norm).

The skill supports two installation contexts:

Inside a TileGym checkout (`<repo>/.agents/skills/cutile-python/`, or `<repo>/.claude/skills/cutile-python/` via the backward-compat symlink): TileGym ops are at `<repo>/src/tilegym/ops/cutile/`
Installed elsewhere (e.g. `~/.agents/skills/cutile-python/`, `~/.claude/skills/cutile-python/`, or inside a different repo): clone TileGym once to `${TILEGYM_SKILL_CACHE_DIR:-~/.cache/tilegym}/TileGym` and use its `src/tilegym/ops/cutile/`

See examples/tilegym_and_examples_guide.md for the full search order, directory layout, and cache-vs-repo decision procedure.
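
The two installation contexts can be told apart with a small path check. The following is only a sketch under stated assumptions, not the procedure from examples/tilegym_and_examples_guide.md: the helper name `resolve_tilegym_ops_dir` and its return shape are hypothetical, and it covers just the in-checkout case and the cache fallback.

```python
import os
from pathlib import Path

OPS_SUBDIR = Path("src") / "tilegym" / "ops" / "cutile"


def resolve_tilegym_ops_dir(repo_root, env=None):
    """Return (ops_dir, needs_clone) for the two installation contexts.

    Hypothetical helper: repo_root is the repository the skill is installed
    in, or None when the skill lives outside any repository.
    """
    env = os.environ if env is None else env
    if repo_root is not None:
        in_repo = Path(repo_root) / OPS_SUBDIR
        if in_repo.is_dir():
            # Context 1: the skill sits inside a TileGym checkout.
            return in_repo, False
    # Context 2: fall back to the cache location, mirroring
    # ${TILEGYM_SKILL_CACHE_DIR:-~/.cache/tilegym}/TileGym
    cache_root = env.get("TILEGYM_SKILL_CACHE_DIR") or "~/.cache/tilegym"
    ops_dir = Path(cache_root).expanduser() / "TileGym" / OPS_SUBDIR
    # needs_clone is True when the cached checkout does not exist yet.
    return ops_dir, not ops_dir.is_dir()
```

Using `env.get(...) or default` matches the shell `:-` form, which falls back to the default when the variable is unset or empty.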

When to Clarify Before Implementation

For complex or ambiguous tasks, present approach options to the user before coding. This prevents wasted effort on the wrong implementation.

Clarify for These Task Types

| Task Type | Why Clarify | Example Questions |
| --- | --- | --- |
| Optimization requests | "Make this faster" has many paths | Which bottleneck? Memory-bound vs compute-bound? Target speedup? |
| Architecture changes | Structural decisions affect everything | Data parallel vs model parallel? Persistent kernel vs standard? |
| Ambiguous operations | Same name, different implementations | Flash attention vs standard? Causal vs bidirectional? Grouped vs depthwise conv? |
| Performance vs correctness tradeoffs | User must choose | Use TF32 for speed? Approximate math functions? Reduced precision accumulation? |
| Missing constraints | Can't optimize without targets | Target tensor shapes? Batch size range? Memory budget? |

Act Directly for These Task Types

Clear, specific requests: "Write a ReLU kernel for shape (1024, 1024)"
Bug fixes with reproduction: "This kernel crashes on line 42"
API questions: "How do I use ct.gather?"
Example adaptations: "Adapt the TileGym softmax for my shapes"

How to Clarify

When clarification is needed:

Briefly explain why multiple approaches exist
Present 2-3 concrete options with tradeoffs
Recommend one option if there's a clear best choice
Ask the user to choose before proceeding

Example:

Your request "optimize this matmul" could go in several directions:

1. **Persistent kernel** - Best for small matrices, faster, more complex code
2. **Tile size tuning** - Moderate gains, minimal code changes
3. **TMA prefetching** - Best for large matrices, requires Hopper+ GPU

I recommend option 2 for a first pass. Which approach would you like?

Complexity Assessment: Simple vs. Orchestrated Workflow

Before starting implementation, assess the complexity of the request to choose the right workflow.

Use the Simple Workflow (Steps 0-6 below) when:

Single kernel task (e.g., ReLU, softmax, one matmul)
Bug fix or optimization of an existing kernel
API question or example adaptation
Clear, single-operation request

Use the Deep Agent Orchestration Workflow when ANY of these apply:

3+ distinct operations that need separate kernels (e.g., "implement a transformer block with attention, FFN, and layer norm")
Multiple user-defined functions in the input code (e.g., `custom_activation()`, `custom_norm()`)
Inter-kernel data dependencies where output of one kernel feeds into another
PyTorch `nn.Module` with multiple layers in `forward()`
Explicit decomposition request (e.g., "break this into fused kernels")

When orchestration is needed, follow the Deep Agent Orchestration Workflow section. Otherwise, continue with the Instructions below.
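
The "ANY of these apply" rule above can be written as a small decision function. This is a minimal sketch: the `Request` fields and the `choose_workflow` name are hypothetical, and reading "multiple user-defined functions" as "two or more" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical summary of a user request; field names are illustrative."""
    distinct_ops: int = 1                 # operations that need separate kernels
    user_functions: int = 0               # user-defined functions in the input code
    inter_kernel_deps: bool = False       # one kernel's output feeds another
    multi_layer_module: bool = False      # nn.Module with multiple layers in forward()
    asks_for_decomposition: bool = False  # explicit "break this into kernels" request


def choose_workflow(req: Request) -> str:
    """Return 'orchestrated' if any trigger applies, otherwise 'simple'."""
    triggers = (
        req.distinct_ops >= 3,
        req.user_functions >= 2,  # "multiple" read as two or more (assumption)
        req.inter_kernel_deps,
        req.multi_layer_module,
        req.asks_for_decomposition,
    )
    return "orchestrated" if any(triggers) else "simple"


print(choose_workflow(Request()))                # simple
print(choose_workflow(Request(distinct_ops=3)))  # orchestrated
```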

Deep Agent Orchestration Workflow

For complex tasks requiring 3+ kernels, inter-kernel dependencies, or multi-layer `nn.Module` decomposition, use the orchestrated multi-agent pipeline. The main agent acts as an orchestrator (not a coder); sub-agents handle reference reading and code generation.

Pipeline: Op Tracer (optional) → Analyzer → Kernel Agents (parallel) → Composer → Main Agent validates

For the complete step-by-step workflow (Steps O-0 through O-4), prompt templates, and error handling, see orchestration/workflow.md.

For the orchestration architecture, agent hierarchy, and kernel spec format, see orchestration/overview.md.

Instructions

Follow these steps when writing cuTile kernels (simple workflow for single-kernel tasks).

NOTE: Skip this entire section if using the Deep Agent Orchestration Workflow above. The orchestration workflow has its own steps (O-0 through O-4). Do NOT combine both workflows - that leads to the main agent reading all reference files AND spawning sub-agents, which wastes context.

Step 0: Search Examples and Consult References (MANDATORY)

Objective: Find existing examples and review relevant documentation

Example Search (Two-Step Strategy):

Search TileGym (`src/tilegym/ops/cutile/`) first for similar cuTile kernel patterns.
If TileGym has no match, search the packaged `examples/` directory (part of this skill).
Read relevant example files to understand implementation patterns.

Complex Algorithm Translation (flash attention, fused ops, etc.): When implementing complex algorithms, follow this systematic approach:

Analyze the PyTorch implementation: Understand the mathematical operations, data flow, key computational patterns, memory access patterns, and any special optimizations or constraints.
Study relevant cuTile examples: Review examples for similar operations — existing examples often provide the exact patterns you need. Copy and adapt working patterns rather than reinventing the wheel.
Implement the cuTile version: Map PyTorch operations to cuTile primitives, apply kernel fusion where appropriate, ensure proper tile indexing and memory management, and validate against the PyTorch reference.

Reference Documentation:

Language Spec — https://docs.nvidia.com/cuda/cutile-python
Implementation Guidelines (`guidelines/` 01–03) - Lessons, rules, and concepts

Step 1: Understand the Problem

Objective: Clearly define what the kernel needs to compute

Identify input/output tensors and their shapes/dtypes
Understand the mathematical operations required
Determine data dependencies and computation flow
Analyze memory access patterns for optimization opportunities

Working with user-provided reference implementations:

Preserve Reference Code: Keep the original PyTorch reference implementation intact. Only remove code that is clearly redundant or unnecessary.
Conservative Approach: Do not modify or rewrite the reference implementation unless explicitly required. The reference serves as the ground truth for correctness validation.
Seek Clarification: If you are uncertain about the correctness or intent of any part of the reference code, ask the user for clarification before proceeding.
Maintain Functionality: Any changes to the reference code must preserve the original functionality and behavior.

Step 2: Design Kernel Architecture

Objective: Plan the kernel structure

Determine optimal block/tile sizes for parallelization (tile dimensions must be powers of 2, so consider sizes such as 32, 64, or 128)
Calculate grid dimensions based on tensor sizes using `ct.cdiv(size, block)`
Design block indexing strategy using `ct.bid()`
Handle edge cases where tensor size is not divisible by block size
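
The grid arithmetic can be checked on the host before any kernel is written. The sketch below is plain Python, not cuTile code: `cdiv` is a local stand-in that assumes `ct.cdiv(size, block)` performs ceiling division (which is what its use for grid sizing implies), and `grid_for` and `last_tile_valid` are hypothetical helper names.

```python
def cdiv(size: int, block: int) -> int:
    """Ceiling division: number of blocks needed to cover `size` elements."""
    return (size + block - 1) // block


def grid_for(shape, tile_shape):
    """Number of tiles needed along each dimension to cover the tensor."""
    return tuple(cdiv(s, t) for s, t in zip(shape, tile_shape))


def last_tile_valid(size: int, block: int) -> int:
    """How many elements of the final tile fall inside the tensor."""
    remainder = size % block
    return block if remainder == 0 else remainder


print(grid_for((1000, 4096), (128, 64)))  # (8, 64)
print(last_tile_valid(1000, 128))         # 104
print(last_tile_valid(4096, 64))          # 64
```

For the 1000-row dimension, eight tiles of 128 cover 1024 rows, so the last tile holds only 104 valid rows. That partially filled tile is the edge case the kernel has to handle.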

Step 3: Prepare Type System and Constants

Objective: Ensure proper type annotations

Identify all constant values that need type annotations
Add proper type annotations using `ct.Constant[type]` for all constants
Choose appropriate cuTile dtypes (ct.float32, ct.float16, ct.int32, etc.)
Ensure block sizes and other parameters are properly typed

Step 4: Implement the Kernel

Objective: Write the cuTile kernel function

Create a `@ct.kernel`-decorated kernel function with a proper signature
Add required parameters (input tensors, output tensor, typed constants)
Implement block indexing with appropriate `ct.bid()` calls
Use `ct.load()` for input tensor access with proper indexing and tile shapes
Perform operations on loaded tiles using cuTile tile operations
Use `ct.store()` for output tensor writing with correct indexing

Step 5: Prepare and Launch

Objective: Set up tensor inputs and launch kernel

Ensure all input tensors are on CUDA device using `.cuda()` or `.to("cuda")`
Verify tensor dtypes are compatible with cuTile
Handle tensor contiguity requirements using `.contiguous()` if needed
Launch kernel with proper grid dimensions

Step 6: Validate and Test

Objective: Ensure correctness

Verify kernel compiles without errors
Test with various tensor sizes (aligned and unaligned to tile size)
Validate results against reference implementation if available
Check boundary conditions and edge cases
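
One way to cover both aligned and unaligned sizes is to generate test sizes around multiples of the tile size. This is an illustrative sketch; the helper name `candidate_sizes` and the choice of "one below, exact, one above" are assumptions, not a rule from the guidelines.

```python
def candidate_sizes(tile: int, max_tiles: int = 3):
    """Sizes that exercise aligned and unaligned cases around multiples of `tile`."""
    sizes = {1, tile - 1, tile, tile + 1}
    for n in range(2, max_tiles + 1):
        sizes.update({n * tile - 1, n * tile, n * tile + 1})
    # Drop non-positive values (possible when tile == 1).
    return sorted(s for s in sizes if s > 0)


print(candidate_sizes(64, max_tiles=2))  # [1, 63, 64, 65, 127, 128, 129]
```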

Validation Loop (MANDATORY)

IMPORTANT: After generating cuTile code, you MUST execute it to verify correctness. Do not just write the file - run it and fix any issues.

Validation Workflow

┌─────────────────────────────────────────────────────────────┐
│  1. Generate Code                                           │
│     - Write cuTile kernel with inline validation to file    │
│                                                             │
│  2. Execute Code                                            │
│     - Run: python <filename>.py                             │
│                                                             │
│  3. Check Results                                           │
│     ├─ Compilation error? → Fix syntax/type issues → Retry  │
│     ├─ Runtime error? → Fix kernel logic → Retry            │
│     ├─ Validation FAIL? → Fix numerical issues → Retry      │
│     └─ Validation PASS? → Done ✓                            │
└─────────────────────────────────────────────────────────────┘

Execution Steps

Write the generated code to a `.py` file
Run the file using Bash: `python <filename>.py`
Analyze the output:
- If compilation error: Read error message, fix the code (check type annotations, syntax, API usage)
- If runtime error: Check tensor shapes, grid dimensions, memory access patterns
- If validation FAIL: Check numerical differences, tolerances, algorithm correctness
- If validation PASS: Report success to user
Iterate until PASS: Fix issues and re-run until validation passes (max 3 attempts)
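
The run, check, retry cycle above can be sketched as a small driver. This is an illustration under assumptions: the script is taken to print "Validation PASSED" on success (as in the validation pattern in this document) and to exit non-zero on a compilation or runtime error; `run_once`, `validate_with_retries`, and the `fix` callback are hypothetical names.

```python
import subprocess
import sys

MAX_ATTEMPTS = 3  # matches the "max 3 attempts" limit above


def run_once(path):
    """Run the script and classify the outcome from exit code and output."""
    proc = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, errors="replace"
    )
    output = proc.stdout + proc.stderr
    if proc.returncode != 0:
        return "ERROR", output  # compilation or runtime error
    if "Validation PASSED" in output:
        return "PASS", output
    return "FAIL", output       # ran to completion, validation did not pass


def validate_with_retries(path, fix, max_attempts=MAX_ATTEMPTS):
    """Run the script, calling fix(status, output) between failed attempts."""
    status = "ERROR"
    for attempt in range(1, max_attempts + 1):
        status, output = run_once(path)
        if status == "PASS":
            return status, attempt
        if attempt < max_attempts:
            fix(status, output)  # hypothetical callback that edits the file
    return status, max_attempts
```

The exit code alone cannot separate a compilation error from a runtime error; telling them apart still means reading the captured output.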

Validation Output Best Practices

Don't print large tensors - Only print tensor contents when validation fails
Print summary stats - Show PASS/FAIL, max difference, tensor shape

Example validation pattern:

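```python
import torch

# cutile_output and reference_output are the tensors produced earlier in the script.
is_close = torch.allclose(cutile_output, reference_output, atol=1e-3, rtol=1e-3)
if is_close:
    print("✓ Validation PASSED")
else:
    # Print tensor contents only on failure, alongside the summary stats.
    max_diff = (cutile_output - reference_output).abs().max().item()
    print(f"✗ Validation FAILED - max diff: {max_diff}")
    print(f"  Expected: {reference_output}")
    print(f"  Got:      {cutile_output}")
```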
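
To make the tolerance check in this pattern concrete: `torch.allclose` passes when every element satisfies |got - expected| <= atol + rtol * |expected|. The sketch below restates that criterion in plain Python on lists, so it runs without a GPU; `is_close` and `max_abs_diff` are illustrative helper names, not part of any library.

```python
def is_close(got, expected, atol=1e-3, rtol=1e-3):
    """True when every element satisfies |got - expected| <= atol + rtol * |expected|."""
    return all(abs(g - e) <= atol + rtol * abs(e) for g, e in zip(got, expected))


def max_abs_diff(got, expected):
    """Largest absolute element-wise difference (the summary stat to report)."""
    return max(abs(g - e) for g, e in zip(got, expected))


expected = [1.0, 100.0, 1000.0]
got = [1.0015, 100.05, 1000.9]

# The max difference is about 0.9, yet the check passes: the rtol term
# scales the allowed error with the magnitude of the expected value.
print(is_close(got, expected))       # True
print(is_close([1.01], [1.0]))       # False
```

Because of the relative term, the max difference on its own does not decide PASS or FAIL. It is a diagnostic to report next to the verdict.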

Common Issues and Fixes

| Error Type | Typical Cause | Fix |
| --- | --- | --- |
| `TypeError: missing Constant annotation` | Missing `ct.Constant[int]` | Add type annotation to all constants |
| `ValueError: tile dimension not power of 2` | Non-power-of-2 tile size | Use `2**((size-1).bit_length())` |
| `IndexError` / `CUDA error` | Wrong grid dimensions or indices | Check `ct.cdiv` usage, tile vs element indices |
| `Validation FAIL: max diff = X` | Numerical mismatch | Check algorithm, increase tolerance, or fix logic |
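
The power-of-2 fix in the table is plain integer arithmetic, so it can be checked on the host. A minimal sketch follows; the wrapper name `next_power_of_2` and the guard against non-positive sizes are additions for illustration.

```python
def next_power_of_2(size: int) -> int:
    """Round a positive integer up to the nearest power of 2."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return 2 ** ((size - 1).bit_length())


for size in (1, 2, 3, 100, 1000, 1024, 1025):
    print(size, "->", next_power_of_2(size))
# 1 -> 1
# 2 -> 2
# 3 -> 4
# 100 -> 128
# 1000 -> 1024
# 1024 -> 1024
# 1025 -> 2048
```

Rounding up means the tile can extend past the tensor (a size of 1000 becomes a 1024-wide tile), so the out-of-range part still has to be handled as an edge case.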

Default Tolerance Values

See `guidelines/03_concepts.md` → "Default Rules When User Does Not Specify" for tolerance values, default dtypes, and default tensor shapes.

Testing Checklist

✓ Verify cuTile output matches reference implementation within tolerance
✓ Test with various tensor sizes (aligned and unaligned to tile size)
✓ Test boundary conditions and edge cases
✓ Ensure all tensors are on CUDA device before kernel launch
✓ Verify dtype consistency across inputs and outputs

Critical Requirements

Four essential requirements for all cuTile kernels:

Pure cuTile forward path: Every compute op in `forward()` / `composed_function()` must go through `@ct.kernel` + `ct.launch`. Do not call `nn.Conv2d()(x)`, `F.conv2d(x, w)`, `F.linear(x, w)`, or any other `nn.*` / `F.*` compute op as a runtime operation in the forward path.
- Permitted in `forward()`: `torch.empty`, `torch.zeros`, `torch.ones` (allocation); `tensor.reshape`, `tensor.view`, `tensor.permute`, `tensor.contiguous` (rearrangement); `torch.cat`, `torch.stack` (concatenation); `torch.sqrt`, `.sum()`, `.mean()` (simple scalar ops between kernel launches).
- Permitted in `__init__()`: Using `nn.Conv2d`, `nn.Linear`, etc. solely for weight initialization and storage is fine, as long as `forward()` extracts the weights (e.g., `self.conv.weight.data`) and passes them to `ct.launch` instead of calling `self.conv(x)`.
- See Rule 15 and Rule 17 in `guidelines/02_code_generation_rules.md` for common violations and detailed examples.
Tile indices, not element indices: `ct.load(A, index=(bid_m, k), shape=(BLOCK_M, K))` ✅ not `(bid_m * BLOCK_M, k)` ❌
All tile dimensions must be powers of 2: Use `2**((size-1).bit_length())` to round up
All constants need type annotations: `BLOCK: ct.Constant[int]` is required for compilation

For detailed guidelines on memory operations, tile sizing, common pitfalls, and optimization strategies, see the `guidelines/` directory (01–03).
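
The tile-index requirement is easy to get wrong, so here is a pure-Python emulation of the idea. It is not cuTile code: `load_tile` is a hypothetical stand-in that assumes tile index `i` with tile size `shape` maps to elements `[i * shape, (i + 1) * shape)`, which is what the ✅/❌ contrast above implies.

```python
def load_tile(data, index, shape):
    """Emulated 1D tile load: tile `index` covers [index*shape, (index+1)*shape)."""
    start = index * shape
    return data[start:start + shape]


BLOCK = 4
data = list(range(16))  # 16 elements, i.e. 4 tiles of 4
bid = 2                 # third tile

correct = load_tile(data, index=bid, shape=BLOCK)
# Passing an element offset scales the index twice (2 * 4 * 4 = 32),
# which in this emulation lands past the end of the data.
wrong = load_tile(data, index=bid * BLOCK, shape=BLOCK)

print(correct)  # [8, 9, 10, 11]
print(wrong)    # []
```

What a real out-of-range load returns depends on cuTile's load semantics, which this emulation does not model. The point is only that the index is scaled by the tile shape once, by the load itself.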

Performance Optimization

Key principle: Think in blocks of data rather than individual elements. Choose tile sizes that match hardware characteristics and maximize data reuse within tiles.

File Management Guidelines

IMPORTANT: Follow these rules for file creation:

Single file by default: Generate a single `.py` file containing the kernel, validation, and test code unless the user explicitly requests multiple files
No documentation files: Do NOT create README.md, documentation files, or separate example files unless explicitly requested
Inline everything: Include the kernel implementation, validation logic, and test code in one cohesive file
Minimal file creation: Only create what is absolutely necessary - prefer editing existing files over creating new ones
No source citations: Do NOT include comments or docstrings mentioning TileGym files, reference files, or sources. The code should stand on its own without attribution
Output to current working directory: All output `.py` files must be written to the current working directory where the user started the coding assistant. Run `pwd` at the start of the task. All generated `.py` files go directly in that directory (e.g. `./composed_foo.py`), never in a subdirectory of the skill.
Skill directory is read-only: `<skill_dir>` is passed to sub-agents solely so they can read references, examples, and orchestration instructions. No agent, main or sub, may ever write, create, or save any file under `<skill_dir>`. Use it only with read tools (Read, Glob, Grep, Bash `cat` / `grep`). Never pass it to Write, Edit, or any file-creating command.
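
The output-location and read-only rules can be enforced with a path check before any write. The sketch below is illustrative only: `safe_output_path` is a hypothetical helper, and the skill itself states these rules in prose rather than prescribing code for them.

```python
from pathlib import Path


def safe_output_path(filename: str, cwd: Path, skill_dir: Path) -> Path:
    """Return cwd/filename, refusing nested paths and anything under skill_dir."""
    if Path(filename).name != filename:
        # Generated files go directly in the working directory, not a subdirectory.
        raise ValueError("output files go directly in the working directory")
    target = (cwd / filename).resolve()
    skill_root = skill_dir.resolve()
    if target == skill_root or skill_root in target.parents:
        raise PermissionError("the skill directory is read-only")
    return target
```

Resolving both paths first means a working directory that is a symlink into the skill directory is also rejected.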

Example structure for a single file:

```python
import cuda.tile as ct
import torch

# Kernel implementation
@ct.kernel
def my_kernel(...):
    ...

# Validation function (if needed)
def validate(...):
    ...

# Test/demo code at bottom
if __name__ == "__main__":
    # Test the kernel
    ...
```

Success Criteria

Your implementation is successful when:

✅ Pure cuTile forward path: No `nn.*` / `F.*` compute calls in `forward()` / `composed_function()`; all compute routed through `ct.launch` (weight-init-only usage in `__init__` is fine)
✅ Existing examples were searched before implementation
✅ Packaged `examples/` were searched if TileGym had no match
✅ Only ONE .py file created (no READMEs, no separate examples unless requested)
✅ No source citations in code (no mentions of TileGym files or reference files in comments/docstrings)
✅ Generated cuTile code compiles without errors
✅ Numerical results match reference implementation within tolerance
✅ All constants have proper type annotations
✅ All tile dimensions are powers of 2
✅ Grid dimensions correctly cover all tensor elements
✅ Code includes inline validation and test code in the same file

Additional criteria when using orchestration (complex tasks):

✅ Complexity was assessed and orchestration was chosen for the right reasons
✅ Analyzer produced clear kernel specs with PyTorch references
✅ Independent kernels were generated in parallel (not sequentially)
✅ Each individual kernel was validated before composition
✅ Composed solution passes end-to-end validation against original PyTorch reference

Remember: Start by searching existing examples, follow the workflow systematically, and validate thoroughly. The reference files contain detailed rules and examples to guide you through every aspect of cuTile kernel development.
