cupynumeric-migration-readiness

cuPyNumeric Migration Readiness

Purpose

Use this skill BEFORE the migration, not during. Answer one question: which of the user's existing NumPy APIs will scale on cuPyNumeric, and which need refactoring, before they commit engineer-weeks to porting? To answer it: read the source, classify each NumPy idiom by its expected multi-GPU scaling on the Legate/NVIDIA GPU stack, cross-reference the bundled API-support manifest, and produce a structured verdict with per-finding reasoning and recipe pointers.
This is a static, read-only assessment. Inspect the user's source with `Read`, `Grep`, and `Glob`. Do not execute the user's code, modify or write files, or print environment variables or secrets. The `legate` and cuPyNumeric Doctor commands shown below are suggestions for the user to run, not actions this skill performs.
If you have not used this skill before, read `references/getting-started.md` first.

When to use this skill

Use when the user is about to migrate NumPy code to GPU and asks whether it will scale on cuPyNumeric / GPU, whether they should migrate, which parts will benefit, what must change before porting, or whether the port is worth it — or mentions pre-port assessment, scaling analysis, idiom analysis, GPU refactor planning, or identifying NumPy anti-patterns for GPU.
Decline and redirect when the request is not a pre-migration assessment:
  • Post-migration performance / profiling ("already ported, why is it slow?") → point to `legate --profile` and the upstream profiling and debugging walkthrough.
  • Custom CUDA / kernel authoring ("write/optimize a CUDA kernel")
A graph / sparse / ML / NLP workload that the user is asking to migrate is still in scope: assess it and return NOT RECOMMENDED via Gate 4. That is a verdict, not a decline.

Instructions

Run all five steps below, in order. Read the user's code and reason about it semantically; do not emit a one-shot prose verdict.

Step 1 — Gather context

Elicit before scanning code. Each item below has a default tuned to the typical workload — use the default when the user does not volunteer specifics; do not block on questions.
  • Source location. Default to the current working directory when no path is given.
  • Approximate hot-path array sizes at runtime. Default to 30–50 million elements. Map the user's numbers (or this default) to the Gate 2 tiers (65K per-GPU floor; 10M+ for real single-GPU speedup; 100M+ for multi-GPU).
  • Target hardware. Default to 1–4 GPUs, single-node. Confirm before assuming multi-node. For CPU-only runs, ask about RAM per node instead of FBMEM.
  • Dominant compute pattern. Stencil / GEMM / Monte Carlo / reductions / mixed-with-SciPy. Ask the user to name it; otherwise infer it from the code in Step 3.
State the defaults you applied at the top of the assessment so the user can correct them. If a value is indeterminable, say so plainly and proceed with the qualitative-only assessment — do not fabricate numbers beyond the defaults above.
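The size mapping above can be sketched as a small helper. This is only an illustration: the function name and tier labels are hypothetical, and it assumes the 10M and 100M thresholds refer to the total hot-path array size while the 65,536 floor is per GPU.

```python
# Thresholds quoted in Step 1. Names and tier labels are illustrative only.
PER_GPU_FLOOR = 65_536            # below this per GPU, Gate 2 fails
SINGLE_GPU_SPEEDUP = 10_000_000   # "10M+ for real single-GPU speedup"
MULTI_GPU = 100_000_000           # "100M+ for multi-GPU"


def gate2_tier(total_elements: int, gpus: int = 1) -> str:
    """Map a hot-path array size to a coarse Gate 2 tier label."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    # Assumption: the floor applies to the evenly divided per-GPU share.
    if total_elements // gpus < PER_GPU_FLOOR:
        return "below-floor"
    if total_elements >= MULTI_GPU:
        return "multi-gpu"
    if total_elements >= SINGLE_GPU_SPEEDUP:
        return "single-gpu-speedup"
    return "above-floor"


# The Step 1 default (30-50 million elements) on one GPU.
default_tier = gate2_tier(40_000_000, gpus=1)
```

With the stated default, `default_tier` is `"single-gpu-speedup"`, which is the value to report at the top of the assessment when the user gives no sizes.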

Step 2 — Load the API support manifest

Read `assets/api-support.md`, the committed snapshot of the upstream NumPy-vs-cuPyNumeric comparison table. For each NumPy API the code calls, find its line and read the leading glyph:
  • `✓✓ numpy.X`: implemented and works on multi-GPU (the best path).
  • `✓ numpy.X`: implemented but single-GPU/CPU only (a caveat for multi-node runs).
  • `🟡 numpy.X — <note>`: partial support; read the note.
  • `✗ numpy.X`: not implemented on the cuPyNumeric distributed path. Behavior on call is version-specific (some unsupported APIs route through host NumPy, others raise an exception); either way, hot-path use is a migration blocker. Do not promise users a silent fallback to host NumPy.
If the `Fetched:` line is more than ~90 days old, refresh the snapshot; see the Available Scripts section.
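The glyph legend can be read mechanically, as in the sketch below. It assumes each manifest entry is a single line that starts with the glyph followed by the API name. The helper name, the status labels, and the sample API names are hypothetical.

```python
from typing import Optional, Tuple

# Order matters: the two-check glyph must be tested before the single check.
GLYPHS = (
    ("✓✓", "multi-gpu"),
    ("✓", "single-gpu-or-cpu"),
    ("🟡", "partial"),
    ("✗", "unimplemented"),
)


def classify_manifest_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (api_name, status) for a manifest entry, or None for other lines."""
    text = line.strip()
    for glyph, status in GLYPHS:
        if text.startswith(glyph):
            rest = text[len(glyph):].strip()
            # The API name is the first whitespace-delimited token after the glyph.
            api = rest.split()[0] if rest else ""
            return api, status
    return None


# Hypothetical entries; real statuses come from assets/api-support.md.
sample = classify_manifest_line("🟡 numpy.example_c — see note")
```

Testing `✓✓` before `✓` is the one detail that is easy to get wrong, since a plain prefix check on `✓` would also match multi-GPU entries.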

Step 3 — Read the code semantically

Walk the user's files with `Read` and `Grep` and classify each region of array math against `references/idioms-that-scale.md` and `references/idioms-that-block.md` (full rationale and R-codes live there). Read semantically, not by regex: before flagging, confirm `arr` traces back to a `cupynumeric` array (or `np.*` aliased to it) and check whether the access sits inside a hot loop. Apply these rules:
  • Flag element loops (`for i in range(n): arr[i] = ...`) as blockers; treat an epoch/step/file loop with a vectorized body as fine. Distinguish the two.
  • Flag scalar sync: `.item()` / `float()` / `int()` / `bool()` / `complex()` on a cuPyNumeric array inside a hot loop (per-iteration host sync); allow it at the boundary.
  • Flag reducing conditions: an `if` / `while` over an array reduction (`while np.max(err) > tol:`) syncs every iteration.
  • Flag hoistable allocation in a loop as a fixable inefficiency.
  • Flag `mpi4py` in runtime code that partitions/communicates array data alongside `cupynumeric` (R108), but first confirm it issues MPI calls on a hot path; ignore a grep hit in a README, build script, or alt-launcher.
  • Flag `order=` on `reshape` / `asarray` / `flatten` as R109, always, regardless of whether the version warns or silently no-ops.
  • Always cite R304 in INFO for `np.random.*` under multi-GPU: cross-GPU bit-identical reproducibility is impossible by default (`--gpus N` / `LEGATE_GPUS` is the Legate launcher arg).
  • Flag Python builtins on arrays (`sum` / `max` / `min` / `any` / `iter(arr)`): these fall back to host iteration (R110; upstream best practices). Allow `len(arr)` (shape lookup; prefer `arr.shape[0]` / `arr.size` for 0-d safety).
  • Flag `cupy` mixed with `cupynumeric` in a hot loop (R111); the runtimes don't share GPU memory, so every hop goes through host NumPy.
  • Look up every NumPy API the code calls in `assets/api-support.md` (glyph legend in Step 2).
For the deep "why," read `references/gpu-stack.md` (memory, SM, communication, dispatch) and `references/execution-model.md` (lazy execution, sync points, mapper).
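Two of the rules above, the element loop and the per-iteration scalar sync, are easiest to recognize side by side. The sketch below uses plain NumPy as a stand-in so it runs anywhere; in a ported script the import would be `import cupynumeric as np`. The function names are hypothetical, and the rewritten forms are one plausible shape rather than the bundled recipes.

```python
import numpy as np  # stand-in; a ported script would import cupynumeric as np


def scale_element_loop(arr, factor):
    """Element loop: one indexed read and write per element (flag as a blocker)."""
    out = np.empty_like(arr)
    for i in range(arr.shape[0]):
        out[i] = arr[i] * factor
    return out


def scale_vectorized(arr, factor):
    """Same result as a single whole-array operation."""
    return arr * factor


def energy_sync_each_step(fields):
    """Step loop with a vectorized body, but .item() pulls a scalar to the host every pass."""
    total = 0.0
    for f in fields:
        total += np.sum(f * f).item()
    return total


def energy_sync_at_boundary(fields):
    """Accumulate into a 0-d array and convert to a Python float once, after the loop."""
    total = np.zeros(())
    for f in fields:
        total += np.sum(f * f)
    return float(total)


data = np.arange(8, dtype=np.float64)
fields = [data, data + 1.0, data * 2.0]
```

The loop over `fields` is itself fine in both versions. What changes is where the scalar conversion happens: inside the loop in the first form, once at the boundary in the second.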

Step 4 — Produce a structured assessment

Deliver the report in this order. Cite `file:line` for every finding so the user can navigate.
  1. Verdict in one sentence: see "Verdict framework" below.
  2. What works (SCALES findings): quote representative lines so the user sees what will speed up after the import swap.
  3. What blocks (BLOCKS findings): each tied to `idioms-that-block.md` and a recipe in `refactor-recipes.md`.
  4. What's fixable (REFACTOR findings): group by recipe; one recipe often fixes many sites.
  5. Compatibility / cost notes (INFO findings): SciPy boundaries, single-GPU-only linalg / FFT, RNG layout vs `--gpus N`.
  6. API support gaps: APIs the code calls that are unimplemented or single-GPU only per the manifest.
  7. Decision-framework summary: Gates 1–6 from `references/decision-framework.md`, marked pass / fail / uncertain.
  8. Recommended next steps: which recipes to apply first, whether to port one module first, and when to involve cuPyNumeric Doctor.
All 8 sections must appear, even when the verdict is READY or NOT RECOMMENDED. Under an empty section write "None for this code" or "n/a — see verdict" in one line. Do NOT omit the heading; the headings are the structural contract the report is graded on. See `assets/sample_report.md` for worked reports.
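The eight-heading contract can be kept honest with a small renderer that always emits every heading. This is a sketch only: the function name and the input shape (a dict keyed by heading) are hypothetical, and the layout in `assets/sample_report.md` remains the reference for the real format.

```python
# Section titles taken from the ordered list in Step 4.
SECTIONS = (
    "Verdict",
    "What works",
    "What blocks",
    "What's fixable",
    "Compatibility / cost notes",
    "API support gaps",
    "Decision-framework summary",
    "Recommended next steps",
)


def render_report(findings: dict) -> str:
    """Render all eight headings; empty sections get the one-line placeholder."""
    lines = []
    for number, title in enumerate(SECTIONS, start=1):
        lines.append(f"{number}. {title}")
        entries = findings.get(title) or ["None for this code"]
        lines.extend(f"   {entry}" for entry in entries)
    return "\n".join(lines)


report = render_report({"Verdict": ["READY: no BLOCKS or REFACTOR findings."]})
```

Because the loop walks `SECTIONS` rather than the findings, a section with no entries still appears with its placeholder line.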

Step 5 — Hand off to cuPyNumeric Doctor for runtime validation

Direct the user to run cuPyNumeric Doctor once they have applied the recipes and the code runs:
```bash
CUPYNUMERIC_DOCTOR=1 CUPYNUMERIC_DOCTOR_FORMAT=json CUPYNUMERIC_DOCTOR_FILENAME=doctor-report.json legate --gpus 1 main.py
```
cuPyNumeric Doctor catches at runtime what source review can miss (scalar item access, ndarray iteration, advanced indexing, `nonzero` misuse, `mpi4py` import, in-place ops on views). End the assessment at: "now run with cuPyNumeric Doctor enabled; here is what to look for in its output."

Verdict framework

Assign the verdict qualitatively, from the kinds of findings, not a score:
| Verdict | When | Action |
| --- | --- | --- |
| READY | No BLOCKS; few/no REFACTOR | Swap the import; benchmark |
| LIGHT REFACTOR | A few recipe-fixable patterns (R201–R206), or one or two simple BLOCKS | Apply 1–3 recipes from `refactor-recipes.md`; re-walk to READY |
| SIGNIFICANT REFACTOR | Multiple BLOCKS in hot paths, or any R108 (`mpi4py`); these are rewrites, not disqualifications | Real project; budget 1–3 engineer-weeks per module |
| NOT RECOMMENDED | Only two failures: Gate 2 (arrays below the 65,536 floor) or Gate 4 (wrong compute pattern). A pile of BLOCKS does not land here | Restructure first or use a different runtime |
Apply these in order; the first match wins:
  1. Gate 4 fails (sparse / graph / ML / sequential / string) → NOT RECOMMENDED.
  2. Gate 2 fails (hot-path arrays < 65,536 elements/GPU, no realistic batching path) → NOT RECOMMENDED.
  3. Any R108 (`mpi4py`) → SIGNIFICANT REFACTOR (the parallelism-layer rewrite is the cost, not a disqualification).
  4. Multiple BLOCKS (R101–R111) across hot paths → SIGNIFICANT REFACTOR (count does not escalate past this; each BLOCKS has a documented recipe).
  5. One or two recipe-fixable BLOCKS (e.g., R101–R104 element-loop / sync) → LIGHT REFACTOR.
  6. Only REFACTOR patterns (R201–R206) → LIGHT REFACTOR; recipes are mechanical.
  7. No BLOCKS, no REFACTOR → READY.
  8. APIs missing from the manifest on the hot path → demote one tier (SIGNIFICANT stays SIGNIFICANT, never NOT RECOMMENDED). Single-GPU-only APIs matter only for multi-node.
Weigh the kinds of findings, not their count. One R101 in a hot loop outranks ten R001s because it destroys the scaling the R001s would have delivered. Conversely, a pile of BLOCKS + R108 is still SIGNIFICANT, not NOT RECOMMENDED: the tiers measure engineering cost, not despair. NOT RECOMMENDED requires a size or compute-pattern failure. Full framework: `references/decision-framework.md`.
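The first-match-wins ordering can be written down as a short function, which makes it harder to apply the rules out of order. The sketch below is an illustration, not the skill's implementation. The function and parameter names are hypothetical, and it assumes "multiple" means more than two hot-path BLOCKS, since one or two land in LIGHT REFACTOR.

```python
TIERS = ("READY", "LIGHT REFACTOR", "SIGNIFICANT REFACTOR")


def verdict(*, gate4_fails=False, gate2_fails=False, has_r108=False,
            hot_path_blocks=0, refactor_count=0, missing_hot_path_api=False):
    """Apply the ordered rules; the first match decides the tier."""
    # Rules 1-2: only a compute-pattern or size failure gives NOT RECOMMENDED.
    if gate4_fails or gate2_fails:
        return "NOT RECOMMENDED"
    # Rules 3-4: any R108, or more than two BLOCKS (assumed meaning of "multiple").
    if has_r108 or hot_path_blocks > 2:
        tier = 2
    # Rules 5-6: one or two BLOCKS, or only REFACTOR patterns.
    elif hot_path_blocks >= 1 or refactor_count >= 1:
        tier = 1
    # Rule 7: nothing to fix.
    else:
        tier = 0
    # Rule 8: demote one tier, capped at SIGNIFICANT REFACTOR.
    if missing_hot_path_api:
        tier = min(tier + 1, len(TIERS) - 1)
    return TIERS[tier]


many_blocks = verdict(hot_path_blocks=7, has_r108=True)
```

Here `many_blocks` is `"SIGNIFICANT REFACTOR"`: the count of BLOCKS never reaches NOT RECOMMENDED, because only the two gate checks at the top can return it.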

What scales vs what blocks (at-a-glance)

  • SCALES (keep as-is): vectorized elementwise, reductions, matmul / einsum, `np.where`, large-per-GPU stencil slicing `arr[1:-1, 1:-1]`, `out=`, boolean-mask indexing.
  • BLOCKS (remove before migration): element loops, `np.vectorize`, `for row in arr`, `.item()/.tolist()/bool(arr)` in a hot loop, reducing `if` / `while` in a loop, `arr[::2]`, `dtype=object`, `mpi4py`, `order=`, Python builtin `min/max/sum(arr)`.
  • REFACTOR (apply a recipe): alloc in a loop, `x = x + y` rebind in a loop, `vstack/hstack/concatenate` in a loop, `np.nonzero()` + indexing, view-mutation of `diag/flip/flatten`, `reshape` in a hot loop.
  • INFO (cost note, not a blocker): SciPy imports, single-device `linalg.qr/svd`, single-transform `fft.*`, size-thresholded `linalg.solve/cholesky`.
Full taxonomy in `idioms-that-scale.md` and `idioms-that-block.md`. Pass over silently any API the manifest doesn't list (it is out of scope of the upstream table, so flagging it would be noise).
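Two of the REFACTOR patterns, allocation in a loop and `concatenate` in a loop, might be rewritten along the lines below. This is a sketch with plain NumPy as a stand-in for `cupynumeric`. The function names are hypothetical, and the recipes in `refactor-recipes.md` may prescribe a different shape.

```python
import numpy as np  # stand-in; a ported script would import cupynumeric as np


def damp_alloc_in_loop(u, steps):
    """Allocates a fresh temporary on every iteration."""
    for _ in range(steps):
        tmp = np.empty_like(u)
        np.multiply(u, 0.5, out=tmp)
        u = tmp
    return u


def damp_hoisted(u, steps):
    """Allocates the temporary once and swaps the two buffers each iteration."""
    u = u.copy()            # leave the caller's array untouched
    tmp = np.empty_like(u)
    for _ in range(steps):
        np.multiply(u, 0.5, out=tmp)
        u, tmp = tmp, u
    return u


def collect_concatenate(rows):
    """Regrows the result array on every iteration."""
    out = np.empty((0, rows[0].shape[0]))
    for row in rows:
        out = np.concatenate([out, row[None, :]])
    return out


def collect_stacked(rows):
    """Builds the result with a single stack call after the rows are gathered."""
    return np.stack(rows)


u0 = np.linspace(0.0, 1.0, 16)
rows = [u0, u0 + 1.0, u0 * 3.0]
```

Both rewrites keep the loop structure and change only where memory is created, which is why these patterns are classed as fixable rather than blocking.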

Reading order

The canonical, read-in-order guide lives in `references/getting-started.md`; read it once for orientation.
For a non-trivial assessment the must-reads are `idioms-that-block.md`, `refactor-recipes.md`, and `decision-framework.md`; the rest (`idioms-that-scale.md`, `gpu-stack.md`, `execution-model.md`, `partitioning-and-balance.md`, `case-studies.md`) are read on demand.

Limitations

  • Does not run cuPyNumeric. No runtime required; this is the pre-port check. Actual speedup measurement happens after migration.
  • Does not auto-generate refactored code. It identifies what to change and points to recipes; the user (or a follow-up agent) applies them.
  • Does not profile the workload. For runtime measurement use `legate.timing.time()` and the upstream profiling and debugging guide.
  • Does not replace judgment. Pattern matching misses implicit syncs inside logging, decorators that hide `.tolist()`, and runtime-data-dependent partition mismatches. Read the source too, especially in borderline cases.

Examples

A worked assessment of the bundled `assets/examples/` fixtures (an example, not a template):
Verdict: LIGHT REFACTOR. `scales_well.py` translates cleanly; `needs_refactor.py` needs one allocation hoisted; `blocks_scaling.py` syncs every iteration via `.item()`.
What works: `scales_well.py:23-31` (stencil R005), `:40-44` (reduction R002), `:18-22` (elementwise R001).
What blocks: `blocks_scaling.py:51-58` (R104, `.item()` in hot loop) → RR-sync.
What's fixable: `needs_refactor.py:21-28` (R201, alloc in loop) → RR-alloc.
Next: apply the recipes; re-walk to READY; enable `CUPYNUMERIC_DOCTOR=1` on the first real run.
The full worked report is in `assets/sample_report.md`.

Authoritative upstream references

Available Scripts

| Script | Purpose | Arguments |
| --- | --- | --- |
| `scripts/fetch_api_support.py` | Scrape the upstream comparison table into `assets/api-support.md`. Python stdlib only; standalone. | `--default-path` (write the committed `assets/api-support.md`); `--docs-nvidia-url` (use canonical `docs.nvidia.com` instead of the default GitHub Pages mirror) |
The user runs this to refresh the manifest (`python scripts/fetch_api_support.py --default-path`).

Bundled references and assets

The `references/` files are enumerated under Reading order above (R-code ranges: `idioms-that-scale.md` = R001–R007 / R301–R305; `idioms-that-block.md` = R101–R111 / R201–R206). Assets: `assets/api-support.md` (committed API snapshot, loaded in Step 2), `assets/sample_report.md` and `assets/examples/*.py` (worked report and fixtures).

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| `Fetched:` line in the manifest > ~90 days old | Stale snapshot | Run `fetch_api_support.py --default-path` (user-run) |
| Manifest missing or scraper fails | Upstream HTML changed | `WebFetch` the comparison table for that assessment |
| NOT RECOMMENDED for many fixable BLOCKS | Heuristics applied out of order | Re-apply order: Gate 4 → Gate 2 → R108 → BLOCKS → REFACTOR; weigh kinds, not count |
| Kernel authoring or post-migration profiling | Out of scope | Decline and redirect (see "When to use"); no verdict |