cupynumeric-migration-readiness

cuPyNumeric Migration Readiness

Purpose

Use this skill BEFORE the migration, not during. Answer one question: which of the user's existing NumPy APIs will scale on cuPyNumeric, and which need refactoring, before they commit engineer-weeks to porting? To answer it: read the source, classify each NumPy idiom by its expected multi-GPU scaling on the Legate/NVIDIA GPU stack, cross-reference the bundled API-support manifest, and produce a structured verdict with per-finding reasoning and recipe pointers.

This is a static, read-only assessment. Inspect the user's source with `Read`, `Grep`, and `Glob`. Do not execute the user's code, modify or write files, or print environment variables or secrets. The `legate` and cuPyNumeric Doctor commands shown below are suggestions for the user to run, not actions this skill performs.

If this skill has never been seen before, head to `references/getting-started.md` first.

When to use this skill

Use when the user is about to migrate NumPy code to GPU and asks whether it will scale on cuPyNumeric / GPU, whether they should migrate, which parts will benefit, what must change before porting, or whether the port is worth it — or mentions pre-port assessment, scaling analysis, idiom analysis, GPU refactor planning, or identifying NumPy anti-patterns for GPU.

Decline and redirect when the request is not a pre-migration assessment:

- Post-migration performance / profiling ("already ported, why is it slow?") → point to `legate --profile` and the upstream profiling and debugging walkthrough.
- Custom CUDA / kernel authoring ("write/optimize a CUDA kernel").

A graph / sparse / ML / NLP workload that the user is asking to migrate is still in scope: assess it and return NOT RECOMMENDED via Gate 4. That is a verdict, not a decline.

Instructions

Run all five steps below, in order. Read the user's code and reason about it semantically; do not emit a one-shot prose verdict.

Step 1 — Gather context

Elicit before scanning code. Each item below has a default tuned to the typical workload — use the default when the user does not volunteer specifics; do not block on questions.

- Source location. Default to the current working directory when no path is given.
- Approximate hot-path array sizes at runtime. Default to 30–50 million elements. Map the user's numbers (or this default) to the Gate 2 tiers (65,536-element per-GPU floor; 10M+ for real single-GPU speedup; 100M+ for multi-GPU).
- Target hardware. Default to 1–4 GPUs, single-node. Confirm before assuming multi-node. For CPU-only runs, ask about RAM per node instead of FBMEM.
- Dominant compute pattern. Stencil / GEMM / Monte Carlo / reductions / mixed-with-SciPy. Ask the user to name it; otherwise infer it from the code in Step 3.

State the defaults you applied at the top of the assessment so the user can correct them. If a value is indeterminable, say so plainly and proceed with the qualitative-only assessment — do not fabricate numbers beyond the defaults above.
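The size mapping in Step 1 can be sketched as a small helper. This is an illustration only, not part of the skill: the function name `gate2_tier` and the tier labels are hypothetical, the thresholds are the ones quoted above, and treating the 10M and 100M figures as total hot-path element counts (rather than per-GPU counts) is an assumption.

```python
# Hypothetical helper: map a hot-path array size to the Gate 2 tiers quoted in Step 1.
PER_GPU_FLOOR = 65_536             # per-GPU floor from the text
SINGLE_GPU_SPEEDUP = 10_000_000    # "10M+ for real single-GPU speedup"
MULTI_GPU_SPEEDUP = 100_000_000    # "100M+ for multi-GPU"


def gate2_tier(total_elements: int, n_gpus: int = 1) -> str:
    """Classify a hot-path array size; tier labels are illustrative."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be >= 1")
    per_gpu = total_elements // n_gpus
    if per_gpu < PER_GPU_FLOOR:
        return "below floor"
    # Assumption: the 10M / 100M thresholds apply to the total element count.
    if total_elements >= MULTI_GPU_SPEEDUP:
        return "multi-GPU candidate"
    if total_elements >= SINGLE_GPU_SPEEDUP:
        return "single-GPU speedup"
    return "above floor, marginal"


# Step 1 default: 30-50 million elements on a single GPU.
print(gate2_tier(30_000_000, n_gpus=1))
```

Splitting a small array across more GPUs can push it under the floor, which is why the helper checks the per-GPU count first.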

Step 2 — Load the API support manifest

Read `assets/api-support.md`, the committed snapshot of the upstream NumPy-vs-cuPyNumeric comparison table. For each NumPy API the code calls, find its line and read the leading glyph:

- `✓✓ numpy.X`: implemented and works on multi-GPU (the best path).
- `✓ numpy.X`: implemented but single-GPU/CPU only (caveats multi-node).
- `🟡 numpy.X — <note>`: partial support; read the note.
- `✗ numpy.X`: not implemented on the cuPyNumeric distributed path. Behavior on call is version-specific (some unsupported APIs route through host NumPy, others raise an exception); either way, hot-path use is a migration blocker. Do not promise users a silent fallback to host-NumPy.

If the `Fetched:` line is more than ~90 days old, refresh the snapshot; see the Available Scripts section.
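As a rough illustration of the glyph legend, the lookup could be sketched like this. Only the four line shapes listed above are taken from the text; the function name `parse_manifest_line`, the status labels, and any further detail of the manifest layout are assumptions.

```python
from typing import Optional, Tuple

# Longest glyph first so that "✓✓" is not mistaken for "✓".
GLYPHS = [
    ("✓✓", "multi-gpu"),
    ("✓", "single-gpu-or-cpu"),
    ("🟡", "partial"),
    ("✗", "not-implemented"),
]

NOTE_SEPARATOR = " \u2014 "  # the dash shown between the API name and <note> in the legend


def parse_manifest_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (api, status, note), or None if the line has no known leading glyph."""
    text = line.strip()
    for glyph, status in GLYPHS:
        if text.startswith(glyph):
            rest = text[len(glyph):].strip()
            api, _, note = rest.partition(NOTE_SEPARATOR)
            return api.strip(), status, note.strip()
    return None


print(parse_manifest_line("✓✓ numpy.add"))
```

Lines that carry no glyph return `None`, so headers such as the `Fetched:` line are skipped rather than misclassified.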

Step 3 — Read the code semantically

Walk the user's files with `Read` and `Grep` and classify each region of array math against `references/idioms-that-scale.md` and `references/idioms-that-block.md` (full rationale and R-codes live there). Read semantically, not by regex: before flagging, confirm `arr` traces back to a `cupynumeric` array (or `np.*` aliased to it) and check whether the access sits inside a hot loop. Apply these rules:

- Flag element loops (`for i in range(n): arr[i] = ...`) as blockers; treat an epoch/step/file loop with a vectorized body as fine. Distinguish the two.
- Flag scalar sync: `.item()` / `float()` / `int()` / `bool()` / `complex()` on a cuPyNumeric array inside a hot loop (per-iteration host sync); allow it at the boundary.
- Flag reducing conditions: `if` / `while` over an array reduction (`while np.max(err) > tol:`) syncs every iteration.
- Flag hoistable allocation in a loop as a fixable inefficiency.
- Flag `mpi4py` in runtime code that partitions/communicates array data alongside `cupynumeric` (R108), but first confirm it issues MPI calls on a hot path; ignore a grep hit in a README, build script, or alt-launcher.
- Flag `order=` on `reshape` / `asarray` / `flatten` as R109, always, regardless of whether the version warns or silently no-ops.
- Always cite R304 in INFO for `np.random.*` under multi-GPU: cross-GPU bit-identical reproducibility is impossible by default (`--gpus N` / `LEGATE_GPUS` is the Legate launcher arg).
- Flag Python builtins on arrays (`sum` / `max` / `min` / `any` / `iter(arr)`): host-iteration fallback (R110; upstream best practices). Allow `len(arr)` (shape lookup; prefer `arr.shape[0]` / `arr.size` for 0-d safety).
- Flag `cupy` mixed with `cupynumeric` in a hot loop (R111); the runtimes don't share GPU memory, so every hop goes through host NumPy.
- Look up every NumPy API the code calls in `assets/api-support.md` (glyph legend in Step 2).

For the deep "why," read `references/gpu-stack.md` (memory, SM, communication, dispatch) and `references/execution-model.md` (lazy execution, sync points, mapper).
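To make the element-loop and reducing-condition rules concrete, here is a minimal sketch of a 1-D relaxation written both ways. It uses plain NumPy so it runs anywhere; in a port the import would become `import cupynumeric as np`. The function names and the `check_every` parameter are illustrative assumptions, not the recipes bundled in `refactor-recipes.md`.

```python
import numpy as np  # in a port this line would become `import cupynumeric as np`


def relax_blocking(u, tol, max_iters=1000):
    """Anti-pattern: element loop plus an array reduction in the loop condition."""
    u = u.copy()
    err = np.ones_like(u)
    iters = 0
    while np.max(err) > tol and iters < max_iters:  # reduction evaluated every iteration
        new = u.copy()
        for i in range(1, len(u) - 1):              # one tiny operation per element
            new[i] = 0.5 * (u[i - 1] + u[i + 1])
        err = np.abs(new - u)
        u = new
        iters += 1
    return u


def relax_vectorized(u, tol, max_iters=1000, check_every=10):
    """Same update as whole-array slicing, with the convergence check amortized."""
    u = u.copy()
    for it in range(1, max_iters + 1):
        new = u.copy()
        new[1:-1] = 0.5 * (u[:-2] + u[2:])          # stencil as slice arithmetic
        # Short-circuit `and`: the reduction only runs on every check_every-th step.
        done = it % check_every == 0 and float(np.max(np.abs(new - u))) <= tol
        u = new
        if done:
            break
    return u


start = np.zeros(8)
start[-1] = 1.0
print(relax_vectorized(start, tol=1e-8))
```

The outer iteration loop stays in both versions; only the per-element loop and the per-iteration reduction are the flagged idioms. Checking convergence less often means the vectorized version may run a few extra iterations, which is usually an acceptable trade for fewer host syncs.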

Step 4 — Produce a structured assessment

Deliver the report in this order. Cite `file:line` for every finding so the user can navigate.

1. Verdict in one sentence: see "Verdict framework" below.
2. What works (SCALES findings): quote representative lines so the user sees what will speed up after the import swap.
3. What blocks (BLOCKS findings): each tied to `idioms-that-block.md` and a recipe in `refactor-recipes.md`.
4. What's fixable (REFACTOR findings): group by recipe; one recipe often fixes many sites.
5. Compatibility / cost notes (INFO findings): SciPy boundaries, single-GPU-only linalg / FFT, RNG layout vs `--gpus N`.
6. API support gaps: APIs the code calls that are unimplemented or single-GPU only per the manifest.
7. Decision-framework summary: Gates 1–6 from `references/decision-framework.md`, marked pass / fail / uncertain.
8. Recommended next steps: which recipes to apply first, whether to port one module first, and when to involve cuPyNumeric Doctor.

All 8 sections must appear, even when the verdict is READY or NOT RECOMMENDED. Under an empty section write "None for this code" or "n/a — see verdict" in one line; do NOT omit the heading, because the headings are the structural contract the report is graded on. See `assets/sample_report.md` for worked reports.
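The "all 8 sections must appear" contract can be sketched as a tiny renderer. This is a hypothetical illustration: the function name `render_report`, the dictionary shape, and the plain-text layout are assumptions, and `assets/sample_report.md` remains the reference for the real format.

```python
from typing import Dict, List

# The eight required headings, in the order given in Step 4.
SECTIONS = [
    "Verdict",
    "What works (SCALES findings)",
    "What blocks (BLOCKS findings)",
    "What's fixable (REFACTOR findings)",
    "Compatibility / cost notes (INFO findings)",
    "API support gaps",
    "Decision-framework summary",
    "Recommended next steps",
]


def render_report(findings: Dict[str, List[str]]) -> str:
    """Emit every heading; empty sections get a one-line placeholder."""
    lines = []
    for number, title in enumerate(SECTIONS, start=1):
        lines.append(f"{number}. {title}")
        entries = findings.get(title, [])
        if entries:
            lines.extend(f"   - {entry}" for entry in entries)
        else:
            lines.append("   None for this code")
    return "\n".join(lines)


print(render_report({"Verdict": ["READY"]}))
```

Iterating over the fixed heading list, rather than over whatever findings exist, is what guarantees that no heading is dropped when a section is empty.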

Step 5 — Hand off to cuPyNumeric Doctor for runtime validation

Direct the user to run cuPyNumeric Doctor once they have applied the recipes and the code runs:

```bash
CUPYNUMERIC_DOCTOR=1 CUPYNUMERIC_DOCTOR_FORMAT=json CUPYNUMERIC_DOCTOR_FILENAME=doctor-report.json legate --gpus 1 main.py
```

cuPyNumeric Doctor catches at runtime what source review can miss (scalar item access, ndarray iteration, advanced indexing, `nonzero` misuse, `mpi4py` import, in-place ops on views). End the assessment at: "now run with cuPyNumeric Doctor enabled; here is what to look for in its output."

Verdict framework

Assign the verdict qualitatively, from the kinds of findings, not a score:

| Verdict | When | Action |
| --- | --- | --- |
| READY | No BLOCKS; few/no REFACTOR | Swap the import; benchmark |
| LIGHT REFACTOR | A few recipe-fixable patterns (R201–R206), or one or two simple BLOCKS | Apply 1–3 recipes from `refactor-recipes.md`; re-walk to READY |
| SIGNIFICANT REFACTOR | Multiple BLOCKS in hot paths, or any R108 (`mpi4py`); these are rewrites, not disqualifications | Real project; budget 1–3 engineer-weeks per module |
| NOT RECOMMENDED | Only two failures: Gate 2 (arrays below the 65,536 floor) or Gate 4 (wrong compute pattern). A pile of BLOCKS does not land here | Restructure first or use a different runtime |

Apply these in order; the first match wins:

1. Gate 4 fails (sparse / graph / ML / sequential / string) → NOT RECOMMENDED.
2. Gate 2 fails (hot-path arrays < 65,536 elements/GPU, no realistic batching path) → NOT RECOMMENDED.
3. Any R108 (`mpi4py`) → SIGNIFICANT REFACTOR (the parallelism-layer rewrite is the cost, not a disqualification).
4. Multiple BLOCKS (R101–R111) across hot paths → SIGNIFICANT REFACTOR (count does not escalate past this; each BLOCKS has a documented recipe).
5. One or two recipe-fixable BLOCKS (e.g., R101–R104 element-loop / sync) → LIGHT REFACTOR.
6. Only REFACTOR patterns (R201–R206) → LIGHT REFACTOR; recipes are mechanical.
7. No BLOCKS, no REFACTOR → READY.
8. APIs missing from the manifest on the hot path → demote one tier (SIGNIFICANT stays SIGNIFICANT, never NOT RECOMMENDED). Single-GPU-only APIs matter only for multi-node.

Weigh the kinds of findings, not their count. One R101 in a hot loop outranks ten R001s: it destroys the scaling the R001s would have delivered. Conversely, a pile of BLOCKS + R108 is still SIGNIFICANT, not NOT RECOMMENDED; the tiers measure engineering cost, not despair. NOT RECOMMENDED requires a size or compute-pattern failure. Full framework: `references/decision-framework.md`.
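The first-match-wins ordering can be sketched in code. This is a hedged illustration, not the skill's implementation: the `Findings` fields and the function name `verdict` are hypothetical, "multiple" is assumed to mean more than two BLOCKS, and "demote one tier" is read as moving one step toward SIGNIFICANT REFACTOR and stopping there.

```python
from dataclasses import dataclass, field
from typing import List

TIERS = ["READY", "LIGHT REFACTOR", "SIGNIFICANT REFACTOR"]


@dataclass
class Findings:
    gate4_fails: bool = False            # wrong compute pattern
    gate2_fails: bool = False            # hot-path arrays below the per-GPU floor
    hot_path_blocks: List[str] = field(default_factory=list)  # R-codes, e.g. ["R101"]
    refactors: List[str] = field(default_factory=list)        # R-codes, e.g. ["R201"]
    missing_hot_path_apis: bool = False  # manifest gaps on the hot path


def verdict(f: Findings) -> str:
    # Rules 1-2: only a size or compute-pattern failure yields NOT RECOMMENDED.
    if f.gate4_fails or f.gate2_fails:
        return "NOT RECOMMENDED"
    # Rules 3-7, first match wins.
    if "R108" in f.hot_path_blocks:
        tier = 2
    elif len(f.hot_path_blocks) > 2:     # assumption: "multiple" means more than two
        tier = 2
    elif f.hot_path_blocks or f.refactors:
        tier = 1
    else:
        tier = 0
    # Rule 8: one tier worse, capped at SIGNIFICANT REFACTOR.
    if f.missing_hot_path_apis:
        tier = min(tier + 1, len(TIERS) - 1)
    return TIERS[tier]


print(verdict(Findings(hot_path_blocks=["R104"], refactors=["R201"])))
```

Because the gate checks return early and the cap sits at the last entry of `TIERS`, no combination of BLOCKS, R108, or manifest gaps can reach NOT RECOMMENDED, which mirrors the "kinds, not count" principle.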

What scales vs what blocks (at-a-glance)

- SCALES (keep as-is): vectorized elementwise, reductions, matmul / einsum, `np.where`, large-per-GPU stencil slicing `arr[1:-1, 1:-1]`, `out=`, boolean-mask indexing.
- BLOCKS (remove before migration): element loops, `np.vectorize`, `for row in arr`, `.item()/.tolist()/bool(arr)` in a hot loop, reducing `if` / `while` in a loop, `arr[::2]`, `dtype=object`, `mpi4py`, `order=`, `min/max/sum(arr)`.
- REFACTOR (apply a recipe): alloc in a loop, `x = x + y` rebind in a loop, `vstack/hstack/concatenate` in a loop, `np.nonzero()` + indexing, view-mutation of `diag/flip/flatten`, `reshape` in a hot loop.
- INFO (cost note, not a blocker): SciPy imports, single-device `linalg.qr/svd`, single-transform `fft.*`, size-thresholded `linalg.solve/cholesky`.

Full taxonomy in `idioms-that-scale.md` and `idioms-that-block.md`. Pass over silently any API the manifest doesn't list (it is out of scope of the upstream table, so flagging it would be noise).
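Two of the REFACTOR patterns (allocation or rebinding in a loop, and `concatenate` in a loop) might look like the following before and after a fix. This is a sketch in plain NumPy under stated assumptions: the function names are hypothetical, and the rewrites are generic hoisting and preallocation, not necessarily the recipes in `refactor-recipes.md`.

```python
import numpy as np  # would become `import cupynumeric as np` after the port


def step_alloc_in_loop(x, steps):
    """REFACTOR pattern: fresh allocation and `x = x + y` rebind every iteration."""
    for _ in range(steps):
        tmp = np.empty_like(x)            # new buffer each time around
        np.multiply(x, 0.5, out=tmp)
        x = x + tmp                       # rebinds x to yet another new array
    return x


def step_hoisted(x, steps):
    """Buffer allocated once; updates written in place via out=."""
    x = x.copy()
    tmp = np.empty_like(x)
    for _ in range(steps):
        np.multiply(x, 0.5, out=tmp)
        np.add(x, tmp, out=x)
    return x


def collect_concatenate(chunks):
    """REFACTOR pattern: growing an array with concatenate inside a loop."""
    out = np.empty(0)
    for chunk in chunks:
        out = np.concatenate([out, chunk])  # reallocates and copies every iteration
    return out


def collect_preallocated(chunks):
    """Size the result once, then fill slices."""
    total = sum(len(chunk) for chunk in chunks)  # builtin sum over Python ints, not an array
    out = np.empty(total)
    pos = 0
    for chunk in chunks:
        out[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    return out


print(step_hoisted(np.arange(4.0), steps=3))
```

Both pairs compute the same values; the difference is how many temporary arrays the runtime has to create and schedule per iteration.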

Reading order

The canonical, read-in-order guide lives in `references/getting-started.md`; read it once for orientation.

For a non-trivial assessment the must-reads are `idioms-that-block.md`, `refactor-recipes.md`, and `decision-framework.md`; the rest (`idioms-that-scale.md`, `gpu-stack.md`, `execution-model.md`, `partitioning-and-balance.md`, `case-studies.md`) are read on demand.

Limitations

- Does not run cuPyNumeric. No runtime required; this is the pre-port check. Actual speedup measurement happens after migration.
- Does not auto-generate refactored code. It identifies what to change and points to recipes; the user (or a follow-up agent) applies them.
- Does not profile the workload. For runtime measurement use `legate.timing.time()` and the upstream profiling and debugging guide.
- Does not replace judgment. Pattern matching misses implicit syncs inside logging, decorators that hide `.tolist()`, and runtime-data-dependent partition mismatches. Read the source too, especially in borderline cases.

Examples

A worked assessment of the bundled `assets/examples/` fixtures (an example, not a template):

Verdict: LIGHT REFACTOR. `scales_well.py` translates cleanly; `needs_refactor.py` needs one allocation hoisted; `blocks_scaling.py` syncs every iteration via `.item()`.

What works: `scales_well.py:23-31` (stencil R005), `:40-44` (reduction R002), `:18-22` (elementwise R001). What blocks: `blocks_scaling.py:51-58` (R104, `.item()` in hot loop) → RR-sync. What's fixable: `needs_refactor.py:21-28` (R201, alloc in loop) → RR-alloc. Next: apply the recipes; re-walk to READY; enable `CUPYNUMERIC_DOCTOR=1` on the first real run.

The full worked report is in `assets/sample_report.md`.

Authoritative upstream references

- Comparison table (source for `assets/api-support.md`): https://nv-legate.github.io/cupynumeric/api/comparison.html (mirror, most current) / `.../latest/api/comparison.html` on docs.nvidia.com (canonical)
- Best practices, Doctor, profiling, differences with NumPy, Legate launcher: under https://docs.nvidia.com/cupynumeric/latest/ (`user/practices.html`, `user/doctor.html`, `user/profiling_debugging.html`, `user/differences.html`) and https://docs.nvidia.com/legate/latest/manual/usage/running.html
- Source: https://github.com/nv-legate/cupynumeric

Available Scripts

| Script | Purpose | Arguments |
| --- | --- | --- |
| `scripts/fetch_api_support.py` | Scrape the upstream comparison table into `assets/api-support.md`. Python stdlib only; standalone. | `--default-path` (write the committed `assets/api-support.md`); `--docs-nvidia-url` (use canonical `docs.nvidia.com` instead of the default GitHub Pages mirror) |

The user runs this to refresh the manifest (`python scripts/fetch_api_support.py --default-path`).

Bundled references and assets

The `references/` files are enumerated under "Reading order" above (R-code ranges: `idioms-that-scale.md` = R001–R007 / R301–R305; `idioms-that-block.md` = R101–R111 / R201–R206). Assets: `assets/api-support.md` (committed API snapshot, load in Step 2), `assets/sample_report.md` and `assets/examples/*.py` (worked report and fixtures).

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| `Fetched:` line in the manifest > ~90 days old | Stale snapshot | Run `fetch_api_support.py --default-path` (user-run) |
| Manifest missing or scraper fails | Upstream HTML changed | `WebFetch` the comparison table for that assessment |
| NOT RECOMMENDED for many fixable BLOCKS | Heuristics applied out of order | Re-apply order: Gate 4 → Gate 2 → R108 → BLOCKS → REFACTOR; weigh kinds, not count |
| Kernel authoring or post-migration profiling | Out of scope | Decline and redirect (see "When to use"); no verdict |
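The staleness check in the first row could be sketched as follows. The exact layout of the `Fetched:` line is not specified here, so this assumes, purely for illustration, a line of the form `Fetched: YYYY-MM-DD`; the function name `manifest_is_stale` is hypothetical.

```python
import re
from datetime import date, timedelta
from typing import Optional


def manifest_is_stale(manifest_text: str, today: date, max_age_days: int = 90) -> Optional[bool]:
    """Return True/False for stale/fresh, or None if no recognizable Fetched: line exists.

    Assumption: the manifest carries a line like "Fetched: 2024-01-31" (ISO date).
    """
    match = re.search(r"^Fetched:\s*(\d{4}-\d{2}-\d{2})", manifest_text, flags=re.MULTILINE)
    if match is None:
        return None
    fetched = date.fromisoformat(match.group(1))
    return (today - fetched) > timedelta(days=max_age_days)


sample = "Fetched: 2024-01-01\n✓✓ numpy.add\n"
print(manifest_is_stale(sample, today=date(2024, 6, 1)))
```

Returning `None` for an unrecognized header keeps a missing or reformatted line distinct from a fresh snapshot, which corresponds to the "manifest missing or scraper fails" row instead.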
