cupynumeric-migration-readiness
cuPyNumeric Migration Readiness
Purpose
Use this skill BEFORE the migration, not during. Answer one question: which of the user's existing NumPy APIs will scale on cuPyNumeric, and which need refactoring, before they commit engineer-weeks to porting? To answer it: read the source, classify each NumPy idiom by its expected multi-GPU scaling on the Legate/NVIDIA GPU stack, cross-reference the bundled API-support manifest, and produce a structured verdict with per-finding reasoning and recipe pointers.
This is a static, read-only assessment. Inspect the user's source with `Read`, `Grep`, and `Glob`. Do not execute the user's code, modify or write files, or print environment variables or secrets. The `legate` and cuPyNumeric Doctor commands shown below are suggestions for the user to run, not actions this skill performs.
If this skill has never been seen before, head to `references/getting-started.md` first.
When to use this skill
Use when the user is about to migrate NumPy code to GPU and asks whether it will scale on cuPyNumeric / GPU, whether they should migrate, which parts will benefit, what must change before porting, or whether the port is worth it. Also use it when they mention pre-port assessment, scaling analysis, idiom analysis, GPU refactor planning, or identifying NumPy anti-patterns for GPU.
Decline and redirect when the request is not a pre-migration assessment:
- Post-migration performance / profiling ("already ported, why is it slow?") → point to `legate --profile` and the upstream profiling and debugging walkthrough.
- Custom CUDA / kernel authoring ("write/optimize a CUDA kernel")
A graph / sparse / ML / NLP workload that the user is asking to migrate is still in scope: assess it and return NOT RECOMMENDED via Gate 4. That is a verdict, not a decline.
Instructions
Run all five steps below, in order. Read the user's code and reason about it semantically; do not emit a one-shot prose verdict.
Step 1 — Gather context
Elicit before scanning code. Each item below has a default tuned to the typical workload — use the default when the user does not volunteer specifics; do not block on questions.
- Source location. Default to the current working directory when no path is given.
- Approximate hot-path array sizes at runtime. Default to 30–50 million elements. Map the user's numbers (or this default) to the Gate 2 tiers (65,536-element per-GPU floor; 10M+ elements for real single-GPU speedup; 100M+ for multi-GPU).
- Target hardware. Default to 1–4 GPUs, single-node. Confirm before assuming multi-node. For CPU-only runs, ask about RAM per node instead of FBMEM.
- Dominant compute pattern. Stencil / GEMM / Monte Carlo / reductions / mixed-with-SciPy. Ask the user to name it; otherwise infer it from the code in Step 3.
State the defaults you applied at the top of the assessment so the user can correct them. If a value is indeterminable, say so plainly and proceed with the qualitative-only assessment — do not fabricate numbers beyond the defaults above.
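The size mapping from Step 1 can be sketched as a small lookup. This is an illustration only: the function name `gate2_tier` and the tier labels are hypothetical, and the sketch assumes the 65,536 floor applies per GPU while the 10M and 100M thresholds apply to the total hot-path element count, which the text above does not spell out.

```python
# Thresholds quoted in Step 1; how they combine is an assumption of this sketch.
GATE2_FLOOR_PER_GPU = 65_536       # per-GPU floor
SINGLE_GPU_SPEEDUP = 10_000_000    # "10M+ for real single-GPU speedup"
MULTI_GPU_SPEEDUP = 100_000_000    # "100M+ for multi-GPU"


def gate2_tier(total_elements: int, gpus: int = 1) -> str:
    """Return a hypothetical tier label for a hot-path array size."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    per_gpu = total_elements // gpus
    if per_gpu < GATE2_FLOOR_PER_GPU:
        return "below-floor"       # Gate 2 fails unless batching is realistic
    if total_elements >= MULTI_GPU_SPEEDUP:
        return "multi-gpu"
    if total_elements >= SINGLE_GPU_SPEEDUP:
        return "single-gpu"
    return "marginal"              # above the floor, below a clear speedup


# The Step 1 default of 30-50 million elements lands in the single-GPU tier.
print(gate2_tier(40_000_000))      # single-gpu
```

Stating the tier next to the defaults at the top of the assessment makes it easy for the user to correct a wrong size guess.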
Step 2 — Load the API support manifest
Read `assets/api-support.md`, the committed snapshot of the upstream NumPy-vs-cuPyNumeric comparison table. For each NumPy API the code calls, find its line and read the leading glyph:
- `✓✓ numpy.X`: implemented and works on multi-GPU (the best path).
- `✓ numpy.X`: implemented but single-GPU/CPU only (a caveat for multi-node runs).
- `🟡 numpy.X — <note>`: partial support; read the note.
- `✗ numpy.X`: not implemented on the cuPyNumeric distributed path. Behavior on call is version-specific (some unsupported APIs route through host NumPy, others raise an exception); either way, hot-path use is a migration blocker. Do not promise users a silent fallback to host NumPy.
If the `Fetched:` line is more than ~90 days old, refresh the snapshot; see the Available Scripts section.
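The glyph legend above can be read mechanically. The sketch below assumes each manifest entry is one line of the form shown in the legend (optionally prefixed by a list marker); the function name `parse_manifest`, the status labels, and the sample statuses are hypothetical and do not reflect the real contents of `assets/api-support.md`.

```python
from typing import Dict, Tuple

# Longest glyph first so "✓✓" is not mistaken for a single "✓".
GLYPHS = (
    ("✓✓", "multi-gpu"),
    ("✓", "single-gpu-or-cpu"),
    ("🟡", "partial"),
    ("✗", "not-implemented"),
)


def parse_manifest(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each API name to (status, note); lines without a glyph are skipped."""
    table: Dict[str, Tuple[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip().lstrip("-* ").strip()   # tolerate list markers
        for glyph, status in GLYPHS:
            if line.startswith(glyph):
                rest = line[len(glyph):].strip()
                name, _, note = rest.partition("—")  # note follows the dash
                table[name.strip()] = (status, note.strip())
                break
    return table


# Made-up entries for illustration only.
SAMPLE = """\
Fetched: (date elided)
✓✓ numpy.add
✓ numpy.linalg.qr
🟡 numpy.einsum — hypothetical note
✗ numpy.busday_count
"""

manifest = parse_manifest(SAMPLE)
print(manifest["numpy.einsum"])   # ('partial', 'hypothetical note')
```

A lookup that returns nothing means the API is not listed in the manifest at all, which is a different case from a `✗` entry.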
Step 3 — Read the code semantically
Walk the user's files with `Read` and `Grep` and classify each region of array math against `references/idioms-that-scale.md` and `references/idioms-that-block.md` (full rationale and R-codes live there). Read semantically, not by regex: before flagging, confirm `arr` traces back to a `cupynumeric` array (or `np.*` aliased to it) and check whether the access sits inside a hot loop. Apply these rules:
- Flag element loops (`for i in range(n): arr[i] = ...`) as blockers; treat an epoch/step/file loop with a vectorized body as fine. Distinguish the two.
- Flag scalar sync: `.item()` / `float()` / `int()` / `bool()` / `complex()` on a cuPyNumeric array inside a hot loop (per-iteration host sync); allow it at the boundary.
- Flag reducing conditions: `if` / `while` over an array reduction (`while np.max(err) > tol:`) syncs every iteration.
- Flag hoistable allocation in a loop as a fixable inefficiency.
- Flag `mpi4py` in runtime code that partitions/communicates array data alongside `cupynumeric` (R108), but first confirm it issues MPI calls on a hot path; ignore a grep hit in a README, build script, or alt-launcher.
- Flag `order=` on `reshape` / `asarray` / `flatten` as R109, always, regardless of whether the version warns or silently no-ops.
- Always cite R304 in INFO for `np.random.*` under multi-GPU: cross-GPU bit-identical reproducibility is impossible by default (`--gpus N` / `LEGATE_GPUS` is the Legate launcher arg).
- Flag Python builtins on arrays (`sum` / `max` / `min` / `any` / `iter(arr)`): host-iteration fallback (R110; upstream best practices). Allow `len(arr)` (shape lookup; prefer `arr.shape[0]` / `arr.size` for 0-d safety).
- Flag `cupy` mixed with `cupynumeric` in a hot loop (R111); the runtimes don't share GPU memory, so every hop goes through host NumPy.
- Look up every NumPy API the code calls in `assets/api-support.md` (glyph legend in Step 2).
For the deep "why," read `references/gpu-stack.md` (memory, SM, communication, dispatch) and `references/execution-model.md` (lazy execution, sync points, mapper).
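Three of the rules above (element loops versus vectorized-body loops, reducing `while` conditions, and scalar sync inside a loop) can be approximated with Python's standard `ast` module. The sketch below only produces candidates: it does not prove that a name is a cuPyNumeric array or that a loop is hot, so each hit still needs the semantic check described above. The function name `scan` and the finding labels are hypothetical, not the skill's R-codes.

```python
import ast

REDUCTIONS = {"max", "min", "sum", "mean", "any", "all"}
SCALAR_SYNCS = {"item", "tolist"}


def _uses_name(node: ast.AST, name: str) -> bool:
    return any(isinstance(n, ast.Name) and n.id == name for n in ast.walk(node))


def _method_calls(node: ast.AST, names: set):
    """Yield calls such as x.item() or np.max(x) whose attribute is in names."""
    for n in ast.walk(node):
        if (isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute)
                and n.func.attr in names):
            yield n


def scan(source: str):
    """Return sorted (line, label) candidates for a human to confirm."""
    findings = set()
    for node in ast.walk(ast.parse(source)):
        # Element loop: a store like arr[i] = ... indexed by the loop variable.
        if isinstance(node, ast.For) and isinstance(node.target, ast.Name):
            for sub in ast.walk(node):
                if (isinstance(sub, ast.Subscript)
                        and isinstance(sub.ctx, ast.Store)
                        and _uses_name(sub.slice, node.target.id)):
                    findings.add((sub.lineno, "element-loop"))
        # Reducing condition: while np.max(err) > tol
        if isinstance(node, ast.While):
            if any(True for _ in _method_calls(node.test, REDUCTIONS)):
                findings.add((node.lineno, "reducing-condition"))
        # Scalar sync anywhere inside a loop.
        if isinstance(node, (ast.For, ast.While)):
            for call in _method_calls(node, SCALAR_SYNCS):
                findings.add((call.lineno, "scalar-sync"))
    return sorted(findings)


SAMPLE = '''\
import cupynumeric as np

def relax(u, err, tol, n):
    for i in range(n):
        u[i] = 0.5 * u[i]
    for step in range(100):
        u[1:-1] = 0.5 * (u[:-2] + u[2:])
    while np.max(err) > tol:
        err = err * 0.5
        print(err.sum().item())
    return u
'''

for line, label in scan(SAMPLE):
    print(line, label)
```

The `for step in range(100)` loop is not reported because its body is a vectorized slice update that never indexes by `step`, which is the distinction the first rule asks for.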
Step 4 — Produce a structured assessment
Deliver the report in this order. Cite `file:line` for every finding so the user can navigate.
- Verdict in one sentence: see "Verdict framework" below.
- What works (SCALES findings): quote representative lines so the user sees what will speed up after the import swap.
- What blocks (BLOCKS findings): each tied to `idioms-that-block.md` and a recipe in `refactor-recipes.md`.
- What's fixable (REFACTOR findings): group by recipe; one recipe often fixes many sites.
- Compatibility / cost notes (INFO findings): SciPy boundaries, single-GPU-only linalg / FFT, RNG layout vs `--gpus N`.
- API support gaps: APIs the code calls that are unimplemented or single-GPU only per the manifest.
- Decision-framework summary: Gates 1–6 from `references/decision-framework.md`, marked pass / fail / uncertain.
- Recommended next steps: which recipes to apply first, whether to port one module first, and when to involve cuPyNumeric Doctor.
All 8 sections must appear, even when the verdict is READY or NOT RECOMMENDED. Under an empty section write "None for this code" or "n/a — see verdict" in one line. Do NOT omit the heading; the headings are the structural contract the report is graded on. See `assets/sample_report.md` for worked reports.
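The "all 8 sections must appear" contract can be enforced by rendering from a fixed list of headings. This is a minimal sketch: the function name `render_report`, the shortened heading strings, and the sample finding are hypothetical, and the real layout is whatever `assets/sample_report.md` shows.

```python
SECTIONS = [
    "Verdict",
    "What works (SCALES findings)",
    "What blocks (BLOCKS findings)",
    "What's fixable (REFACTOR findings)",
    "Compatibility / cost notes (INFO findings)",
    "API support gaps",
    "Decision-framework summary",
    "Recommended next steps",
]


def render_report(findings: dict) -> str:
    """Emit every heading in order; empty sections get a one-line placeholder."""
    unknown = set(findings) - set(SECTIONS)
    if unknown:
        raise KeyError(f"unknown section(s): {sorted(unknown)}")
    lines = []
    for title in SECTIONS:
        lines.append(title)
        items = findings.get(title) or ["None for this code"]
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines)


report = render_report({
    "Verdict": ["READY: vectorized stencil, no blockers found"],
    "What works (SCALES findings)": ["solver.py:23-31 stencil slicing"],
})
print(report)
```

Because the headings come from one list, a verdict of READY or NOT RECOMMENDED still produces the full skeleton instead of a shortened report.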
Step 5 — Hand off to cuPyNumeric Doctor for runtime validation
Direct the user to run cuPyNumeric Doctor once they have applied the recipes and the code runs:
```bash
CUPYNUMERIC_DOCTOR=1 CUPYNUMERIC_DOCTOR_FORMAT=json CUPYNUMERIC_DOCTOR_FILENAME=doctor-report.json legate --gpus 1 main.py
```
cuPyNumeric Doctor catches at runtime what source review can miss (scalar item access, ndarray iteration, advanced indexing, `nonzero` misuse, `mpi4py` import, in-place ops on views). End the assessment at: "now run with cuPyNumeric Doctor enabled; here is what to look for in its output."
Verdict framework
Assign the verdict qualitatively, from the kinds of findings, not a score:
| Verdict | When | Action |
|---|---|---|
| READY | No BLOCKS; few/no REFACTOR | Swap the import; benchmark |
| LIGHT REFACTOR | A few recipe-fixable patterns (R201–R206), or one or two simple BLOCKS | Apply 1–3 recipes from `refactor-recipes.md` |
| SIGNIFICANT REFACTOR | Multiple BLOCKS in hot paths, or any R108 (`mpi4py`) | Real project; budget 1–3 engineer-weeks per module |
| NOT RECOMMENDED | Only two failures: Gate 2 (arrays below the 65,536 floor) or Gate 4 (wrong compute pattern). A pile of BLOCKS does not land here | Restructure first or use a different runtime |
Apply these in order; the first match wins:
- Gate 4 fails (sparse / graph / ML / sequential / string) → NOT RECOMMENDED.
- Gate 2 fails (hot-path arrays < 65,536 elements/GPU, no realistic batching path) → NOT RECOMMENDED.
- Any R108 (`mpi4py`) → SIGNIFICANT REFACTOR (the parallelism-layer rewrite is the cost, not a disqualification).
- Multiple BLOCKS (R101–R111) across hot paths → SIGNIFICANT REFACTOR (count does not escalate past this; each BLOCKS has a documented recipe).
- One or two recipe-fixable BLOCKS (e.g., R101–R104 element-loop / sync) → LIGHT REFACTOR.
- Only REFACTOR patterns (R201–R206) → LIGHT REFACTOR; recipes are mechanical.
- No BLOCKS, no REFACTOR → READY.
- APIs missing from the manifest on the hot path → demote one tier (SIGNIFICANT stays SIGNIFICANT, never NOT RECOMMENDED). Single-GPU-only APIs matter only for multi-node.
Weigh the kinds of findings, not their count. One R101 in a hot loop outranks ten R001s, because it destroys the scaling the R001s would have delivered. Conversely, a pile of BLOCKS + R108 is still SIGNIFICANT, not NOT RECOMMENDED: the tiers measure engineering cost, not despair. NOT RECOMMENDED requires a size or compute-pattern failure. Full framework: `references/decision-framework.md`.
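The first-match-wins ordering above can be written out as a short function. This is a sketch under stated assumptions, not the skill's implementation: the `Findings` fields and the `verdict` function are hypothetical names, "multiple BLOCKS" is taken to mean more than two, and integer counts are only a crude stand-in for the qualitative judgment the framework asks for.

```python
from dataclasses import dataclass

TIERS = ["READY", "LIGHT REFACTOR", "SIGNIFICANT REFACTOR"]


@dataclass
class Findings:
    gate4_fails: bool = False        # wrong compute pattern
    gate2_fails: bool = False        # arrays below the per-GPU floor
    has_r108: bool = False           # mpi4py on a hot path
    hot_path_blocks: int = 0         # BLOCKS findings in hot paths
    refactor_patterns: int = 0       # REFACTOR findings
    hot_path_api_gaps: bool = False  # manifest gaps on the hot path


def verdict(f: Findings) -> str:
    # Only a gate failure can produce NOT RECOMMENDED.
    if f.gate4_fails or f.gate2_fails:
        return "NOT RECOMMENDED"
    if f.has_r108 or f.hot_path_blocks > 2:   # assumption: "multiple" means > 2
        tier = 2
    elif f.hot_path_blocks >= 1 or f.refactor_patterns >= 1:
        tier = 1
    else:
        tier = 0
    if f.hot_path_api_gaps:
        tier = min(tier + 1, 2)               # demote one tier, capped at SIGNIFICANT
    return TIERS[tier]


print(verdict(Findings(has_r108=True, hot_path_blocks=9)))  # SIGNIFICANT REFACTOR
```

Note that no combination of BLOCKS, R108, and API gaps can return NOT RECOMMENDED here, which mirrors the rule that only a size or compute-pattern failure lands in that tier.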
What scales vs what blocks (at-a-glance)
- SCALES (keep as-is): vectorized elementwise, reductions, matmul / einsum, `np.where`, large-per-GPU stencil slicing `arr[1:-1, 1:-1]`, `out=`, boolean-mask indexing.
- BLOCKS (remove before migration): element loops, `np.vectorize`, `for row in arr`, `.item()/.tolist()/bool(arr)` in a hot loop, reducing `if` / `while` in a loop, `arr[::2]`, `dtype=object`, `mpi4py`, `order=`, `min/max/sum(arr)`.
- REFACTOR (apply a recipe): alloc in a loop, `x = x + y` rebind in a loop, `vstack/hstack/concatenate` in a loop, `np.nonzero()` + indexing, view-mutation of `diag`, `flip` / `flatten` / `reshape` in a hot loop.
- INFO (cost note, not a blocker): SciPy imports, single-device `linalg.qr/svd`, single-transform `fft.*`, size-thresholded `linalg.solve/cholesky`.
Full taxonomy in `idioms-that-scale.md` and `idioms-that-block.md`. Pass over silently any API the manifest doesn't list (out of scope of the upstream table; flagging it would be noise).
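The difference between the builtin `min/max/sum(arr)` entry under BLOCKS and the array-library reductions under SCALES is visible in the syntax tree: a builtin is called through a bare name, while a library reduction is called through an attribute. The sketch below uses that to list candidates; the function name `builtin_calls` is hypothetical, and a hit still needs confirmation that the argument is a cuPyNumeric array.

```python
import ast

# Builtins named in Step 3 as host-iteration fallbacks when applied to arrays.
HOST_BUILTINS = {"sum", "max", "min", "any", "iter"}


def builtin_calls(source: str):
    """Return sorted (line, builtin_name) pairs for bare builtin calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)     # sum(arr), not np.sum(arr)
                and node.func.id in HOST_BUILTINS):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)


SAMPLE = (
    "total = sum(arr)\n"      # builtin: candidate
    "peak = np.max(arr)\n"    # library reduction: not reported
    "n = len(arr)\n"          # allowed shape lookup: not reported
)
print(builtin_calls(SAMPLE))  # [(1, 'sum')]
```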
Reading order
The canonical, read-in-order guide lives in `references/getting-started.md`; read it once for orientation.
For a non-trivial assessment the must-reads are `idioms-that-block.md`, `refactor-recipes.md`, and `decision-framework.md`; the rest (`idioms-that-scale.md`, `gpu-stack.md`, `execution-model.md`, `partitioning-and-balance.md`, `case-studies.md`) are read on demand.
Limitations
- Does not run cuPyNumeric. No runtime required; this is the pre-port check. Actual speedup measurement happens after migration.
- Does not auto-generate refactored code. It identifies what to change and points to recipes; the user (or a follow-up agent) applies them.
- Does not profile the workload. For runtime measurement use `legate.timing.time()` and the upstream profiling and debugging guide.
- Does not replace judgment. Pattern matching misses implicit syncs inside logging, decorators that hide `.tolist()`, and runtime-data-dependent partition mismatches. Read the source too, especially in borderline cases.
Examples
A worked assessment of the bundled `assets/examples/` fixtures (an example, not a template):
Verdict: LIGHT REFACTOR. `scales_well.py` translates cleanly; `needs_refactor.py` needs one allocation hoisted; `blocks_scaling.py` syncs every iteration via `.item()`.
What works: `scales_well.py:23-31` (stencil R005), `:40-44` (reduction R002), `:18-22` (elementwise R001).
What blocks: `blocks_scaling.py:51-58` (R104: `.item()` in hot loop) → RR-sync.
What's fixable: `needs_refactor.py:21-28` (R201: alloc in loop) → RR-alloc.
Next: apply the recipes; re-walk to READY; enable `CUPYNUMERIC_DOCTOR=1` on the first real run.
The full worked report is in `assets/sample_report.md`.
Authoritative upstream references
- Comparison table (source for `assets/api-support.md`): https://nv-legate.github.io/cupynumeric/api/comparison.html (mirror, most current) / `.../latest/api/comparison.html` on docs.nvidia.com (canonical)
- Best practices, Doctor, profiling, differences with NumPy, Legate launcher: under https://docs.nvidia.com/cupynumeric/latest/ (`user/practices.html`, `user/doctor.html`, `user/profiling_debugging.html`, `user/differences.html`) and https://docs.nvidia.com/legate/latest/manual/usage/running.html
- Source: https://github.com/nv-legate/cupynumeric
Available Scripts
| Script | Purpose | Arguments |
|---|---|---|
| `scripts/fetch_api_support.py` | Scrape the upstream comparison table into `assets/api-support.md` | `--default-path` |
The user runs this to refresh the manifest (`python scripts/fetch_api_support.py --default-path`).
Bundled references and assets
The `references/` files are enumerated under Reading order above (R-code ranges: idioms-that-scale.md = R001–R007 / R301–R305; idioms-that-block.md = R101–R111 / R201–R206). Assets: `assets/api-support.md` (committed API snapshot, load in Step 2), and `assets/sample_report.md` and `assets/examples/*.py` (worked report and fixtures).
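The R-code ranges above give a quick way to route a code to its category and reference file. The sketch assumes R001–R007 are the SCALES codes and R301–R305 the INFO codes, which matches how R005 and R304 are used elsewhere in this document but is not stated outright; the function name `classify` is hypothetical.

```python
import re

# (first, last, category, reference file); category names for the
# idioms-that-scale.md ranges are an assumption of this sketch.
RANGES = [
    (1, 7, "SCALES", "idioms-that-scale.md"),
    (301, 305, "INFO", "idioms-that-scale.md"),
    (101, 111, "BLOCKS", "idioms-that-block.md"),
    (201, 206, "REFACTOR", "idioms-that-block.md"),
]


def classify(code: str):
    """Return (category, reference_file) for an R-code such as 'R104'."""
    match = re.fullmatch(r"R(\d{3})", code)
    if not match:
        raise ValueError(f"not an R-code: {code!r}")
    number = int(match.group(1))
    for first, last, category, ref in RANGES:
        if first <= number <= last:
            return category, ref
    raise ValueError(f"R-code outside the documented ranges: {code}")


print(classify("R104"))  # ('BLOCKS', 'idioms-that-block.md')
```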
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
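| `Fetched:` line in the manifest is more than ~90 days old | Stale snapshot | Run `python scripts/fetch_api_support.py --default-path` |
| Manifest missing or scraper fails | Upstream HTML changed | `WebFetch` the comparison table for this assessment |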
| NOT RECOMMENDED for many fixable BLOCKS | Heuristics applied out of order | Re-apply order: Gate 4 → Gate 2 → R108 → BLOCKS → REFACTOR; weigh kinds, not count |
| Kernel authoring or post-migration profiling | Out of scope | Decline and redirect (see "When to use") — no verdict |