cuPyNumeric Migration Readiness
Purpose
Use this skill BEFORE the migration, not during. Answer one question: which of the user's existing NumPy APIs will scale on cuPyNumeric, and which need refactoring, before they commit engineer-weeks to porting? To answer it: read the source, classify each NumPy idiom by its expected multi-GPU scaling on the Legate/NVIDIA GPU stack, cross-reference the bundled API-support manifest, and produce a structured verdict with per-finding reasoning and recipe pointers.
This is a static, read-only assessment. Inspect the user's source with
,
, and
. Do
not execute the user's code, modify or write files, or print environment variables or secrets. The
, and cuPyNumeric Doctor commands shown below are suggestions for the
user to run — not actions this skill performs.
If this skill has never been seen before, head to references/getting-started.md first.
When to use this skill
Use when the user is about to migrate NumPy code to GPU and asks whether it will scale on cuPyNumeric / GPU, whether they should migrate, which parts will benefit, what must change before porting, or whether the port is worth it — or mentions pre-port assessment, scaling analysis, idiom analysis, GPU refactor planning, or identifying NumPy anti-patterns for GPU.
Decline and redirect when the request is not a pre-migration assessment:
- Post-migration performance / profiling ("already ported, why is it slow?") → point to and the upstream profiling and debugging walkthrough.
- Custom CUDA / kernel authoring ("write/optimize a CUDA kernel")
A graph / sparse / ML / NLP workload that the user is asking to migrate is still in scope: assess it and return NOT RECOMMENDED via Gate 4. That is a verdict, not a decline.
Instructions
Run all five steps below, in order. Read the user's code and reason about it semantically; do not emit a one-shot prose verdict.
Step 1 — Gather context
Elicit before scanning code. Each item below has a default tuned to the typical workload — use the default when the user does not volunteer specifics; do not block on questions.
- Source location. Default to the current working directory when no path is given.
- Approximate hot-path array sizes at runtime. Default to 30–50 million elements. Map the user's numbers (or this default) to the Gate 2 tiers (65,536-element per-GPU floor; 10M+ for real single-GPU speedup; 100M+ for multi-GPU).
- Target hardware. Default to 1–4 GPUs, single-node. Confirm before assuming multi-node. For CPU-only runs, ask about RAM per node instead of FBMEM.
- Dominant compute pattern. Stencil / GEMM / Monte Carlo / reductions / mixed-with-SciPy. Ask the user to name it; otherwise infer it from the code in Step 3.
State the defaults you applied at the top of the assessment so the user can correct them. If a value is indeterminable, say so plainly and proceed with the qualitative-only assessment — do not fabricate numbers beyond the defaults above.
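The size-to-tier mapping from the list above can be sketched as a small helper. This is an illustration only, not part of the skill: the function name `gate2_tier` and the tier labels are hypothetical, and it assumes the 10M and 100M thresholds refer to total hot-path elements while the 65,536 floor applies per GPU, which the text does not spell out.

```python
def gate2_tier(total_elements: int, gpus: int = 1) -> str:
    """Map a hot-path array size to a Gate 2 tier label (hypothetical labels)."""
    if total_elements <= 0 or gpus < 1:
        raise ValueError("total_elements must be positive and gpus must be >= 1")

    # Assumption: the 65,536 floor is checked against the per-GPU share.
    per_gpu = total_elements // gpus
    if per_gpu < 65_536:
        return "below-floor"

    # Assumption: the 10M / 100M thresholds are checked against the total size.
    if total_elements >= 100_000_000:
        return "multi-gpu"
    if total_elements >= 10_000_000:
        return "single-gpu-speedup"
    return "above-floor-marginal"


# The Step 1 default (30-50 million elements on 1-4 GPUs) lands in the
# single-GPU-speedup tier under these assumptions.
print(gate2_tier(40_000_000, gpus=4))
```

State the tier alongside the defaults at the top of the assessment so the user can see which size assumption drove it.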
Step 2 — Load the API support manifest
Read
, the committed snapshot of the upstream NumPy-vs-cuPyNumeric comparison table. For each NumPy API the code calls, find its line and read the leading glyph:
- — implemented and works on multi-GPU (the best path).
- — implemented but single-GPU/CPU only (caveats multi-node).
- — partial support; read the note.
- — not implemented on the cuPyNumeric distributed path. Behavior on call is version-specific (some unsupported APIs route through host NumPy, others raise an exception) — either way, hot-path use is a migration blocker. Do not promise users a silent fallback to host-NumPy.
If the
line is more than ~90 days old, refresh the snapshot — see the
Available Scripts section.
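The ~90-day staleness rule is simple date arithmetic. A minimal sketch, assuming the snapshot date has already been read from the manifest (parsing the manifest line is left out because its exact format is not shown here); the function name `snapshot_is_stale` is hypothetical.

```python
from datetime import date


def snapshot_is_stale(snapshot_date: date, today: date, max_age_days: int = 90) -> bool:
    """Return True when the manifest snapshot is older than max_age_days."""
    return (today - snapshot_date).days > max_age_days


# Example with made-up dates: a snapshot from 1 January checked on 1 June
# is 151 days old, so it should be refreshed.
print(snapshot_is_stale(date(2025, 1, 1), date(2025, 6, 1)))
```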
Step 3 — Read the code semantically
Walk the user's files with
and
and classify each region of array math against
references/idioms-that-scale.md
and
references/idioms-that-block.md
(full rationale and R-codes live there). Read semantically, not by regex: before flagging, confirm
traces back to a
array (or
aliased to it) and check whether the access sits inside a hot loop. Apply these rules:
- Flag element loops (`for i in range(n): arr[i] = ...`) as blockers; treat an epoch/step/file loop with a vectorized body as fine. Distinguish the two.
- Flag scalar sync — / / / / on a cuPyNumeric array inside a hot loop (per-iteration host sync); allow it at the boundary.
- Flag reducing conditions — / over an array reduction () syncs every iteration.
- Flag hoistable allocation in a loop as a fixable inefficiency.
- Flag in runtime code that partitions/communicates array data alongside (R108) — but first confirm it issues MPI calls on a hot path; ignore a grep hit in a README, build script, or alt-launcher.
- Flag on / / as R109 — always, regardless of whether the version warns or silently no-ops.
- Always cite R304 in INFO for under multi-GPU: cross-GPU bit-identical reproducibility is impossible by default ( / is the Legate launcher arg).
- Flag Python builtins on arrays (////) — host-iteration fallback (R110; upstream best practices). Allow (shape lookup; prefer / for 0-d safety).
- Flag mixed with in a hot loop (R111); the runtimes don't share GPU memory, so every hop goes through host NumPy.
- Look up every NumPy API the code calls in (glyph legend in Step 2).
For the deep "why," read
(memory, SM, communication, dispatch) and
references/execution-model.md
(lazy execution, sync points, mapper).
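The element-loop rule can be made concrete with a syntax-tree check. The sketch below uses only the standard `ast` module; it parses source text without executing it, in keeping with the read-only constraint. The function name `find_element_loops` is hypothetical. It is deliberately narrow: it does not confirm that the indexed name is a cuPyNumeric array or that the loop is hot, so each hit is a candidate to confirm by reading the code.

```python
import ast


def find_element_loops(source: str) -> list[int]:
    """Return line numbers of `for <var> in range(...)` loops whose body
    assigns to a subscript that is indexed by the loop variable."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.For):
            continue
        # Only consider simple `for <name> in range(...)` loops.
        if not (
            isinstance(node.target, ast.Name)
            and isinstance(node.iter, ast.Call)
            and isinstance(node.iter.func, ast.Name)
            and node.iter.func.id == "range"
        ):
            continue
        loop_var = node.target.id
        for inner in ast.walk(node):
            if isinstance(inner, ast.Assign):
                targets = inner.targets
            elif isinstance(inner, ast.AugAssign):
                targets = [inner.target]
            else:
                continue
            for target in targets:
                # Flag `arr[i] = ...` where the index expression uses the loop variable.
                if isinstance(target, ast.Subscript) and any(
                    isinstance(n, ast.Name) and n.id == loop_var
                    for n in ast.walk(target.slice)
                ):
                    hits.add(node.lineno)
    return sorted(hits)


sample = (
    "for i in range(n):\n"
    "    arr[i] = arr[i] * 2\n"              # element loop: candidate blocker
    "for step in range(100):\n"
    "    u[1:-1] = 0.5 * (u[:-2] + u[2:])\n"  # step loop with a vectorized body: fine
)
print(find_element_loops(sample))
```

The step loop is not reported because its subscript never uses the loop variable, which mirrors the distinction between element loops and epoch/step loops drawn above.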
Step 4 — Produce a structured assessment
Deliver the report in this order. Cite
for every finding so the user can navigate.
- Verdict in one sentence — see "Verdict framework" below.
- What works (SCALES findings) — quote representative lines so the user sees what will speed up after the import swap.
- What blocks (BLOCKS findings) — each tied to and a recipe in .
- What's fixable (REFACTOR findings) — group by recipe; one recipe often fixes many sites.
- Compatibility / cost notes (INFO findings) — SciPy boundaries, single-GPU-only linalg / FFT, RNG layout vs .
- API support gaps — APIs the code calls that are unimplemented or single-GPU only per the manifest.
- Decision-framework summary: Gates 1–6 from references/decision-framework.md, marked pass / fail / uncertain.
- Recommended next steps — which recipes to apply first, whether to port one module first, and when to involve cuPyNumeric Doctor.
All 8 sections must appear, even when the verdict is READY or NOT RECOMMENDED. Under an empty section write
"None for this code" or
"n/a — see verdict" in one line — do NOT omit the heading; the headings are the structural contract the report is graded on. See
for worked reports.
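The "all 8 headings, always" contract can be sketched as a renderer that never drops a section. This is illustrative only: `SECTIONS` and `render_report` are hypothetical names, the plain-text heading style is an assumption, and findings are passed in as a dict of heading to list of strings.

```python
SECTIONS = [
    "Verdict",
    "What works (SCALES findings)",
    "What blocks (BLOCKS findings)",
    "What's fixable (REFACTOR findings)",
    "Compatibility / cost notes (INFO findings)",
    "API support gaps",
    "Decision-framework summary",
    "Recommended next steps",
]


def render_report(findings: dict) -> str:
    """Render every section in order; empty sections get a one-line placeholder."""
    lines = []
    for title in SECTIONS:
        lines.append(title)
        # A missing or empty section still keeps its heading.
        items = findings.get(title) or ["None for this code"]
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines)


print(render_report({"Verdict": ["READY: no BLOCKS, no REFACTOR findings."]}))
```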
Step 5 — Hand off to cuPyNumeric Doctor for runtime validation
Direct the user to run cuPyNumeric Doctor once they have applied the recipes and the code runs:
```bash
CUPYNUMERIC_DOCTOR=1 CUPYNUMERIC_DOCTOR_FORMAT=json CUPYNUMERIC_DOCTOR_FILENAME=doctor-report.json legate --gpus 1 main.py
```
cuPyNumeric Doctor catches at runtime what source review can miss (scalar item access, ndarray iteration, advanced indexing,
misuse,
import, in-place ops on views). End the assessment at: "now run with cuPyNumeric Doctor enabled; here is what to look for in its output."
Verdict framework
Assign the verdict qualitatively, from the kinds of findings, not a score:
| Verdict | When | Action |
|---|---|---|
| READY | No BLOCKS; few/no REFACTOR | Swap the import; benchmark |
| LIGHT REFACTOR | A few recipe-fixable patterns (R201–R206), or one or two simple BLOCKS | Apply 1–3 recipes from ; re-walk to READY |
| SIGNIFICANT REFACTOR | Multiple BLOCKS in hot paths, or any R108 () — rewrites, not disqualifications | Real project; budget 1–3 engineer-weeks per module |
| NOT RECOMMENDED | Only two failures: Gate 2 (arrays below the 65,536 floor) or Gate 4 (wrong compute pattern). A pile of BLOCKS does not land here | Restructure first or use a different runtime |
Apply these in order; the first match wins:
1. Gate 4 fails (sparse / graph / ML / sequential / string) → NOT RECOMMENDED.
2. Gate 2 fails (hot-path arrays < 65,536 elements/GPU, no realistic batching path) → NOT RECOMMENDED.
3. Any R108 () → SIGNIFICANT REFACTOR (the parallelism-layer rewrite is the cost, not a disqualification).
4. Multiple BLOCKS (R101–R111) across hot paths → SIGNIFICANT REFACTOR (count does not escalate past this; each BLOCKS has a documented recipe).
5. One or two recipe-fixable BLOCKS (e.g., R101–R104 element-loop / sync) → LIGHT REFACTOR.
6. Only REFACTOR patterns (R201–R206) → LIGHT REFACTOR; recipes are mechanical.
7. No BLOCKS, no REFACTOR → READY.

After the first match, apply one modifier: APIs missing from the manifest on the hot path → demote one tier (SIGNIFICANT stays SIGNIFICANT, never NOT RECOMMENDED). Single-GPU-only APIs matter only for multi-node.
Weigh the kinds of findings, not their count. One R101 in a hot loop outranks ten R001s: it destroys the scaling the R001s would have delivered. Conversely, a pile of BLOCKS plus R108 is still SIGNIFICANT, not NOT RECOMMENDED; the tiers measure engineering cost, not despair. NOT RECOMMENDED requires a size or compute-pattern failure. Full framework: references/decision-framework.md.
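The ordering rules above can be sketched as a first-match function. Treat this as one reading of the framework, not its definition: `assign_verdict` and its parameters are hypothetical, "multiple" is assumed to mean more than two BLOCKS, and "demote one tier" is assumed to mean one step toward more work, capped at SIGNIFICANT REFACTOR.

```python
TIERS = ["READY", "LIGHT REFACTOR", "SIGNIFICANT REFACTOR"]


def assign_verdict(
    gate4_ok: bool,
    gate2_ok: bool,
    has_r108: bool,
    hot_path_blocks: int,
    refactor_findings: int,
    missing_hot_path_api: bool = False,
) -> str:
    """Apply the verdict rules in order; the first match wins."""
    # Only a compute-pattern or size failure yields NOT RECOMMENDED.
    if not gate4_ok or not gate2_ok:
        return "NOT RECOMMENDED"

    if has_r108:
        tier = 2
    elif hot_path_blocks > 2:      # assumption: "multiple" means more than two
        tier = 2
    elif hot_path_blocks >= 1:     # one or two recipe-fixable BLOCKS
        tier = 1
    elif refactor_findings > 0:    # only REFACTOR patterns
        tier = 1
    else:
        tier = 0

    # Manifest gap on the hot path: one tier worse, never past SIGNIFICANT.
    if missing_hot_path_api:
        tier = min(tier + 1, len(TIERS) - 1)
    return TIERS[tier]


# A pile of BLOCKS plus R108 is still SIGNIFICANT REFACTOR, not NOT RECOMMENDED.
print(assign_verdict(True, True, has_r108=True, hot_path_blocks=10, refactor_findings=3))
```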
What scales vs what blocks (at-a-glance)
- SCALES (keep as-is) — vectorized elementwise, reductions, matmul / einsum, , large-per-GPU stencil slicing , , boolean-mask indexing.
- BLOCKS (remove before migration) — element loops, , ,
.item()/.tolist()/bool(arr)
in a hot loop, reducing / in a loop, , , , , .
- REFACTOR (apply a recipe) — alloc in a loop, rebind in a loop,
vstack/hstack/concatenate
in a loop, + indexing, view-mutation of , in a hot loop.
- INFO (cost note, not a blocker) — SciPy imports, single-device , single-transform , size-thresholded .
Full taxonomy in
and
. Pass over silently any API the manifest doesn't list (out of scope of the upstream table — flagging it would be noise).
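To make two of the categories above concrete, here is a before/after pair for an element loop (BLOCKS) and for allocation inside a loop (REFACTOR). The sketch is written against plain NumPy so it runs anywhere; that the same source would run after an import swap is the premise of this skill, not something the snippet verifies. All function names are hypothetical, and the bundled recipes may prescribe a different rewrite.

```python
import numpy as np  # assumption: a migrated version would import cupynumeric here instead


def smooth_element_loop(u):
    # BLOCKS-style: one Python-level iteration per array element.
    out = u.copy()
    for i in range(1, len(u) - 1):
        out[i] = 0.5 * (u[i - 1] + u[i + 1])
    return out


def smooth_vectorized(u):
    # SCALES-style: the same stencil expressed as whole-array slices.
    out = u.copy()
    out[1:-1] = 0.5 * (u[:-2] + u[2:])
    return out


def relax_alloc_in_loop(u, steps):
    # REFACTOR-style: a fresh scratch buffer is allocated on every step.
    u = u.copy()
    for _ in range(steps):
        scratch = np.empty(u.shape[0] - 2, dtype=u.dtype)
        np.add(u[:-2], u[2:], out=scratch)
        scratch *= 0.5
        u[1:-1] = scratch
    return u


def relax_hoisted(u, steps):
    # Same computation with the allocation hoisted out of the loop.
    u = u.copy()
    scratch = np.empty(u.shape[0] - 2, dtype=u.dtype)
    for _ in range(steps):
        np.add(u[:-2], u[2:], out=scratch)
        scratch *= 0.5
        u[1:-1] = scratch
    return u


u0 = np.linspace(0.0, 1.0, 16) ** 2
print(np.allclose(smooth_element_loop(u0), smooth_vectorized(u0)))
print(np.allclose(relax_alloc_in_loop(u0, 3), relax_hoisted(u0, 3)))
```

Both pairs compute the same values; only the first member of each pair is what Step 3 would flag.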
Reading order
The canonical, read-in-order guide lives in references/getting-started.md; read it once for orientation.
For a non-trivial assessment the must-reads are
,
, and
; the rest (
,
,
,
partitioning-and-balance.md
,
) are read on demand.
Limitations
- Does not run cuPyNumeric. No runtime required; this is the pre-port check. Actual speedup measurement happens after migration.
- Does not auto-generate refactored code. It identifies what to change and points to recipes; the user (or a follow-up agent) applies them.
- Does not profile the workload. For runtime measurement use and the upstream profiling and debugging guide.
- Does not replace judgment. Pattern matching misses implicit syncs inside logging, decorators that hide , runtime-data-dependent partition mismatches. Read the source too, especially in borderline cases.
Examples
A worked assessment of the bundled
fixtures (an example, not a template):
Verdict: LIGHT REFACTOR. translates cleanly;
needs one allocation hoisted;
syncs every iteration via
.
What works: (stencil R005),
(reduction R002),
(elementwise R001).
What blocks: (
R104 —
in hot loop) →
RR-sync.
What's fixable: (
R201 — alloc in loop) →
RR-alloc.
Next: apply the recipes; re-walk to READY; enable
on the first real run.
The full worked report is in
.
Authoritative upstream references
- Comparison table (source for ): https://nv-legate.github.io/cupynumeric/api/comparison.html (mirror, most current) /
.../latest/api/comparison.html
on docs.nvidia.com (canonical)
- Best practices, Doctor, profiling, differences with NumPy, Legate launcher — under https://docs.nvidia.com/cupynumeric/latest/ (, ,
user/profiling_debugging.html
, ) and https://docs.nvidia.com/legate/latest/manual/usage/running.html
- Source: https://github.com/nv-legate/cupynumeric
Available Scripts
| Script | Purpose | Arguments |
|---|---|---|
| scripts/fetch_api_support.py | Scrape the upstream comparison table into . Python stdlib only; standalone. | (write the committed ); (use canonical instead of the default GitHub Pages mirror) |
The user runs this to refresh the manifest (`python scripts/fetch_api_support.py --default-path`).
Bundled references and assets
The files are enumerated under Reading order above (R-code ranges: idioms-that-scale.md = R001–R007 / R301–R305; idioms-that-block.md = R101–R111 / R201–R206). Assets:
(committed API snapshot, load in Step 2),
and
(worked report and fixtures).
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| line in the manifest > ~90 days old | Stale snapshot | Run fetch_api_support.py --default-path (user-run) |
| Manifest missing or scraper fails | Upstream HTML changed | the comparison table for that assessment |
| NOT RECOMMENDED for many fixable BLOCKS | Heuristics applied out of order | Re-apply order: Gate 4 → Gate 2 → R108 → BLOCKS → REFACTOR; weigh kinds, not count |
| Kernel authoring or post-migration profiling | Out of scope | Decline and redirect (see "When to use") — no verdict |