cuPyNumeric Migration Readiness

Purpose

Use this skill BEFORE the migration, not during. Answer one question: which of the user's existing NumPy APIs will scale on cuPyNumeric, and which need refactoring, before they commit engineer-weeks to porting? To answer it: read the source, classify each NumPy idiom by its expected multi-GPU scaling on the Legate/NVIDIA GPU stack, cross-reference the bundled API-support manifest, and produce a structured verdict with per-finding reasoning and recipe pointers.

This is a static, read-only assessment. Inspect the user's source with

Read

Grep

, and

Glob

. Do not execute the user's code, modify or write files, or print environment variables or secrets. The

legate

, and cuPyNumeric Doctor commands shown below are suggestions for the user to run — not actions this skill performs.

If this skill has never been seen before, head to

references/getting-started.md

first.

When to use this skill

Use when the user is about to migrate NumPy code to GPU and asks whether it will scale on cuPyNumeric / GPU, whether they should migrate, which parts will benefit, what must change before porting, or whether the port is worth it — or mentions pre-port assessment, scaling analysis, idiom analysis, GPU refactor planning, or identifying NumPy anti-patterns for GPU.

Decline and redirect when the request is not a pre-migration assessment:

Post-migration performance / profiling ("already ported, why is it slow?") → point to
```
legate --profile
```
and the upstream profiling and debugging walkthrough.
Custom CUDA / kernel authoring ("write/optimize a CUDA kernel")

A graph / sparse / ML / NLP workload that the user is asking to migrate is still in scope: assess it and return NOT RECOMMENDED via Gate 4. That is a verdict, not a decline.

Instructions

Run all five steps below, in order. Read the user's code and reason about it semantically; do not emit a one-shot prose verdict.

Step 1 — Gather context

Elicit before scanning code. Each item below has a default tuned to the typical workload — use the default when the user does not volunteer specifics; do not block on questions.

Source location. Default to the current working directory when no path is given.
Approximate hot-path array sizes at runtime. Default to 30–50 million elements. Map the user's numbers (or this default) to the Gate 2 tiers (65K per-GPU floor; 10M+ for real single-GPU speedup; 100M+ for multi-GPU).
Target hardware. Default to 1–4 GPUs, single-node. Confirm before assuming multi-node. For CPU-only runs, ask about RAM per node instead of FBMEM.
Dominant compute pattern. Stencil / GEMM / Monte Carlo / reductions / mixed-with-SciPy. Ask the user to name it; otherwise infer it from the code in Step 3.

State the defaults you applied at the top of the assessment so the user can correct them. If a value is indeterminable, say so plainly and proceed with the qualitative-only assessment — do not fabricate numbers beyond the defaults above.

Step 2 — Load the API support manifest

Read

assets/api-support.md

, the committed snapshot of the upstream NumPy-vs-cuPyNumeric comparison table. For each NumPy API the code calls, find its line and read the leading glyph:

```
✓✓ numpy.X
```
— implemented and works on multi-GPU (the best path).
```
✓ numpy.X
```
— implemented but single-GPU/CPU only (caveats multi-node).
```
🟡 numpy.X — <note>
```
— partial support; read the note.
```
✗ numpy.X
```
— not implemented on the cuPyNumeric distributed path. Behavior on call is version-specific (some unsupported APIs route through host NumPy, others raise an exception) — either way, hot-path use is a migration blocker. Do not promise users a silent fallback to host-NumPy.

If the

Fetched:

line is more than ~90 days old, refresh the snapshot — see the Available Scripts section.

Step 3 — Read the code semantically

Walk the user's files with

Read

and

Grep

and classify each region of array math against

references/idioms-that-scale.md

and

references/idioms-that-block.md

(full rationale and R-codes live there). Read semantically, not by regex: before flagging, confirm

arr

traces back to a

cupynumeric

array (or

np.*

aliased to it) and check whether the access sits inside a hot loop. Apply these rules:

Flag element loops (
```
for i in range(n): arr[i] = ...
```
) as blockers; treat an epoch/step/file loop with a vectorized body as fine — distinguish the two.
Flag scalar sync —
```
.item()
```
/
```
float()
```
/
```
int()
```
/
```
bool()
```
/
```
complex()
```
on a cuPyNumeric array inside a hot loop (per-iteration host sync); allow it at the boundary.
Flag reducing conditions —
```
if
```
/
```
while
```
over an array reduction (
```
while np.max(err) > tol:
```
) syncs every iteration.
Flag hoistable allocation in a loop as a fixable inefficiency.
Flag
mpi4py
in runtime code that partitions/communicates array data alongside
```
cupynumeric
```
(R108) — but first confirm it issues MPI calls on a hot path; ignore a grep hit in a README, build script, or alt-launcher.
Flag
order=
on
```
reshape
```
/
```
asarray
```
/
```
flatten
```
as R109 — always, regardless of whether the version warns or silently no-ops.
Always cite R304 in INFO for
```
np.random.*
```
under multi-GPU: cross-GPU bit-identical reproducibility is impossible by default (
```
--gpus N
```
/
```
LEGATE_GPUS
```
is the Legate launcher arg).
Flag Python builtins on arrays (
```
sum
```
/
```
max
```
/
```
min
```
/
```
any
```
/
```
iter(arr)
```
) — host-iteration fallback (R110; upstream best practices). Allow
```
len(arr)
```
(shape lookup; prefer
```
arr.shape[0]
```
/
```
arr.size
```
for 0-d safety).
Flag
cupy
mixed with
cupynumeric
in a hot loop (R111); the runtimes don't share GPU memory, so every hop goes through host NumPy.
Look up every NumPy API the code calls in
```
assets/api-support.md
```
(glyph legend in Step 2).

For the deep "why," read

references/gpu-stack.md

(memory, SM, communication, dispatch) and

references/execution-model.md

(lazy execution, sync points, mapper).

Step 4 — Produce a structured assessment

Deliver the report in this order. Cite

file:line

for every finding so the user can navigate.

Verdict in one sentence — see "Verdict framework" below.
What works (SCALES findings) — quote representative lines so the user sees what will speed up after the import swap.
What blocks (BLOCKS findings) — each tied to
```
idioms-that-block.md
```
and a recipe in
```
refactor-recipes.md
```
.
What's fixable (REFACTOR findings) — group by recipe; one recipe often fixes many sites.
Compatibility / cost notes (INFO findings) — SciPy boundaries, single-GPU-only linalg / FFT, RNG layout vs
```
--gpus N
```
.
API support gaps — APIs the code calls that are unimplemented or single-GPU only per the manifest.
Decision-framework summary — Gates 1–6 from
```
references/decision-framework.md
```
, marked pass / fail / uncertain.
Recommended next steps — which recipes to apply first, whether to port one module first, and when to involve cuPyNumeric Doctor.

All 8 sections must appear, even when the verdict is READY or NOT RECOMMENDED. Under an empty section write "None for this code" or "n/a — see verdict" in one line — do NOT omit the heading; the headings are the structural contract the report is graded on. See

assets/sample_report.md

for worked reports.

Step 5 — Hand off to cuPyNumeric Doctor for runtime validation

Direct the user to run cuPyNumeric Doctor once they have applied the recipes and the code runs:

bash

CUPYNUMERIC_DOCTOR=1 CUPYNUMERIC_DOCTOR_FORMAT=json CUPYNUMERIC_DOCTOR_FILENAME=doctor-report.json legate --gpus 1 main.py

cuPyNumeric Doctor catches at runtime what source review can miss (scalar item access, ndarray iteration, advanced indexing,

nonzero

misuse,

mpi4py

import, in-place ops on views). End the assessment at: "now run with cuPyNumeric Doctor enabled; here is what to look for in its output."

Verdict framework

Assign the verdict qualitatively, from the kinds of findings, not a score:

Verdict	When	Action
READY	No BLOCKS; few/no REFACTOR	Swap the import; benchmark
LIGHT REFACTOR	A few recipe-fixable patterns (R201–R206), or one or two simple BLOCKS	Apply 1–3 recipes from `refactor-recipes.md` ; re-walk to READY
SIGNIFICANT REFACTOR	Multiple BLOCKS in hot paths, or any R108 ( `mpi4py` ) — rewrites, not disqualifications	Real project; budget 1–3 engineer-weeks per module
NOT RECOMMENDED	Only two failures: Gate 2 (arrays below the 65,536 floor) or Gate 4 (wrong compute pattern). A pile of BLOCKS does not land here	Restructure first or use a different runtime

Apply these in order; the first match wins:

Gate 4 fails (sparse / graph / ML / sequential / string) → NOT RECOMMENDED.
Gate 2 fails (hot-path arrays < 65,536 elements/GPU, no realistic batching path) → NOT RECOMMENDED.
Any R108 (
mpi4py
) → SIGNIFICANT REFACTOR (the parallelism-layer rewrite is the cost, not a disqualification).
Multiple BLOCKS (R101–R111) across hot paths → SIGNIFICANT REFACTOR (count does not escalate past this — each BLOCKS has a documented recipe).
One or two recipe-fixable BLOCKS (e.g., R101–R104 element-loop / sync) → LIGHT REFACTOR.
Only REFACTOR patterns (R201–R206) → LIGHT REFACTOR; recipes are mechanical.
No BLOCKS, no REFACTOR → READY.
APIs missing from the manifest on the hot path → demote one tier (SIGNIFICANT stays SIGNIFICANT, never NOT RECOMMENDED). Single-GPU-only APIs matter only for multi-node.

Weigh the kinds of findings, not their count. One R101 in a hot loop outranks ten R001s — it destroys the scaling the R001s would have delivered. Conversely a pile of BLOCKS + R108 is still SIGNIFICANT, not NOT RECOMMENDED — the tiers measure engineering cost, not despair. NOT RECOMMENDED requires a size or compute-pattern failure. Full framework:

references/decision-framework.md

What scales vs what blocks (at-a-glance)

SCALES (keep as-is) — vectorized elementwise, reductions, matmul / einsum,
```
np.where
```
, large-per-GPU stencil slicing
```
arr[1:-1, 1:-1]
```
,
```
out=
```
, boolean-mask indexing.

BLOCKS (remove before migration) — element loops,

np.vectorize

for row in arr

.item()/.tolist()/bool(arr)

in a hot loop, reducing

if

while

in a loop,

arr[::2]

dtype=object

mpi4py

order=

min/max/sum(arr)

REFACTOR (apply a recipe) — alloc in a loop,
```
x = x + y
```
rebind in a loop,
```
vstack/hstack/concatenate
```
in a loop,
```
np.nonzero()
```
+ indexing, view-mutation of
```
diag/flip/flatten
```
,
```
reshape
```
in a hot loop.
INFO (cost note, not a blocker) — SciPy imports, single-device
```
linalg.qr/svd
```
, single-transform
```
fft.*
```
, size-thresholded
```
linalg.solve/cholesky
```
.

Full taxonomy in

idioms-that-scale.md

and

idioms-that-block.md

. Pass over silently any API the manifest doesn't list (out of scope of the upstream table — flagging it would be noise).

Reading order

The canonical, read-in-order guide lives in

references/getting-started.md

— read it once for orientation.

For a non-trivial assessment the must-reads are

idioms-that-block.md

refactor-recipes.md

, and

decision-framework.md

; the rest (

idioms-that-scale.md

gpu-stack.md

execution-model.md

partitioning-and-balance.md

case-studies.md

) are read on demand.

Limitations

Does not run cuPyNumeric. No runtime required; this is the pre-port check. Actual speedup measurement happens after migration.
Does not auto-generate refactored code. It identifies what to change and points to recipes; the user (or a follow-up agent) applies them.
Does not profile the workload. For runtime measurement use
```
legate.timing.time()
```
and the upstream profiling and debugging guide.
Does not replace judgment. Pattern matching misses implicit syncs inside logging, decorators that hide
```
.tolist()
```
, runtime-data-dependent partition mismatches. Read the source too, especially in borderline cases.

Examples

A worked assessment of the bundled

assets/examples/

fixtures (an example, not a template):

Verdict: LIGHT REFACTOR.
scales_well.py
translates cleanly;
needs_refactor.py
needs one allocation hoisted;
blocks_scaling.py
syncs every iteration via
.item()
.
What works:
scales_well.py:23-31
(stencil R005),
:40-44
(reduction R002),
:18-22
(elementwise R001). What blocks:
blocks_scaling.py:51-58
(R104 —
.item()
in hot loop) → RR-sync. What's fixable:
needs_refactor.py:21-28
(R201 — alloc in loop) → RR-alloc. Next: apply the recipes; re-walk to READY; enable
CUPYNUMERIC_DOCTOR=1
on the first real run.

The full worked report is in

assets/sample_report.md

Authoritative upstream references

Comparison table (source for
```
assets/api-support.md
```
): https://nv-legate.github.io/cupynumeric/api/comparison.html (mirror, most current) /
```
.../latest/api/comparison.html
```
on docs.nvidia.com (canonical)
Best practices, Doctor, profiling, differences with NumPy, Legate launcher — under https://docs.nvidia.com/cupynumeric/latest/ (
```
user/practices.html
```
,
```
user/doctor.html
```
,
```
user/profiling_debugging.html
```
,
```
user/differences.html
```
) and https://docs.nvidia.com/legate/latest/manual/usage/running.html
Source: https://github.com/nv-legate/cupynumeric

Available Scripts

Script	Purpose	Arguments
`scripts/fetch_api_support.py`	Scrape the upstream comparison table into `assets/api-support.md` . Python stdlib only; standalone.	`--default-path` (write the committed `assets/api-support.md` ); `--docs-nvidia-url` (use canonical `docs.nvidia.com` instead of the default GitHub Pages mirror)

The user runs this to refresh the manifest (

python scripts/fetch_api_support.py --default-path

Bundled references and assets

The

references/

files are enumerated under Required reading order above (R-code ranges: idioms-that-scale.md = R001–R007 / R301–R305; idioms-that-block.md = R101–R111 / R201–R206). Assets:

assets/api-support.md

(committed API snapshot, load in Step 2),

assets/sample_report.md

and

assets/examples/*.py

(worked report and fixtures).

Troubleshooting

Symptom	Cause	Fix
`Fetched:` line in the manifest > ~90 days old	Stale snapshot	Run `fetch_api_support.py --default-path` (user-run)
Manifest missing or scraper fails	Upstream HTML changed	`WebFetch` the comparison table for that assessment
NOT RECOMMENDED for many fixable BLOCKS	Heuristics applied out of order	Re-apply order: Gate 4 → Gate 2 → R108 → BLOCKS → REFACTOR; weigh kinds, not count
Kernel authoring or post-migration profiling	Out of scope	Decline and redirect (see "When to use") — no verdict

cupynumeric-migration-readiness

NPX Install

Tags

SKILL.md Content