Expo Skill Eval
Evaluates skills in
for trigger accuracy, generated code quality, and/or runtime rendering in Expo Go.
Requirements: macOS with Xcode (iOS simulators), Android SDK with at least one AVD, and
. No other device tooling is assumed.
Workspace root:
/private/tmp/expo-skill-eval-<skill-name>/iteration-N/
(e.g.
/private/tmp/expo-skill-eval-expo-ui/iteration-4/
).
Before starting — clarify scope
Confirm all of the following up front, before any pipeline work — don't skip any (only skip a given item if the request already states that choice). Batch them into
calls of ≤4 questions each, in this order:
- Which skill to eval (if not clear from the request).
- Prompts — which prompts drive the eval. Built-in prompts (from the skill's eval cases) are pre-selected all; drop any, add a custom text prompt, or build from an uploaded screenshot (a target UI the skill must reproduce). See Prompts below.
- What to verify — one multi-select of three options: Runtime + screenshots / Trigger accuracy / Code checks (no device). See What to verify below.
- Expo SDK — latest (default, auto-detected) or a pinned version.
- Runner — Expo Go (default) or development build.
- Platforms — iOS / Android / web (always offer all three).
- Permission flag for — skip-permissions (default) or accept-edits.
- Viewer delivery — local only (default) or publish a shareable Artifact.
- If trigger accuracy is selected — confirm the published plugin is disabled (or not installed).
Each is detailed below. Items 4–6 (SDK, runner, platforms) fit naturally in one
call.
If the skill to eval is not clear from the request, list available skills from
and ask which one to evaluate.
How the skill under test is loaded — two mechanisms, one per phase (don't pick one globally): executor runs reference it by
file path (
SKILL_PATH = plugins/expo/skills/<skill>/SKILL.md
, read explicitly), while the trigger eval loads it as a
plugin (
--plugin-dir plugins/expo
, so the model can auto-select it from its description). Both point at the
local, in-repo version — that's what you're evaluating. You do
not need any special flag to
launch the harness session itself (the harness finds the skill by repo path); the mechanisms apply to the
subprocesses it spawns. See steps 1 and 3 for why each phase differs.
One pre-run check (required when the trigger eval is in scope): if the
published plugin is installed/enabled, disable it (via
)
before launching the harness and re-enable after. A single disable is a global-config change that both this session and the spawned
subprocesses inherit. Why it's required for the trigger eval: that phase loads the local skill via
, and a second installed
collides with it — the model may trigger the
published , and since detection only sees the tool-call name you'd silently score the published description instead of your local edits (the collision could also just error). The
executor / runtime / static phases are
not affected — they read the skill under test by its local
with no
— so a run with no trigger eval can skip the disable. Disabling
does
not disable
(a standalone project skill, not part of the
plugin), so the harness stays available.
Surface this to the user as an explicit up-front confirmation — the same way you confirm which skill to eval. When the trigger eval is in scope, ask the user to confirm the published
plugin is disabled (or not installed)
before you start step 1; if it's still enabled, pause and have them disable it via
. Don't run the trigger eval until they confirm — the harness can't reliably detect installed plugins on its own (reading the global plugin config or
would prompt), so this is a manual confirmation, not an auto-check.
Pick the prompts — built-in, custom, or a target screenshot. The prompts are the
inputs that drive the executor (with-skill and without-skill); they are separate from what you
verify. Confirm them with
(skip if the request already names a prompt):
- Built-in prompts — representative prompts you generate by reading the skill under test (its + ) and
references/runtime-matrix.md
, covering the skill's standard use cases. (If the skill already ships eval cases under , fold their fields in too — but most skills don't, so you usually derive them.) Pre-select all so the default run exercises the skill's standard cases; let the user deselect any.
- Custom text prompt — a one-off prompt the user types. Don't spend a dedicated option slot on this: auto-adds a "Type something" / Other entry, and anything typed there becomes a custom text case.
- Build from an uploaded screenshot — the user gives the path to a target screenshot (a UI to reproduce). The executor is told to open it — reads PNGs with its Read tool — and build an app matching it; the case records the path as , and grading compares the generated app to that target (step 6). This is the strongest visual test for a UI skill: "build this."
Respect 's 4-option-per-question cap with this priority (the bug to avoid: the upload option silently dropped once the four slots fill up):
- Always reserve a slot for "Build from an uploaded screenshot." It's the whole point of the visual eval and must never be the option that gets dropped.
- Don't add an explicit "Custom text prompt" option — the auto "Type something" / Other entry already covers it.
- Fill the remaining ≤3 slots with the built-in/representative prompts, pre-selected. If there are more than 3, collapse them into one pre-selected "All built-in prompts (default)" option and offer subset-picking in a short follow-up, so the upload option still fits.
Present it as a multi-select. When "Build from an uploaded screenshot" is picked, ask for the target image path in a follow-up. Each selected prompt (built-in, typed, or image) becomes one eval case (run with-skill and without-skill).
Always confirm what to verify unless the request makes it unambiguous. Present these options and let the user pick one or more (defaults in bold based on the skill's
references/runtime-matrix.md
entry):
| Option | What it does | When to suggest as default |
|---|
| Runtime + screenshots | Full pipeline: fixture → executor → static gate → run the app on iOS/Android and screenshot it. The runner (Expo Go or dev build) is a separate question — don't name it here. | Default for any skill that renders an app screen (the / rows in references/runtime-matrix.md
). Requires a booted simulator/emulator. |
| Trigger accuracy | Run realistic prompts via , check whether the skill is read. Measures recall (should-trigger queries only). | Always useful as a standalone check. |
| Code checks (no device) | + diff-aware lint + , plus the grader checks the generated code against any custom expectations you provide. No device. | Default for and skills, and whenever you want to verify code patterns (correct import path, a wrapper, …) without running the app. |
| Present these as ONE multi-select question — "What do you want to verify?" These are grading dimensions (how to judge what gets built) — distinct from the Prompts phase (what to build). The user may pick any combination. When a prompt is an uploaded screenshot (see Prompts), include "Runtime + screenshots" so the harness captures the generated app and the grader can score it against the target. | | |
Read
references/runtime-matrix.md
to find the skill's default mode before suggesting. If the request already specifies a mode (e.g. "just check if it triggers", "run it on device"), skip the question and proceed.
Pick the Expo SDK version — once, up front. Detect the latest with
bash /abs/path/expo-skill-eval/scripts/latest-sdk.sh
(it prints the major, e.g.
; internally it uses
to run
npm view expo dist-tags --json
and read the major via
/
, and it's covered by the bash-scripts rule — so don't run the registry query inline yourself, which would prompt). Then confirm with
: default to that latest SDK, or let the user pin an older one (e.g. to reproduce a version-specific issue). Use the chosen version everywhere the fixture is built — pass it as the
arg to
and write it into each eval case's
. If the request already names a version ("eval on SDK 54"), skip detection and use it.
Default to the latest — it stays compatible with the Expo Go that
installs on the device. Pinning an SDK
older than the device's installed Expo Go makes
try to prompt "Install the recommended Expo Go version?"; with no TTY (the snapshot scripts read stdin from
) it dies with
Input is required, but 'npx expo' is in non-interactive mode
and
every snapshot fails. So only pin an older SDK when you also pre-install a matching Expo Go on the simulator/emulator — otherwise stick with latest.
Pick the runner — Expo Go (default) or a development build. Ask with
(skip if the request already says which):
- Expo Go (default) — the snapshot scripts run the app with / as-is. Fast (no native compile), and it runs anything Expo Go bundles (including on SDK 56+). Cannot run custom native code (expo-modules, config plugins, native deps not in Expo Go).
- Development build — the snapshot scripts run / instead, compiling a native dev client per fixture. Use this for skills whose output needs custom native code (the cases that would otherwise be ). Much slower — prebuilds and natively compiles each fixture (minutes, especially the first), and needs the full iOS/Android build toolchain — so only choose it when the skill actually requires native code. Disk-heavy: each fixture's native build is multi-GB. The snapshot phase runs after each fixture to keep peak usage to ~one build, but still prefer fewer eval cases and a single platform for dev-build runs, and keep a few GB free. removes the per-fixture build output (, , , , , and the fixture's iOS DerivedData) and keeps the app source + git. The lever for dev-build disk is fewer eval cases + one platform — it only reclaims per-fixture build output and never touches shared dependency caches, so nothing gets re-downloaded.
Pass the choice to the snapshot scripts via the
env var (
default, or
), and reflect it in each eval case's
(
or
). See step 5.
Pick the platforms — always ask, regardless of skill. Offer iOS / Android / web (multi-select) with
; default to iOS + Android, but always present web as an option — don't pre-filter by skill.
Web is a valid choice for most skills:
's
universal components (
,
,
,
,
, …) render on web, as do
, NativeWind/Tailwind, API routes, and plain React Native. The only thing that won't show on web is a
platform-specific native tree (
or
), which renders blank there — and that blank is itself a useful signal, so it's still the user's call. Web runs via
(
+ Playwright/Chromium)
regardless of the runner (
is native-only; there's no web dev build), and it's the least-exercised path. Write the chosen set into each eval case's
and have
loop them.
Confirm how subprocesses run — once, before starting. Ask with
whether they may run with
--dangerously-skip-permissions
, then apply the same answer to every subprocess this run (never re-prompt mid-run):
- Skip permissions (recommended) — pass
--dangerously-skip-permissions
. Each subprocess runs unattended inside a throwaway fixture under /private/tmp/expo-skill-eval-*
and can write files and run setup commands without prompting.
- Accept edits only — pass
--permission-mode acceptEdits
instead. Bash/installs are auto-denied (no TTY), so some evals may produce partial output.
A bare
with neither flag can't write files at all. If the request already states a preference ("skip permissions", "don't use the dangerous flag"), skip the question.
Confirm how to deliver the results viewer — once, up front. Publishing to claude.ai is outward-facing, so never do it mid-run by surprise; ask in the same up-front
(alongside the permission flag):
- Local only (default) — writes and opens it in the local browser. Nothing leaves the machine.
- Publish a shareable Artifact — additionally render the viewer to a claude.ai Artifact (a default-private web page the user can share with teammates) at the very end. Only do this if the user opts in here.
If the request already says whether to share/publish, skip the question. See the Viewer section for the publish mechanics.
Eval case schema
You generate the run's eval cases — one per chosen prompt — and write them to
<workspace>/iteration-N/evals.json
(the viewer reads them from there). Each case extends the standard skill-creator eval-case shape with a
block and visual expectations:
json
{
"id": 1,
"prompt": "Build me a settings screen with a dark mode toggle and a list of options",
"expected_output": "Working Expo Router screen",
"expectations": [
"Uses Expo Router file-based routing",
"TypeScript compiles with no errors"
],
"runtime": {
"mode": "expo-go",
"platforms": ["ios", "android"],
"sdk": "56"
},
"visual_expectations": [
"No red error screen or Expo Go error overlay on any platform",
"A settings screen with a visible toggle control is rendered"
]
}
-
: how the eval runs after the static gate —
- : run in Expo Go () and screenshot. Fast, JS-only. Default.
- : build a native dev client () and screenshot. For skills whose output uses custom native code; much slower (native compile per fixture).
- : stop after the static gate — for skills that produce no UI, or when you don't want to run a device at all (CI).
Consult
references/runtime-matrix.md
for which repo skills support which mode. (
lets you actually run skills that previously had to be
for needing native code.)
-
: subset of
,
,
— chosen up front (always offered, not gated on the skill; see
Before starting). Defaults to
.
-
: Expo SDK major for the fixture app — set it to the version chosen up front (see
Before starting — clarify scope). Omit to use the latest template.
-
(optional —
image prompt): absolute path to a
target screenshot the skill must reproduce. When set, the executor is told to open it (via its Read tool) and build a matching app, and the grader scores how closely the generated app reproduces it (step 6) on top of the usual expectations. Set in the
Prompts phase via "build from an uploaded screenshot."
An image-prompt case is a normal case with
set; enable "Runtime + screenshots" so the harness captures the result to compare against the target:
json
{
"prompt": "Build an app whose UI matches the attached reference screenshot.",
"reference_image": "/abs/path/to/target.png",
"runtime": { "mode": "expo-go", "platforms": ["ios"], "sdk": "56" },
"visual_expectations": ["Matches the reference's layout, components, and color treatment"]
}
Pipeline per eval case
Orchestration model — on the main thread you run and almost nothing else. Every phase is driven by a small Python orchestrator you
into the workspace and run with
python3 /private/tmp/expo-skill-eval-<skill>/<phase>.py
(covered by the
rule). The orchestrators are the
only place the
files are invoked — always via
subprocess.run(["bash", "<scripts>/<name>.sh", …])
, which runs as a child of
and needs no rule of its own — and the only place parallelism, logging, and directory creation live. So on the main thread you only ever:
Write orchestrators,
run them with
,
inspect outputs with the
/
/
tools, and
spawn the grader subagent. Never put a command inside a chained/backgrounded/piped shell construct, and never run ad-hoc
/
/
/
/
— that is what prompts. (A single standalone
bash …/scripts/<name>.sh …
is fine for one-off manual debugging, e.g. re-running one flaky snapshot, but the pipeline itself goes through the orchestrators.)
Run each orchestrator in the foreground — let the tool call block until it finishes; the orchestrators already parallelize
within a phase, so you don't need to overlap phases. Do
not shell-background a phase with
/
(the
,
, and
segments have no rule and prompt). If you genuinely must run a phase while continuing other work, use the
Bash tool's parameter on a plain
python3 <orchestrator> 2>&1 | tee <ws>/…log
call — never hand-rolled shell
.
Expect exactly one permission prompt at the very start: the first
into the workspace.
can suppress
/
but not
/
, so choose
"allow all edits in this directory for the session" on that first prompt — it covers every orchestrator,
, and viewer file for the whole run.
0. Workspace setup
Create the run's directory tree once, with the workspace script —
never with ad-hoc (a raw
prompts: there is no
rule, and a
variable can't match a path glob anyway):
bash
bash /abs/path/expo-skill-eval/scripts/make-workspace.sh /private/tmp/expo-skill-eval-<skill> iteration-N <num-evals>
This creates
and
iteration-N/eval-<i>/{with_skill,without_skill}/outputs
for every eval. It is covered by
Bash(bash *expo-skill-eval/scripts/*)
, and the
s inside run as children of the script (no rule of their own). After this, every other directory is made by the scripts/orchestrators that need it (
, the executor orchestrator's
, the snapshot scripts) or by the
tool auto-creating parents — so you never need another
.
1. Trigger eval (should-trigger only)
Write a
script under the workspace's
directory. Use
only queries — the expo plugin is a family of complementary skills, so multiple skills triggering on the same prompt is not a failure. Measure recall only: realistic prompts that should use the skill, scored by trigger rate.
The script should run
per query (with
--output-format=stream-json --verbose --include-partial-messages
,
stripped from the env, and the permission flag confirmed up front in
Before starting — clarify scope) and detect whether the target skill was triggered by watching for its
or
tool call in the stream. Note:
--include-partial-messages
requires both
--output-format=stream-json
and
— omitting either causes an immediate CLI error.
Load the skill under test — pass to every trigger subprocess. The trigger eval measures whether the skill's
description makes the model reach for it, so the subprocess must have the
local skill (the version with your edits) loaded. A
subprocess does
not inherit the parent session's
, so add it explicitly:
--plugin-dir <plugin-root>
, where
is the
absolute path to the plugin directory that owns the skill — the
ancestor containing
.claude-plugin/plugin.json
(e.g.
--plugin-dir /Users/.../skills/plugins/expo
). It must be absolute: the subprocess runs from the throwaway
cwd, so a relative
won't resolve — and a missing plugin dir silently loads nothing, which masquerades as a 0% trigger rate. Then watch for the skill triggering under its plugin-qualified name (
, e.g.
). Two caveats: (1) if the
published plugin is also installed globally, disable it (via
) for the run and re-enable after — otherwise two copies of
collide in the subprocess and the model may trigger the
published , silently scoring its description instead of your local edits (trigger detection only sees the tool-call name, so it can't tell the copies apart; dev checkouts usually don't have it installed). (2) Never make a synthetic duplicate of the skill — a real loaded copy always wins, so the synthetic harness scores 0%. (Executors are unaffected by an installed plugin: they read the local
directly and pass no
.)
Run each query's subprocess from an empty throwaway cwd (e.g.
), not the repo root. A should-trigger prompt like "build me a settings screen" can make the subprocess write files, and with
--dangerously-skip-permissions
those writes would otherwise land in the skills repo. Trigger detection only needs the skill's
/
call to appear in the stream — it doesn't need a fixture — so any incidental writes are throwaway.
Set a per-query subprocess timeout of at least 300 seconds. A 180s limit is too short — some queries cause the model to start generating code before triggering the skill, which pushes total runtime past 3 minutes.
Run trigger evals once per skill, not per code eval case.
2. Fixture
Each executor run gets a fresh Expo app, created by
scripts/make-fixture.sh <app-path> <sdk> [clean|full]
:
bash
scripts/make-fixture.sh <workspace>/iteration-N/eval-X/<config>/app <sdk> # blank app (default)
scripts/make-fixture.sh <workspace>/iteration-N/eval-X/<config>/app <sdk> full # keep example tabs
The script creates the app with
bunx create-expo-app -t default@sdk-<version>
(or the latest template when no version is given) once per SDK version + variant, caches it under
~/.cache/expo-skill-eval/fixtures/
, and clones the cache with APFS copy-on-write — so the first run per variant pays the install cost and every later run is near-instant. The default
variant runs the template's
script, so executors start from a blank app and every screen in the output is theirs — a much cleaner grading signal. Use
only when the eval prompt assumes an existing app (e.g. "I have an app with two tabs..."). The script also resets git inside the clone, so
in the app shows exactly what the executor changed (useful evidence for the grader).
Build fixtures sequentially, then fan out executors — never create fixtures concurrently. shares a cache under
~/.cache/expo-skill-eval/fixtures/
keyed by SDK+variant. If two runs both find the cache cold and call
at the same time, bun's link step collides and one fails with
/ "could not determine executable to run for package create-expo-app". So in the executor orchestrator (step 3), create
all fixtures one at a time first — a plain Python loop calling
subprocess.run(["bash", "<scripts>/make-fixture.sh", app, sdk, variant])
(where
is the version chosen up front) —
then fan out the
executors with a
. Sequential creation is cheap: only the first fixture per SDK+variant pays the install cost; the rest are ~1s APFS clones. (And never fan fixtures out with ad-hoc shell like
make-fixture.sh A & make-fixture.sh B & wait
— the
/
segments prompt; the sequential Python loop avoids both the race and the prompt.)
3. Generate (executor subagents)
Run executors as
subprocess calls from a Python script,
not via the
tool. The
tool spawns subagents with their own permission context — file edits inside the fixture app will prompt the user. A
subprocess is a separate process outside the permission system entirely (the same pattern the trigger eval harness uses).
Write a Python script to
/private/tmp/expo-skill-eval-<skill>/run_executors.py
.
First create every run's fixture in a sequential loop —
subprocess.run(["bash", "<scripts>/make-fixture.sh", app, sdk, variant], …)
one at a time (concurrent creation races the shared bun cache — see step 2).
Then run the with-skill and without-skill
calls in parallel via a
. Both phases live inside Python (covered by the
rule), so nothing runs as ad-hoc shell on the main thread. Each executor prompt must include:
- The skill path (with-skill runs only) and the eval prompt.
- Image-prompt cases ( set): the absolute path to the target screenshot plus an instruction like "Open the reference screenshot at with your Read tool and build an app whose UI matches it as closely as you can — layout, components, spacing, and colors." ( renders PNGs read this way, so the executor can actually see the target.)
- The fixture app path: "Make your changes inside . The project already exists and has dependencies installed. Use absolute paths for all file operations."
- "Before writing any files, inspect the project layout — run , read and — to find the correct routes directory. Recent SDK default templates place Expo Router routes in ; older ones use at the project root — inspect to confirm which this fixture uses."
- "Do NOT start the dev server, boot simulators, or take screenshots — the harness does that after you finish."
- Where to save a short summary of what was built.
Flags for the
subprocess:
- Strip from the environment (
env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}
) — otherwise hangs silently when nested inside a running Claude Code session.
- A permission flag, confirmed with the user up front (see Before starting — clarify scope): either
--dangerously-skip-permissions
or --permission-mode acceptEdits
. Bake the chosen flag into the generated script. A bare with neither flag can't write files — it has no TTY to approve the edit and emits code as text instead.
- Do NOT pass to executors (unlike the trigger eval). The with-skill run already reads the skill by its absolute , so it tests the local content directly; and the without-skill run must have no skill available at all — loading the plugin would let the skill auto-trigger and contaminate the baseline. Keeping executors path-based also cleanly separates the two questions: the executor measures content quality (is the skill useful once read?), the trigger eval measures triggering (does the description get it picked?).
Capture stdout/stderr per run to a log file next to the fixture for grading evidence. Set timeout to 900s per executor — with-skill runs read multiple reference files before coding and regularly take 5–10 minutes.
4. Static gate
Write
and run it with
. For each eval/config app it calls
subprocess.run(["bash", "<scripts>/check-static.sh", app, "ios,android"], capture_output=True, …)
across a
(static gates are independent — run them concurrently
inside Python, never with shell
/
), and writes each result to
eval-<i>/<config>/static.json
(exit code + captured output) for the grader.
runs
,
, and
for the listed platforms. A passing export catches most import/syntax/missing-module failures without touching a device; a failing export short-circuits step 5 with a clean FAIL — record it and have the snapshot orchestrator skip that app.
5. Run + screenshot (serial across evals)
Write
and run it with
. Simulators and emulators are shared resources, so this orchestrator runs
serially (no thread pool): for each app that passed the static gate, and each platform, it
the
dir and calls
subprocess.run(["bash", "<scripts>/snapshot-<platform>.sh", app, f"{outputs}/<platform>.png", port], env={**os.environ, "EXPO_SKILL_EVAL_RUNNER": runner}, …)
. Pass the port as a positional argument: use
for iOS and
for Android —
expo run:ios/android --port N
is supported and using separate ports lets you run both platforms without port collisions if you ever parallelize. Screenshots land in the run's
directory so the viewer renders them inline.
Reclaim disk after each fixture — essential for runs. Once all selected platforms for an app are captured (and
before the next fixture builds), call
subprocess.run(["bash", "<scripts>/clean-fixture.sh", app])
. Each
leaves multi-GB native build output (iOS Pods + DerivedData, Android Gradle build); without this, evals × configs × iterations pile up and fill the disk mid-run (the instability you'll see is the disk filling).
removes the heavy regenerable dirs (
,
,
,
,
) and the fixture's iOS DerivedData, keeping the app source + git so the grader's
still works. With serial snapshots + per-fixture cleanup, peak disk stays at ~one fixture's build instead of all of them. (Harmless for
runs too — they just have little to reclaim.)
is the up-front choice (
default, or
). The snapshot scripts honor
:
launches with
(and the Expo Go install/deep-link dance);
launches with
expo run:<platform> --port <port>
, which compiles+installs a native dev client and skips the Expo Go steps. The scripts already default the
timeout to 900s, but bump
EXPO_SKILL_EVAL_BUNDLE_TIMEOUT
higher if the first native compile needs it.
pre-installs
in every fixture so the dev-client URL scheme is registered before
tries to deep-link the app open.
Snapshot scripts always capture the initial route . They open the app via a deep link and take one screenshot — they cannot tap or navigate. Design eval prompts so the feature under test renders at the root route. If the executor places the main UI behind a navigation action (e.g. an "Open Settings" button on the index), the snapshot will miss the feature entirely and all visual expectations will fail.
Each
frees its Metro port on startup (kills any stale process left on it by a crashed prior run) and tears Metro down on exit — so you never need to run
/
/
yourself to clear ports (that would prompt, and it's already handled). It then starts Metro, waits for the "Bundled" line in the Metro log, settles, captures a screenshot, and tears Metro down. iOS boots the newest available iPhone simulator if none is booted; Android boots the first AVD if no device is attached (the slow path — boot once and reuse across the whole iteration). Android first
recycles a wedged/ emulator (graceful
, then force-kill + adb reset) so a half-dead instance can't poison the run, and boots with
hardware GPU (
, Metal-accelerated on Apple Silicon). If
self-aborts the emulator on a given machine (qemu
deep in gfxstream/Metal — possible on Apple Silicon under load), edit
in
to a software mode (
renders reliably but slowly — bump the settle; avoid
, which
hangs at boot on arm64).
runs only when
includes web. Each script writes a Metro log next to the screenshot (
) — include it in the grader's inputs. If a script exits non-zero it still attempts a best-effort screenshot (an error screen is evidence too).
dev-build relaunch: after Metro is up, the scripts relaunch the app via
(iOS) and
adb shell am start -n <pkg>/.MainActivity
(Android) — both avoid the "Open in X?" system dialog that a URL-scheme deep link triggers on first launch.
After all screenshots for the iteration are captured, always generate the viewer — pass the workspace root to the checked-in script:
bash
python3 /abs/path/expo-skill-eval/scripts/generate_viewer.py /private/tmp/expo-skill-eval-<skill>
It writes
into the workspace root (one level above
) and opens it in the browser itself (via
) — so no separate
command (and no
rule) is needed. See the
Viewer section below.
6. Grade
Spawn a grader subagent in the foreground. Its prompt must include:
- The eval prompt, expectations list, and visual_expectations from the eval case.
- The instructions in (screenshot grading, redbox detection).
- The screenshot files, Metro logs, and the step-4 as inputs.
- Image-prompt cases (case has ): also include the target screenshot (),
references/design-rubric.md
, and the fixture's . Tell the grader to compare the generated screenshot(s) to the target and emit the + blocks below.
The grader writes
next to the outputs with this shape:
json
{
"score": 8.5,
"max_score": 9,
"expectations": [
{"text": "...", "passed": true, "evidence": "..."}
],
"reference_match": {
"score": 7, "max": 10,
"evidence": "ios.png vs target.png: same two-section grouped list + toggle; accent color differs (blue vs target's green); row spacing tighter than target"
},
"quality": {
"dimensions": [
{"name": "Layout & hierarchy", "score": 2, "max": 3, "evidence": "ios.png: …"}
],
"subtotal": 17,
"max": 24,
"summary": "…"
},
"user_notes_summary": {"needs_review": false, "notes": ""}
}
Visual expectations go into the same
array with evidence naming the screenshot file and describing what is visible. The
block (how closely the generated app reproduces the target screenshot) and the
block (design-rubric scores from
references/design-rubric.md
) are emitted
only for image-prompt cases — or when a quality grade is explicitly requested. Omit both for plain text-prompt runs.
Rollout phases
Build out and debug the pipeline in this order — each phase is independently useful:
- Static: steps 1–4 only (
runtime.mode: "static-only"
for everything). No devices needed; CI-friendly.
- iOS: add to the loop. is the most scriptable target.
- Android: add . Emulator boot is the slowest part — keep one emulator running for the whole session.
- Web: add for skills that target web (uses Playwright via ; first run downloads Chromium).
Practical notes
-
Temp locations: all eval workspaces go under
/private/tmp/expo-skill-eval-<skill-name>/iteration-N/
. Everything in this run —
,
,
, and
— is covered by the
frontmatter, so a correctly-loaded skill runs prompt-free.
-
Permission rule forms (why this skill stays prompt-free): the rule syntax matters and the two tool families behave differently:
- rules — path-scoped to the skill's own code (no broad interpreters).
Bash(python3 /private/tmp/expo-skill-eval-*)
(plus the alias) runs the Python orchestrators you generate under the workspace; Bash(python3 *expo-skill-eval/scripts/*)
runs the checked-in scripts/generate_viewer.py
; Bash(tee /private/tmp/expo-skill-eval-*)
(+ ) lets python3 … 2>&1 | tee <workspace>/…log
write a log without prompting; Bash(bash *expo-skill-eval/scripts/*)
runs only this skill's . Because every path is pinned, the escape hatches stay denied: , , , and running code anywhere else do not match (verified empirically — a scoped rule allows but blocks and any other path). Commands the scripts call internally — , , , , , — are children of the script, not Bash tool calls, so they need no rule. Do not run ad-hoc //// from the main thread (they have no rule and prompt — and a raw can't match a path glob because the path is an unexpanded variable): create the directory tree with (step 0), let orchestrators create their own dirs (), and inspect results with the // tools (no Bash rule needed).
- Bash rule matching (tested, non-obvious): a Bash rule is a gitignore-style glob over the command string. matches any run of characters including and spaces and works mid-pattern — so
Bash(python3 /private/tmp/expo-skill-eval-*)
matches python3 /private/tmp/expo-skill-eval-x/run.py 2>&1
, and Bash(bash *expo-skill-eval/scripts/*)
matches bash /any/abs/path/expo-skill-eval/scripts/foo.sh args
. Two gotchas that burned earlier attempts: is matched literally (never use it in a Bash rule), and the suffix only works right after the command token () — not after a partial path ( does not match). Compound commands split on , , , , and each segment needs its own matching rule.
- rules suppress prompts; / rules do not. This is a Claude Code asymmetry (not a pattern bug, and not reload — in a session where the / rules from this same frontmatter are clearly working, still prompts): file creation/editing always goes through Claude Code's edit-approval flow regardless of . The frontmatter still scopes // to (both the and forms, since macOS doesn't auto-resolve the symlink) as documentation and a guardrail, but those / entries won't silence the prompt on their own. Practical consequence: at the start of a run you get one Write prompt for the workspace — choose "Yes, allow all edits in this directory for the session" and every later orchestrator / / viewer write under that workspace goes through silently. That single directory approval, not a rule, is what makes file-writing prompt-free.
- Reload after editing frontmatter — a full restart, not . is read once when the skill loads at session start; reloads the skill body but does not reliably refresh the permission rules. After editing this file, quit Claude Code entirely and start a new session, then re-run the skill — otherwise a stale (cached) ruleset keeps prompting even though the file on disk is correct.
- Grader subagents run with their own permission context and will still prompt for file access — that is expected and separate from the main thread's rules.
-
Calling eval scripts — one standalone command, never chained. Invoke each script as its own Bash call with an absolute path:
bash /abs/path/expo-skill-eval/scripts/snapshot-ios.sh arg1 arg2
(covered by
Bash(bash *expo-skill-eval/scripts/*)
). Do
not combine it with
,
,
,
,
,
,
, or
— compound commands are checked per segment, and those extra segments have no rule, so the whole thing prompts even though the
part is allowed. (The one allowed pipe is
… 2>&1 | tee <workspace>/…log
, since the scoped
rule covers it.) Need parallelism or output trimming? Put it in a Python orchestrator (covered by
python3 /…/expo-skill-eval-*
), which runs scripts via
across a
. Inspect results with the
/
/
tools, not
/
/
.
General rule: under this skill's tight scoping, any ad-hoc shell the agent improvises will prompt — the fix is to move it into a script/orchestrator (or use the scoped ), never to broaden a rule.
-
Inspecting outputs (screenshots, logs, files) — use tools, not shell. To find files use the
Glob tool (e.g.
/private/tmp/expo-skill-eval-<skill>/iteration-N/**/ios.png
); to view them use the
Read tool — Read renders PNGs visually, which is exactly what you need to confirm a screenshot rendered. To search file contents use
Grep. Never use
/
/
for this: they prompt, and
is deliberately
not allowed because its
can run anything (e.g.
). These tools are scoped and prompt-free; reach for them every time you'd otherwise type
/
/
.
-
Generated Python scripts: write orchestration/aggregation scripts under the workspace (e.g.
/private/tmp/expo-skill-eval-<skill>/aggregate.py
) and run them with
(covered by
Bash(python3 /private/tmp/expo-skill-eval-*)
). The viewer is the exception — it's the checked-in
scripts/generate_viewer.py
, run via
Bash(python3 *expo-skill-eval/scripts/*)
.
auto-creates parent dirs but prompts the first time — approve the workspace directory once (see the
/
note above). Capture output either by having the script write its own log or via
python3 … 2>&1 | tee <workspace>/…log
(covered by the scoped
rule); read logs back with the
tool. Don't use
for setup (the scoped rule only matches a workspace script
path, so a bare
prompts).
-
Trigger evals vs installed plugin: detect the real installed skill name (e.g.
) in the stream — a synthetic-duplicate harness always scores 0% when the real plugin is installed because the model picks the genuine skill over the synthetic copy.
-
Benchmark aggregation: save each run's
+
under
. Write a Python aggregation script under the workspace and run it with
.
-
Expo Go ceiling: anything requiring custom native code (expo-module, App Clips, brownfield) cannot run in Expo Go. Use
mode for those — see
references/runtime-matrix.md
before writing eval cases for a skill (note:
does run in Expo Go on SDK 56+).
-
API-route skills: instead of a screenshot, verify with
against the route while Metro is up; record the response as an output file for grading.
-
Timing data: capture token counts and duration into
immediately after each executor run — it is not recoverable later. To capture token counts, add
--output-format=stream-json --verbose
to the executor
call and parse the
/
events from the log. Without these flags the log only contains prose and elapsed seconds are the only recoverable metric.
-
First-launch dialogs: Expo Go occasionally shows a one-time prompt on a fresh simulator. If a screenshot captures a dialog instead of the app, re-run the snapshot script (it reopens the URL) and re-capture.
Viewer
After taking screenshots, always generate and open the HTML viewer so the user can see results immediately without being asked. The viewer is the checked-in
scripts/generate_viewer.py
— run it with the workspace root as its argument:
bash
python3 /abs/path/expo-skill-eval/scripts/generate_viewer.py /private/tmp/expo-skill-eval-<skill>
It writes a self-contained
/private/tmp/expo-skill-eval-<skill>/viewer.html
and opens it in the browser itself (
). What it renders:
- A tab per iteration ( under the workspace root; remembers the last active tab in ).
- For each eval case (read from ): side-by-side with_skill / without_skill columns, each showing static-gate status, score, the platform screenshots (click to zoom; embedded as base64 URIs so the file is self-contained), the expectation list with PASS/FAIL badges, and reviewer notes.
- For image-prompt cases (a with / ): the target screenshot beside the generated ones, the score (generated vs target), the rubric per config (one bar per dimension with its score/max plus the subtotal), and the quality delta (with_skill − without_skill subtotal) in the summary bar alongside the correctness delta.
- A summary bar with with_skill %, without_skill %, and delta.
- A trigger accuracy table when
trigger-evals/trigger_results.json
exists.
- A dark background with color-coded scores (green ≥85%, amber ≥65%, red below).
Publishing the viewer (only if opted in up front)
The local
is always generated.
Only when the user chose "Publish a shareable Artifact" in the up-front confirmation, additionally render it to a claude.ai Artifact at the very end — never publish without that opt-in (it's outward-facing and a published page can be cached/indexed). Mechanics:
- The tool wraps the file in its own
<!doctype html>…<head></head><body>
skeleton, so the file you hand it must be page content only — inline /, base64 images, and a , but no <!DOCTYPE>/<html>/<head>/<body>
tags of its own (a full standalone document gets double-wrapped and renders wrong).
- The script emits an Artifact-friendly variant when you add :
python3 /abs/path/expo-skill-eval/scripts/generate_viewer.py /private/tmp/expo-skill-eval-<skill> --artifact
writes (same content, skeleton stripped, no browser open). Pass that file to the tool (), not the standalone one.
- The viewer is already self-contained (base64 screenshots, inline CSS/JS), so it satisfies the Artifact CSP (no external hosts).
References
references/runtime-matrix.md
— per-skill runtime applicability (expo-go vs static-only, platform notes).
- — screenshot grading instructions for the grader subagent.