Blacksmith Testbox
Scope
Use Testbox when you need remote CI parity, injected secrets, hosted services,
or an OS/runtime image that your local machine cannot provide cheaply.
For OpenClaw, Crabbox is a supported alternative when Blacksmith is unavailable
or owned cloud capacity is preferable.
Do not default to Testbox for every local test/build loop. If the repo has
documented local commands for normal iteration, use those first so you keep
warm caches, local build state, and fast feedback.
Testbox is the expensive path. Reach for it deliberately.
OpenClaw maintainers can opt into Testbox-first validation by setting
in their environment or standing agent rules. This mode is
maintainers-only and requires Blacksmith access.
- Pre-warm a Testbox early for longer, wider, or uncertain work.
- Prefer Testbox for gates, e2e, package-like proof, and broad suites.
- Reuse the same Testbox ID for every run command in the same task/session.
- Use local commands only when the task explicitly sets
OPENCLAW_LOCAL_CHECK_MODE=throttled|full
, or when the user asks for local
proof.
Install the CLI
If
is not installed, install it:
curl -fsSL https://get.blacksmith.sh | sh
For the canary channel (bleeding-edge):
BLACKSMITH_CHANNEL=canary sh -c 'curl -fsSL https://get.blacksmith.sh | sh'
Then authenticate:
Agent-triggered browser auth (non-interactive)
When an agent needs to ensure the user is authenticated before running testbox
commands (e.g. warmup, run), use browser-based auth with non-interactive mode.
This opens the browser for the user to sign in; the agent does not interact with
the browser. The org selector in the dashboard is skipped, so the user only sees
the sign-in flow.
Required command (
is required with
):
blacksmith auth login --non-interactive --organization <org-slug>
The org slug can come from
env var or the
global flag.
If neither is set, the agent should use the project's known org (e.g. from repo
config or user context). Example:
blacksmith auth login --non-interactive --organization acme-corp
blacksmith --org acme-corp auth login --non-interactive --organization acme-corp
Flow: The CLI starts a local callback server, opens the browser to the
dashboard auth page, and blocks for up to 2 minutes. The user completes sign-in
and authorization in the browser. The dashboard redirects to localhost with the
token; the CLI saves credentials and exits. The agent then proceeds.
Do not use for this flow — that is for headless/token-based
auth. This skill focuses on browser-based auth when the user prefers signing in
via the web UI.
Optional flags:
- — Override dashboard URL (e.g. for staging)
Decide first: local or Testbox
Before warming anything up, check the repo's own instructions.
Prefer local commands when:
- the repo documents a supported local test/build workflow
- you are iterating on unit tests, lint, typecheck, formatting, or other
local-only validation
- the value comes from warm local caches and fast repeat runs
- the command does not need remote secrets, hosted services, or CI-only images
Prefer Testbox when:
- the repo explicitly requires CI-parity or remote validation
- the command needs secrets, service containers, or provisioned infra
- you are reproducing CI-only failures
- you need the exact workflow image/job environment from GitHub Actions
For OpenClaw specifically, normal local iteration stays local unless maintainer
Testbox mode is enabled with
:
pnpm test <path-or-filter>
If
is enabled, run those same repo commands inside the
warm Testbox. If the user wants laptop-friendly local proof for one command, use
the explicit escape hatch
OPENCLAW_LOCAL_CHECK_MODE=throttled
.
For installable-package product proof, prefer the GitHub
workflow over an ad hoc Testbox command. It resolves one package candidate
(
,
,
, or
), uploads it as
, and runs the reusable Docker E2E lanes against that exact
tarball on GitHub/Blacksmith runners. Use
for the trusted
workflow/harness code and
for the source ref to pack when testing
an older trusted branch, tag, or SHA.
Setup: Warmup before coding
If you decided Testbox is warranted, warm one up early. This returns an ID
instantly and boots the CI environment in the background while you work:
blacksmith testbox warmup ci-check-testbox.yml
# → tbx_01jkz5b3t9...
Save this ID in the current session. You need it for every
command.
Treat
as diagnostics, not a reusable work queue.
Listed boxes can be visible at the org/repo level while still being unusable or
stale for the current local agent lane.
For OpenClaw maintainer Testbox mode, pre-warm at the start of longer or wider
tasks:
blacksmith testbox warmup ci-check-testbox.yml --ref main --idle-timeout 90
pnpm testbox:claim --id <ID>
Use the build-artifact warmup when e2e/package/build proof benefits from seeded
,
, and build-all caches:
blacksmith testbox warmup ci-build-artifacts-testbox.yml --ref main --idle-timeout 90
pnpm testbox:claim --id <ID>
Warmup dispatches a GitHub Actions workflow that provisions a VM with the
full CI environment: dependencies installed, services started, secrets
injected, and a clean checkout of the repo at the default branch.
In OpenClaw, raw commit SHAs are not reliable dispatch refs for
;
use a branch or tag. The build-artifact workflow resolves
and
to SHA cache keys internally.
Options:
--ref <branch|tag> Git ref to dispatch against (default: repo's default branch)
--job <name> Specific job within the workflow (if it has multiple)
--idle-timeout <min> Idle timeout in minutes (default: 30)
CRITICAL: Always run from the repo root
ALWAYS invoke
commands from the
root of the git
repository. The CLI syncs the current working directory to the testbox
using rsync with
. If you run from a subdirectory (e.g.
cd backend && blacksmith testbox run ...
), rsync will mirror only that
subdirectory and
delete everything else on the testbox — wiping other
directories like
,
, etc.
# CORRECT — run from repo root, use paths in the command
blacksmith testbox run --id <ID> "cd backend && php artisan test"
blacksmith testbox run --id <ID> "cd dashboard && npm test"
# WRONG — do NOT cd into a subdirectory before invoking the CLI
cd backend && blacksmith testbox run --id <ID> "php artisan test"
If your shell is in a subdirectory,
back to the repo root first:
cd "$(git rev-parse --show-toplevel)"
blacksmith testbox run --id <ID> "cd backend && php artisan test"
Running commands
blacksmith testbox run --id <ID> "<command>"
The
command automatically waits for the testbox to become ready if
it is still booting, so you can call
immediately after warmup without
needing to check status first.
In OpenClaw, prefer the guarded runner wrapper so stale/reused ids fail before
the Blacksmith CLI spends time syncing or emits a confusing missing-key error:
pnpm testbox:run --id <ID> -- "OPENCLAW_TESTBOX=1 pnpm check:changed"
The wrapper refuses to run when the local per-Testbox key is missing or when the
id was not claimed by this OpenClaw checkout with
pnpm testbox:claim --id <ID>
. Treat that as the expected remediation, not as a GitHub account or
normal SSH-key problem. A local key alone is not enough; a ready box may still
carry stale rsync state from another lane.
If the agent crashes, the remote box relies on Blacksmith's idle timeout. The
local OpenClaw claim marker is not deleted automatically, so the wrapper treats
claims older than 12 hours as stale. Override only for intentional long-running
work with
OPENCLAW_TESTBOX_CLAIM_TTL_MINUTES=<minutes>
.
Before spending a broad gate on a manually assembled command, you can also run:
pnpm testbox:sanity -- --id <ID>
Downloading files from a testbox
Use the
command to retrieve files or directories from a running
testbox to your local machine. This is useful for fetching build artifacts,
test results, coverage reports, or any output generated on the testbox.
blacksmith testbox download --id <ID> <remote-path> [local-path]
The remote path is relative to the testbox working directory (same as
).
If no local path is specified, the file is saved to the current directory
using the same base name.
To download a directory, append a trailing
to the remote path — this
triggers recursive mode:
# Download a single file
blacksmith testbox download --id <ID> coverage/report.html
# Download a file to a specific local path
blacksmith testbox download --id <ID> build/output.tar.gz ./output.tar.gz
# Download an entire directory
blacksmith testbox download --id <ID> test-results/ ./results/
Options:
--ssh-private-key <path> Path to SSH private key (if warmup used --ssh-public-key)
How file sync works
Understanding this model is critical for using Testbox correctly.
When you call
, the CLI performs a
delta sync of your local changes
to the remote testbox before executing your command:
-
The testbox VM starts from a clean
at the warmup ref.
The workflow's setup steps (e.g.
,
,
)
run during warmup and populate dependency directories on the remote VM.
-
On each
, the CLI uses
git to detect which files changed locally
since the last sync. It syncs ONLY tracked files and untracked non-ignored
files (i.e. files that
reports).
-
'd directories are never synced. This means directories
like
,
,
,
,
, etc. are
NOT transferred from your local machine. The testbox uses its own copies
of those directories, populated during the warmup workflow steps.
-
If nothing has changed since the last sync (same git commit and working
tree state), the sync is skipped entirely for speed.
Why this matters
-
Changing dependencies: If you modify
,
,
,
, or similar dependency manifests, the lock/manifest
file will be synced but the actual dependency directory will NOT. You must
re-run the install command on the testbox:
blacksmith testbox run --id <ID> "npm install && npm test"
blacksmith testbox run --id <ID> "pip install -r requirements.txt && pytest"
blacksmith testbox run --id <ID> "composer install && phpunit"
-
Generated/build artifacts: If your tests depend on a build step (e.g.
,
), and you changed source files that affect the build
output, re-run the build on the testbox before testing.
-
New untracked files: New files you create locally ARE synced (as long as
they are not gitignored). You do not need to
them first.
-
Deleted files: Files you delete locally are also deleted on the remote
testbox. The sync model keeps the remote in lockstep with your local managed
file set.
CRITICAL: Do not ban local tests
Do not assume local validation is forbidden. Many repos intentionally invest in
fast, warm local loops, and forcing every run through Testbox destroys that
advantage.
Use Testbox for the checks that actually need it: remote parity, secrets,
services, CI-only runners, or reproducibility against the workflow image.
If the repo says local tests/builds are the normal path, follow the repo.
OpenClaw maintainer exception: if
is set by the user or
agent environment, treat Testbox as the normal validation path for this repo.
Use
OPENCLAW_LOCAL_CHECK_MODE=throttled|full
as the explicit local escape
hatch.
When to use
Use Testbox when:
- running database migrations or destructive environment checks
- running commands that depend on secrets or environment variables not present locally
- reproducing CI-only failures or validating against the workflow image
- validating behavior that needs provisioned services or remote runners
- doing a final parity check before commit/push when the repo or user wants that
Trim that list based on repo guidance. If the repo documents supported local
tests/builds, prefer local for routine iteration and keep Testbox for the
checks that need parity or remote state.
Workflow
- Decide whether the repo's local loop is the right default. For OpenClaw,
makes Testbox the maintainer default.
- If Testbox is warranted, warm up early:
blacksmith testbox warmup ci-check-testbox.yml --ref main --idle-timeout 90
→ save the ID,
then pnpm testbox:claim --id <ID>
- Write code while the testbox boots in the background.
- Run the remote command when needed:
pnpm testbox:run --id <ID> -- "OPENCLAW_TESTBOX=1 pnpm check:changed"
- If tests fail, fix code and re-run against the same warm box.
- If you changed dependency manifests (package.json, etc.), prepend
the install command:
blacksmith testbox run --id <ID> "npm install && npm test"
- If a narrow PR reports a full sync or the box was reused/expired, sanity
check the remote copy before a slow gate:
pnpm testbox:run --id <ID> -- "pnpm testbox:sanity"
.
If it reports missing root files or mass tracked deletions, stop the box and
warm a fresh one. Use OPENCLAW_TESTBOX_ALLOW_MASS_DELETIONS=1
only for an
intentional large deletion PR.
- If you need artifacts (coverage reports, build outputs, etc.), download them:
blacksmith testbox download --id <ID> coverage/ ./coverage/
- Once green, commit and push.
OpenClaw full test suite
For OpenClaw, use the repo package manager and the measured stable full-suite
profile below. It keeps six Vitest project shards active while limiting each
shard to one worker to avoid worker OOMs on Testbox:
blacksmith testbox run --id <ID> "env NODE_OPTIONS=--max-old-space-size=4096 OPENCLAW_TEST_PROJECTS_PARALLEL=6 OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test"
Observed full-suite time on Blacksmith Testbox is about 3-4 minutes:
- 173-180s on a warmed box
- 219s on a fresh 32-vCPU box
When validating before commit/push in maintainer Testbox mode, run
inside the warmed box first when appropriate, then the full
suite with the profile above if broad confidence is needed.
Run
inside the warmed box before the broad command when
the sync looks suspicious. It checks that root files such as
still exist and fails on 200 or more tracked deletions. That catches stale or
corrupted rsync state before dependency install or Vitest failures hide the real
problem.
Examples
blacksmith testbox warmup ci-check-testbox.yml
# → tbx_01jkz5b3t9...
# Run tests
blacksmith testbox run --id <ID> "npm test -- --testPathPattern=handler.test"
blacksmith testbox run --id <ID> "go test ./pkg/api/... -run TestHandler -v"
blacksmith testbox run --id <ID> "python -m pytest tests/test_api.py -k test_auth"
# Re-install deps after changing package.json, then test
blacksmith testbox run --id <ID> "npm install && npm test"
# Build and test
blacksmith testbox run --id <ID> "npm run build && npm test"
# Download artifacts from the testbox
blacksmith testbox download --id <ID> coverage/lcov-report/ ./coverage/
blacksmith testbox download --id <ID> build/output.tar.gz
Waiting for the testbox to be ready
The
command automatically waits for the testbox, so explicit waiting is
usually unnecessary. If you do need to check readiness separately (e.g. before
a series of runs), use the
flag. Do NOT use a sleep-and-recheck loop.
Correct: block until ready with a timeout:
blacksmith testbox status --id <ID> --wait [--wait-timeout 5m]
Wrong: never use sleep + status in a loop:
# BAD — do not do this
sleep 30 && blacksmith testbox status --id <ID>
while ! blacksmith testbox status --id <ID> | grep ready; do sleep 5; done
polls the status and exits as soon as the testbox is ready (or when the
timeout is reached). Default timeout is 5m; use
for longer
(e.g.
,
).
Managing testboxes
# Check status of a specific testbox
blacksmith testbox status --id <ID>
# List all active testboxes for the current repo
blacksmith testbox list
# Stop a testbox when you're done (frees resources)
blacksmith testbox stop --id <ID>
Testboxes automatically shut down after being idle (default: 30 minutes).
If you need a longer session, increase the timeout at warmup time. For OpenClaw
maintainer work, use 90 minutes for long-running sessions:
blacksmith testbox warmup ci-check-testbox.yml --idle-timeout 90
blacksmith testbox warmup ci-build-artifacts-testbox.yml --idle-timeout 90
With options
blacksmith testbox warmup ci-check-testbox.yml --ref main
blacksmith testbox warmup ci-check-testbox.yml --idle-timeout 90
blacksmith testbox run --id <ID> "go test ./..."