wandb-primary


Comprehensive primary skill for agents working with Weights & Biases. Covers both the W&B SDK (training runs, metrics, artifacts, sweeps) and the Weave SDK (GenAI traces, evaluations, scorers). Includes helper libraries, gotcha tables, and data analysis patterns. Use this skill whenever the user asks about W&B runs, Weave traces, evaluations, training metrics, loss curves, model comparisons, or any Weights & Biases data — even if they don't say "W&B" explicitly.


NPX Install

```bash
npx skill4agent add wandb/skills wandb-primary
```


W&B Primary Skill

This skill covers everything an agent needs to work with Weights & Biases:
  • W&B SDK (`wandb`) — training runs, metrics, artifacts, sweeps, system metrics
  • Weave SDK (`weave`) — GenAI traces, evaluations, scorers, token usage
  • Helper libraries `wandb_helpers.py` and `weave_helpers.py` for common operations
  • High-level Weave API (`weave_tools.weave_api`) — agent-friendly wrappers for Weave queries

When to use what

| I need to... | Use |
|---|---|
| Query training runs, loss curves, hyperparameters | W&B SDK (`wandb.Api()`) — see `references/WANDB_SDK.md` |
| Query GenAI traces, calls, evaluations | High-level Weave API (`weave_tools.weave_api`) — see `references/WEAVE_API.md` |
| Convert Weave wrapper types to plain Python | `weave_helpers.unwrap()` |
| Build a DataFrame from training runs | `wandb_helpers.runs_to_dataframe()` |
| Extract eval results for analysis | `weave_helpers.eval_results_to_dicts()` |
| Do something the high-level API doesn't cover | Raw Weave SDK (`weave.init()`, `client.get_calls()`) — see `references/WEAVE_SDK_RAW.md` |

Bundled files

Helper libraries

```python
import sys
sys.path.insert(0, "skills/wandb-primary/scripts")

# Weave helpers (traces, evals, GenAI)
from weave_helpers import (
    unwrap,                  # Recursively convert Weave types -> plain Python
    get_token_usage,         # Extract token counts from a call's summary
    eval_results_to_dicts,   # predict_and_score calls -> list of result dicts
    pivot_solve_rate,        # Build task-level pivot table across agents
    results_summary,         # Print compact eval summary
    eval_health,             # Extract status/counts from Evaluation.evaluate calls
    eval_efficiency,         # Compute tokens-per-success across eval calls
)

# W&B helpers (training runs, metrics)
from wandb_helpers import (
    runs_to_dataframe,       # Convert runs to a clean pandas DataFrame
    diagnose_run,            # Quick diagnostic summary of a training run
    compare_configs,         # Side-by-side config diff between two runs
)
```

Reference docs

Read these as needed — they contain full API surfaces and recipes:
  • references/WEAVE_API.md
    — High-level Weave API (
    Project
    ,
    Eval
    ,
    CallsView
    ). Start here for Weave queries.
  • references/WANDB_SDK.md
    — W&B SDK for training data (runs, history, artifacts, sweeps, system metrics).
  • references/WEAVE_SDK_RAW.md
    — Low-level Weave SDK (
    client.get_calls()
    ,
    CallsFilter
    ). Use only when the high-level API isn't enough.

Critical rules

Treat traces and runs as DATA

Weave traces and W&B run histories can be enormous. Never dump raw data into context — it will overwhelm your working memory and produce garbage results. Always:
  1. Inspect structure first — look at column names, dtypes, row counts
  2. Load into pandas/numpy — compute stats programmatically
  3. Summarize, don't dump — print computed statistics and tables, not raw rows
```python
import pandas as pd
import numpy as np

# BAD: prints thousands of rows into context
for row in run.scan_history(keys=["loss"]):
    print(row)

# GOOD: load into numpy, compute stats, print summary
losses = np.array([r["loss"] for r in run.scan_history(keys=["loss"])])
print(f"Loss: {len(losses)} steps, min={losses.min():.4f}, "
      f"final={losses[-1]:.4f}, mean_last_10%={losses[-len(losses)//10:].mean():.4f}")
```

Always deliver a final answer

Do not end your work mid-analysis. Every task must conclude with a clear, structured response:
  1. Query the data (1-2 scripts max)
  2. Extract the numbers you need
  3. Present: table + key findings + direct answers to each sub-question
If you catch yourself saying "now let me build the final analysis" — stop and present what you have.
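A minimal sketch of step 3, assuming `df` already holds your extracted results — the `agent` and `score` columns are hypothetical placeholders for whatever you extracted:

```python
# Hypothetical columns: one row per eval result with an agent name and score.
summary = df.groupby("agent")["score"].agg(["mean", "count"])
print(summary.to_string())

# End with direct answers, not another round of analysis.
print(f"Direct answer: highest mean score is {summary['mean'].idxmax()}")
```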

Use `unwrap()` for unknown Weave data

When you encounter Weave output and aren't sure of its type (WeaveDict? WeaveObject? ObjectRef?), unwrap it first:
```python
from weave_helpers import unwrap
import json

output = unwrap(call.output)
print(json.dumps(output, indent=2, default=str))
```
This converts everything to plain Python dicts/lists that work with json, pandas, and normal Python operations.

Environment setup

The sandbox has Python 3.13, `uv`, `wandb`, `weave`, `pandas`, and `numpy` pre-installed.

```python
import os
entity  = os.environ["WANDB_ENTITY"]
project = os.environ["WANDB_PROJECT"]
```

Installing extra packages

```bash
uv pip install matplotlib seaborn rich tabulate
```

Running scripts

```bash
uv run script.py          # always use uv run, never bare python
uv run --with rich python -c "import rich; rich.print('hello')"
```

Quick starts

W&B SDK — training runs

```python
import wandb
import pandas as pd
api = wandb.Api()

path = f"{entity}/{project}"
runs = api.runs(path, filters={"state": "finished"}, order="-created_at")

# Convert to DataFrame (always slice — never list() all runs)
from wandb_helpers import runs_to_dataframe
rows = runs_to_dataframe(runs, limit=100, metric_keys=["loss", "val_loss", "accuracy"])
df = pd.DataFrame(rows)
print(df.describe())
```

For the full W&B SDK reference (filters, history, artifacts, sweeps), read `references/WANDB_SDK.md`.

Weave — high-level API (preferred)

```python
import sys
sys.path.insert(0, "skills/wandb-primary/scripts")
from weave_tools.weave_api import init, Project

init(f"{entity}/{project}")
project = Project.current()
print(project.summary())  # start here — shows ops, objects, evals, feedback
```

For the full high-level API reference, read `references/WEAVE_API.md`.

Weave — raw SDK (when you need low-level access)

```python
import weave
client = weave.init(f"{entity}/{project}")  # positional string, NOT keyword arg
calls = client.get_calls(limit=10)
```

For raw SDK patterns (`CallsFilter`, `Query`, advanced filtering), read `references/WEAVE_SDK_RAW.md`.

Key patterns

Weave eval inspection

Evaluation calls follow this hierarchy:
```
Evaluation.evaluate (root)
  ├── Evaluation.predict_and_score (one per dataset row x trials)
  │     ├── model.predict (the actual model call)
  │     ├── scorer_1.score
  │     └── scorer_2.score
  └── Evaluation.summarize
```
Extract per-task results into a DataFrame:
```python
from weave_helpers import eval_results_to_dicts, results_summary

# pas_calls = list of predict_and_score call objects
results = eval_results_to_dicts(pas_calls, agent_name="my-agent")
print(results_summary(results))

df = pd.DataFrame(results)
print(df.groupby("passed")["score"].mean())
```

Eval health and efficiency

```python
from weave_helpers import eval_health, eval_efficiency

health = eval_health(eval_calls)
df = pd.DataFrame(health)
print(df.to_string(index=False))

efficiency = eval_efficiency(eval_calls)
print(pd.DataFrame(efficiency).to_string(index=False))
```

Token usage

```python
from weave_helpers import get_token_usage

usage = get_token_usage(call)
print(f"Tokens: {usage['total_tokens']} (in={usage['input_tokens']}, out={usage['output_tokens']})")
```

Cost estimation

```python
call_with_costs = client.get_call("id", include_costs=True)
costs = call_with_costs.summary.get("weave", {}).get("costs", {})
```
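The exact shape of the `costs` dict varies with the Weave version, so inspect one entry before aggregating anything — a minimal sketch, reusing `unwrap` from the bundled helpers:

```python
import json
from weave_helpers import unwrap

# Print each per-model cost entry to learn its exact keys before summing.
for model, entry in unwrap(costs).items():
    print(model, json.dumps(entry, indent=2, default=str))
```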

Run diagnostics

```python
from wandb_helpers import diagnose_run

run = api.run(f"{path}/run-id")
diag = diagnose_run(run)
for k, v in diag.items():
    print(f"  {k}: {v}")
```

Error analysis — open coding to axial coding

For structured failure analysis on eval results:
  1. Understand data shape — use `project.summary()`, `calls.input_shape()`, `calls.output_shape()`
  2. Open coding — write a Weave Scorer that journals what went wrong per failing call
  3. Axial coding — write a second Scorer that classifies the notes into a taxonomy
  4. Summarize — count primary labels with `collections.Counter` (see the sketch below)
See `references/WEAVE_API.md` for the full `run_scorer` API.
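A minimal sketch of step 4, assuming the axial-coding pass has already produced a list of per-call label dicts — `failure_labels` and its `"label"` key are hypothetical stand-ins for whatever your scorer emits:

```python
from collections import Counter

# failure_labels: hypothetical axial-coding output,
# e.g. [{"call_id": "...", "label": "tool_timeout"}, ...]
counts = Counter(item["label"] for item in failure_labels)
for label, n in counts.most_common():
    print(f"{label:<30} {n}")
```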

W&B Reports

```bash
uv pip install "wandb[workspaces]"
```
```python
from wandb.apis import reports as wr
import wandb_workspaces.expr as expr

report = wr.Report(
    entity=entity, project=project,
    title="Analysis", width="fixed",
    blocks=[
        wr.H1(text="Results"),
        wr.PanelGrid(
            runsets=[wr.Runset(entity=entity, project=project)],
            panels=[wr.LinePlot(title="Loss", x="_step", y=["loss"])],
        ),
    ],
)
# report.save(draft=True)  # only when asked to publish
```

Use `expr.Config("lr")`, `expr.Summary("loss")`, `expr.Tags().isin([...])` for runset filters — not dot-path strings.

Gotchas

Weave API

| Gotcha | Wrong | Right |
|---|---|---|
| `weave.init` args | `weave.init(project="x")` | `weave.init("x")` (positional) |
| Parent filter | `filter={'parent_id': 'x'}` | `filter={'parent_ids': ['x']}` (plural, list) |
| WeaveObject access | `rubric.get('passed')` | `getattr(rubric, 'passed', None)` |
| Nested output | `out.get('succeeded')` | `out.get('output').get('succeeded')` (output.output) |
| ObjectRef comparison | `name_ref == "foo"` | `str(name_ref) == "foo"` |
| CallsFilter import | `from weave import CallsFilter` | `from weave.trace.weave_client import CallsFilter` |
| Query import | `from weave import Query` | `from weave.trace_server.interface.query import Query` |
| Eval status path | `summary["status"]` | `summary["weave"]["status"]` |
| Eval success count | `summary["success_count"]` | `summary["weave"]["status_counts"]["success"]` |
| When in doubt | Guess the type | `unwrap()` first, then inspect |

WeaveDict vs WeaveObject

  • WeaveDict: dict-like, supports `.get()`, `.keys()`, `[]`. Used for: `call.inputs`, `call.output`, the `scores` dict
  • WeaveObject: attribute-based, use `getattr()`. Used for: scorer results (rubric), dataset rows
  • When in doubt: use `unwrap()` to convert everything to plain Python (see the sketch below)
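A minimal sketch of the two access patterns, assuming `call` is a Weave call object and `rubric` a scorer result — the `passed` field is a hypothetical example:

```python
from weave_helpers import unwrap

def inspect_weave_values(call, rubric):
    # WeaveDict: dict-style access works directly
    print(call.inputs.keys())

    # WeaveObject: attribute access only — .get() would raise AttributeError
    print(getattr(rubric, "passed", None))  # `passed` is hypothetical

    # When unsure, normalize to plain Python first
    print(unwrap(call.output))
```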

W&B API

| Gotcha | Wrong | Right |
|---|---|---|
| Summary access | `run.summary["loss"]` | `run.summary_metrics.get("loss")` |
| Loading all runs | `list(api.runs(...))` | `runs[:200]` (always slice) |
| History — all fields | `run.history()` | `run.history(samples=500, keys=["loss"])` |
| scan_history — no keys | `scan_history()` | `scan_history(keys=["loss"])` (explicit) |
| Raw data in context | `print(run.history())` | Load into a DataFrame, compute stats |
| Metric at step N | Iterate the entire history | `scan_history(keys=["loss"], min_step=N, max_step=N+1)` (sketch below) |
| Cache staleness | Reading a live run directly | `api.flush()` first |
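A short sketch combining the last two rows — read a single metric value at step `N` from a still-running run (`N` is a placeholder step index):

```python
# Flush the client-side cache so a live run returns fresh data.
api.flush()

# min_step/max_step bound the scan to exactly one step.
rows = list(run.scan_history(keys=["loss"], min_step=N, max_step=N + 1))
print(rows[0]["loss"] if rows else f"no loss logged at step {N}")
```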

Package management

| Gotcha | Wrong | Right |
|---|---|---|
| Installing packages | `pip install pandas` | `uv pip install pandas` |
| Running scripts | `python script.py` | `uv run script.py` |
| Quick one-off | `pip install rich && python -c ...` | `uv run --with rich python -c ...` |

Weave logging noise

Weave prints version warnings to stderr. Suppress with:
```python
import logging
logging.getLogger("weave").setLevel(logging.ERROR)
```

Quick reference

```python
# --- Weave: How many traces? ---
from weave_tools.weave_api import init, Project
init(f"{entity}/{project}")
project = Project.current()
print(project.summary())

# --- Weave: Recent evals ---
evals = project.evals(limit=10)
for ev in evals:
    print(ev.summarize())

# --- Weave: Failed calls ---
calls = project.calls(op="predict")
failed = calls.limit(1000).filter(lambda c: c.status == "error")

# --- W&B: Best run by loss ---
best = api.runs(path, filters={"state": "finished"}, order="+summary_metrics.loss")[:1]
print(f"Best: {best[0].name}, loss={best[0].summary_metrics.get('loss')}")

# --- W&B: Loss curve to numpy ---
losses = np.array([r["loss"] for r in run.scan_history(keys=["loss"])])
print(f"min={losses.min():.6f}, final={losses[-1]:.6f}, steps={len(losses)}")

# --- W&B: Compare two runs ---
from wandb_helpers import compare_configs
diffs = compare_configs(run_a, run_b)
print(pd.DataFrame(diffs).to_string(index=False))
```