wandb-primary
Comprehensive primary skill for agents working with Weights & Biases. Covers both the W&B SDK (training runs, metrics, artifacts, sweeps) and the Weave SDK (GenAI traces, evaluations, scorers). Includes helper libraries, gotcha tables, and data analysis patterns. Use this skill whenever the user asks about W&B runs, Weave traces, evaluations, training metrics, loss curves, model comparisons, or any Weights & Biases data — even if they don't say "W&B" explicitly.
Install: `npx skill4agent add wandb/skills wandb-primary`
# W&B Primary Skill
This skill covers everything an agent needs to work with Weights & Biases:
- W&B SDK (`wandb`) — training runs, metrics, artifacts, sweeps, system metrics
- Weave SDK (`weave`) — GenAI traces, evaluations, scorers, token usage
- Helper libraries — `wandb_helpers.py` and `weave_helpers.py` for common operations
- High-level Weave API (`weave_tools.weave_api`) — agent-friendly wrappers for Weave queries
## When to use what
| I need to... | Use |
|---|---|
| Query training runs, loss curves, hyperparameters | W&B SDK (`wandb.Api()`) |
| Query GenAI traces, calls, evaluations | High-level Weave API (`weave_tools.weave_api`) |
| Convert Weave wrapper types to plain Python | `unwrap()` |
| Build a DataFrame from training runs | `runs_to_dataframe()` |
| Extract eval results for analysis | `eval_results_to_dicts()` |
| Do something the high-level API doesn't cover | Raw Weave SDK (`client.get_calls()`) |
## Bundled files

### Helper libraries

```python
import sys
sys.path.insert(0, "skills/wandb-primary/scripts")
# Weave helpers (traces, evals, GenAI)
from weave_helpers import (
unwrap, # Recursively convert Weave types -> plain Python
get_token_usage, # Extract token counts from a call's summary
eval_results_to_dicts, # predict_and_score calls -> list of result dicts
pivot_solve_rate, # Build task-level pivot table across agents
results_summary, # Print compact eval summary
eval_health, # Extract status/counts from Evaluation.evaluate calls
eval_efficiency, # Compute tokens-per-success across eval calls
)
# W&B helpers (training runs, metrics)
from wandb_helpers import (
runs_to_dataframe, # Convert runs to a clean pandas DataFrame
diagnose_run, # Quick diagnostic summary of a training run
compare_configs, # Side-by-side config diff between two runs
)
```

### Reference docs
Read these as needed — they contain full API surfaces and recipes:
- `references/WEAVE_API.md` — High-level Weave API (`Project`, `CallsView`, `Eval`). Start here for Weave queries.
- `references/WANDB_SDK.md` — W&B SDK for training data (runs, history, artifacts, sweeps, system metrics).
- `references/WEAVE_SDK_RAW.md` — Low-level Weave SDK (`client.get_calls()`, `CallsFilter`). Use only when the high-level API isn't enough.
## Critical rules

### Treat traces and runs as DATA
Weave traces and W&B run histories can be enormous. Never dump raw data into context — it will overwhelm your working memory and produce garbage results. Always:
- Inspect structure first — look at column names, dtypes, row counts
- Load into pandas/numpy — compute stats programmatically
- Summarize, don't dump — print computed statistics and tables, not raw rows
```python
import pandas as pd
import numpy as np
# BAD: prints thousands of rows into context
for row in run.scan_history(keys=["loss"]):
print(row)
# GOOD: load into numpy, compute stats, print summary
losses = np.array([r["loss"] for r in run.scan_history(keys=["loss"])])
print(f"Loss: {len(losses)} steps, min={losses.min():.4f}, "
f"final={losses[-1]:.4f}, mean_last_10%={losses[-len(losses)//10:].mean():.4f}")Always deliver a final answer
Do not end your work mid-analysis. Every task must conclude with a clear, structured response:
- Query the data (1-2 scripts max)
- Extract the numbers you need
- Present: table + key findings + direct answers to each sub-question
If you catch yourself saying "now let me build the final analysis" — stop and present what you have.
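For example, a closing summary might look like this (a sketch only — it assumes you already built a `df` with `agent` and `score` columns during the analysis):

```python
# Present: one compact table plus direct findings, not raw rows.
summary = df.groupby("agent")["score"].agg(["count", "mean"]).round(3)
print(summary.to_string())
best = summary["mean"].idxmax()
print(f"Key finding: {best} scores highest "
      f"(mean {summary.loc[best, 'mean']:.3f} over {int(summary.loc[best, 'count'])} tasks).")
```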
### Use `unwrap()` for unknown Weave data

When you encounter Weave output and aren't sure of its type (WeaveDict? WeaveObject? ObjectRef?), unwrap it first:
```python
from weave_helpers import unwrap
import json
output = unwrap(call.output)
print(json.dumps(output, indent=2, default=str))
```

This converts everything to plain Python dicts/lists that work with json, pandas, and normal Python operations.
## Environment setup

The sandbox has Python 3.13, `uv`, `wandb`, `weave`, `pandas`, and `numpy` pre-installed.

```python
import os
entity = os.environ["WANDB_ENTITY"]
project = os.environ["WANDB_PROJECT"]
```

### Installing extra packages

```bash
uv pip install matplotlib seaborn rich tabulate
```

### Running scripts

```bash
uv run script.py # always use uv run, never bare python
uv run --with rich python -c "import rich; rich.print('hello')"
```

## Quick starts

### W&B SDK — training runs

```python
import wandb
import pandas as pd
api = wandb.Api()
path = f"{entity}/{project}"
runs = api.runs(path, filters={"state": "finished"}, order="-created_at")
# Convert to DataFrame (always slice — never list() all runs)
from wandb_helpers import runs_to_dataframe
rows = runs_to_dataframe(runs, limit=100, metric_keys=["loss", "val_loss", "accuracy"])
df = pd.DataFrame(rows)
print(df.describe())
```

For full W&B SDK reference (filters, history, artifacts, sweeps), read `references/WANDB_SDK.md`.

### Weave — high-level API (preferred)

```python
import sys
sys.path.insert(0, "skills/wandb-primary/scripts")
from weave_tools.weave_api import init, Project
init(f"{entity}/{project}")
project = Project.current()
print(project.summary())  # start here — shows ops, objects, evals, feedback
```

For full high-level API reference, read `references/WEAVE_API.md`.

### Weave — raw SDK (when you need low-level access)

```python
import weave
client = weave.init(f"{entity}/{project}") # positional string, NOT keyword arg
calls = client.get_calls(limit=10)
```

For raw SDK patterns (CallsFilter, Query, advanced filtering), read `references/WEAVE_SDK_RAW.md`.

## Key patterns

### Weave eval inspection
Evaluation calls follow this hierarchy:
```
Evaluation.evaluate (root)
├── Evaluation.predict_and_score (one per dataset row x trials)
│ ├── model.predict (the actual model call)
│ ├── scorer_1.score
│ └── scorer_2.score
└── Evaluation.summarize
```

Extract per-task results into a DataFrame:

```python
from weave_helpers import eval_results_to_dicts, results_summary
# Fetch the predict_and_score calls via the high-level API (the op name here is
# an assumption — check project.summary() for the exact op names in your project)
pas_calls = list(project.calls(op="Evaluation.predict_and_score").limit(500))
results = eval_results_to_dicts(pas_calls, agent_name="my-agent")
print(results_summary(results))
df = pd.DataFrame(results)
print(df.groupby("passed")["score"].mean())Eval health and efficiency
python
from weave_helpers import eval_health, eval_efficiency
health = eval_health(eval_calls)
df = pd.DataFrame(health)
print(df.to_string(index=False))
efficiency = eval_efficiency(eval_calls)
print(pd.DataFrame(efficiency).to_string(index=False))
```

### Token usage

```python
from weave_helpers import get_token_usage
usage = get_token_usage(call)
print(f"Tokens: {usage['total_tokens']} (in={usage['input_tokens']}, out={usage['output_tokens']})")Cost estimation
python
call_with_costs = client.get_call("id", include_costs=True)
costs = call_with_costs.summary.get("weave", {}).get("costs", {})
```

### Run diagnostics

```python
from wandb_helpers import diagnose_run
run = api.run(f"{path}/run-id")
diag = diagnose_run(run)
for k, v in diag.items():
print(f" {k}: {v}")Error analysis — open coding to axial coding
For structured failure analysis on eval results:
- Understand data shape — use `project.summary()`, `calls.input_shape()`, `calls.output_shape()`
- Open coding — write a Weave Scorer that journals what went wrong per failing call (see the sketch after this list)
- Axial coding — write a second Scorer that classifies notes into a taxonomy
- Summarize — count primary labels with `collections.Counter`
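A minimal sketch of the open-coding and summarize steps, operating on the dicts from `eval_results_to_dicts()` (the `task_id` and `output` field names are assumptions — inspect one result dict first; a real axial pass would use a second Scorer rather than keyword rules):

```python
from collections import Counter

# Open coding: journal a short free-form note per failing result.
failures = [r for r in results if not r.get("passed")]
notes = {r.get("task_id"): str(r.get("output"))[:200] for r in failures}

# Axial coding: collapse the notes into a small taxonomy.
def classify(note: str) -> str:
    if "Traceback" in note:
        return "runtime-error"
    if "timeout" in note.lower():
        return "timeout"
    return "wrong-answer"

# Summarize: count primary labels.
labels = Counter(classify(n) for n in notes.values())
print(labels.most_common())
```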
See `references/WEAVE_API.md` for the full `run_scorer` API.

### W&B Reports

```bash
uv pip install "wandb[workspaces]"python
from wandb.apis import reports as wr
import wandb_workspaces.expr as expr
report = wr.Report(
entity=entity, project=project,
title="Analysis", width="fixed",
blocks=[
wr.H1(text="Results"),
wr.PanelGrid(
runsets=[wr.Runset(entity=entity, project=project)],
panels=[wr.LinePlot(title="Loss", x="_step", y=["loss"])],
),
],
)
# report.save(draft=True)  # only when asked to publish
```

Use `expr.Config("lr")`, `expr.Summary("loss")`, and `expr.Tags().isin([...])` for runset filters — not dot-path strings.

## Gotchas

### Weave API
| Gotcha | Wrong | Right |
|---|---|---|
| weave.init args | `weave.init(entity=..., project=...)` | `weave.init("entity/project")` — positional string |
| Parent filter | | |
| WeaveObject access | `obj["key"]` | `getattr(obj, "key")` |
| Nested output | index into wrapper types directly | `unwrap(call.output)` first, then index |
| ObjectRef comparison | | |
| CallsFilter import | | |
| Query import | | |
| Eval status path | | |
| Eval success count | | |
| When in doubt | Guess the type | `unwrap()` it |
### WeaveDict vs WeaveObject

- WeaveDict: dict-like, supports `.get()`, `.keys()`, `[]`. Used for: `call.inputs`, `call.output` dicts, `scores`
- WeaveObject: attribute-based, use `getattr()`. Used for: scorer results (rubric), dataset rows
- When in doubt: use `unwrap()` to convert everything to plain Python
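A quick illustration of the two access patterns (a sketch; `call` is any call you have already fetched, and the `rubric` attribute is an assumed example, not a fixed schema):

```python
from weave_helpers import unwrap

prompt = call.inputs.get("prompt")             # WeaveDict: dict-style access works
rubric = getattr(call.output, "rubric", None)  # WeaveObject: attributes only

# When in doubt, flatten to plain Python first
plain = unwrap(call.output)
print(type(plain), plain)
```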
### W&B API

| Gotcha | Wrong | Right |
|---|---|---|
| Summary access | `run.summary["loss"]` (KeyError if missing) | `run.summary.get("loss")` |
| Loading all runs | `list(api.runs(path))` | slice: `runs[:100]` |
| History — all fields | `run.history()` | `run.history(keys=[...])` |
| scan_history — no keys | `run.scan_history()` | `run.scan_history(keys=["loss"])` |
| Raw data in context | `print(row)` in a loop | Load into DataFrame, compute stats |
| Metric at step N | iterate entire history | `run.scan_history(keys=[...], min_step=N, max_step=N+1)` |
| Cache staleness | reading live run | `run.load(force=True)` to refresh |
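For instance, the step-window row above in practice (a sketch; `run` is an `api.run(...)` handle and the step numbers are placeholders):

```python
# Fetch only the rows around step 5000 instead of scanning the full history.
rows = list(run.scan_history(keys=["loss"], min_step=5000, max_step=5010))
if rows:
    print(f"loss at step ~5000: {rows[0]['loss']:.4f}")
```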
### Package management

| Gotcha | Wrong | Right |
|---|---|---|
| Installing packages | `pip install x` | `uv pip install x` |
| Running scripts | `python script.py` | `uv run script.py` |
| Quick one-off | `python -c "..."` | `uv run --with pkg python -c "..."` |
### Weave logging noise
Weave prints version warnings to stderr. Suppress with:
```python
import logging
logging.getLogger("weave").setLevel(logging.ERROR)Quick reference
python
# --- Weave: How many traces? ---
from weave_tools.weave_api import init, Project
init(f"{entity}/{project}")
project = Project.current()
print(project.summary())
# --- Weave: Recent evals ---
evals = project.evals(limit=10)
for ev in evals:
print(ev.summarize())
# --- Weave: Failed calls ---
calls = project.calls(op="predict")
failed = calls.limit(1000).filter(lambda c: c.status == "error")
# --- W&B: Best run by loss ---
best = api.runs(path, filters={"state": "finished"}, order="+summary_metrics.loss")[:1]
print(f"Best: {best[0].name}, loss={best[0].summary_metrics.get('loss')}")
# --- W&B: Loss curve to numpy ---
losses = np.array([r["loss"] for r in run.scan_history(keys=["loss"])])
print(f"min={losses.min():.6f}, final={losses[-1]:.6f}, steps={len(losses)}")
# --- W&B: Compare two runs ---
from wandb_helpers import compare_configs
diffs = compare_configs(run_a, run_b)
print(pd.DataFrame(diffs).to_string(index=False))
```