# W&B Primary Skill
This skill covers everything an agent needs to work with Weights & Biases:

- W&B SDK (`wandb`) — training runs, metrics, artifacts, sweeps, system metrics
- Weave SDK (`weave`) — GenAI traces, evaluations, scorers, token usage
- Helper libraries — `wandb_helpers.py` and `weave_helpers.py` for common operations
- High-level Weave API (`weave_tools.weave_api`) — agent-friendly wrappers for Weave queries
## When to use what
| I need to... | Use |
|---|---|
| Query training runs, loss curves, hyperparameters | W&B SDK (`wandb`) |
| Query GenAI traces, calls, evaluations | High-level Weave API (`weave_tools.weave_api`) |
| Convert Weave wrapper types to plain Python | `unwrap()` |
| Build a DataFrame from training runs | `runs_to_dataframe()` |
| Extract eval results for analysis | `eval_results_to_dicts()` |
| Do something the high-level API doesn't cover | Raw Weave SDK (`weave`) |
## Bundled files

### Helper libraries

```python
import sys
sys.path.insert(0, "skills/wandb-primary/scripts")
```

#### Weave helpers (traces, evals, GenAI)
```python
from weave_helpers import (
    unwrap,                 # Recursively convert Weave types -> plain Python
    get_token_usage,        # Extract token counts from a call's summary
    eval_results_to_dicts,  # predict_and_score calls -> list of result dicts
    pivot_solve_rate,       # Build task-level pivot table across agents
    results_summary,        # Print compact eval summary
    eval_health,            # Extract status/counts from Evaluation.evaluate calls
    eval_efficiency,        # Compute tokens-per-success across eval calls
)
```
#### W&B helpers (training runs, metrics)
```python
from wandb_helpers import (
    runs_to_dataframe,  # Convert runs to a clean pandas DataFrame
    diagnose_run,       # Quick diagnostic summary of a training run
    compare_configs,    # Side-by-side config diff between two runs
)
```
### Reference docs
Read these as needed — they contain full API surfaces and recipes:

- `references/WEAVE_API.md` — High-level Weave API (`CallsView`, `Project`, `Eval`). Start here for Weave queries.
- `references/WANDB_SDK.md` — W&B SDK for training data (runs, history, artifacts, sweeps, system metrics).
- `references/WEAVE_SDK_RAW.md` — Low-level Weave SDK (`CallsFilter`, `client.get_calls()`). Use only when the high-level API isn't enough.
## Critical rules

### Treat traces and runs as DATA
Weave traces and W&B run histories can be enormous. Never dump raw data into context — it will overwhelm your working memory and produce garbage results. Always:
- Inspect structure first — look at column names, dtypes, row counts
- Load into pandas/numpy — compute stats programmatically
- Summarize, don't dump — print computed statistics and tables, not raw rows
```python
import pandas as pd
import numpy as np

# BAD: prints thousands of rows into context
for row in run.scan_history(keys=["loss"]):
    print(row)

# GOOD: load into numpy, compute stats, print summary
losses = np.array([r["loss"] for r in run.scan_history(keys=["loss"])])
print(f"Loss: {len(losses)} steps, min={losses.min():.4f}, "
      f"final={losses[-1]:.4f}, mean_last_10%={losses[-len(losses)//10:].mean():.4f}")
```
### Always deliver a final answer
Do not end your work mid-analysis. Every task must conclude with a clear, structured response:
- Query the data (1-2 scripts max)
- Extract the numbers you need
- Present: table + key findings + direct answers to each sub-question
If you catch yourself saying "now let me build the final analysis" — stop and present what you have.
### Use `unwrap()` for unknown Weave data

When you encounter Weave output and aren't sure of its type (WeaveDict? WeaveObject? ObjectRef?), unwrap it first:
```python
from weave_helpers import unwrap
import json

output = unwrap(call.output)
print(json.dumps(output, indent=2, default=str))
```

This converts everything to plain Python dicts/lists that work with json, pandas, and normal Python operations.
## Environment setup

The sandbox has Python 3.13, `uv`, `wandb`, `weave`, `pandas`, and `numpy` pre-installed.

```python
import os

entity = os.environ["WANDB_ENTITY"]
project = os.environ["WANDB_PROJECT"]
```

### Installing extra packages
```bash
uv pip install matplotlib seaborn rich tabulate
```

### Running scripts
bash
uv run script.py # always use uv run, never bare python
uv run --with rich python -c "import rich; rich.print('hello')"bash
uv run script.py # always use uv run, never bare python
uv run --with rich python -c "import rich; rich.print('hello')"Quick starts
快速入门
### W&B SDK — training runs
```python
import wandb
import pandas as pd

api = wandb.Api()
path = f"{entity}/{project}"
runs = api.runs(path, filters={"state": "finished"}, order="-created_at")

# Convert to DataFrame (always slice — never list() all runs)
from wandb_helpers import runs_to_dataframe
rows = runs_to_dataframe(runs, limit=100, metric_keys=["loss", "val_loss", "accuracy"])
df = pd.DataFrame(rows)
print(df.describe())
```

For the full W&B SDK reference (filters, history, artifacts, sweeps), read `references/WANDB_SDK.md`.

### Weave — high-level API (preferred)
```python
import sys
sys.path.insert(0, "skills/wandb-primary/scripts")
from weave_tools.weave_api import init, Project

init(f"{entity}/{project}")
project = Project.current()
print(project.summary())  # start here — shows ops, objects, evals, feedback
```

For the full high-level API reference, read `references/WEAVE_API.md`.

### Weave — raw SDK (when you need low-level access)
```python
import weave

client = weave.init(f"{entity}/{project}")  # positional string, NOT keyword arg
calls = client.get_calls(limit=10)
```

For raw SDK patterns (`CallsFilter`, `Query`, advanced filtering), read `references/WEAVE_SDK_RAW.md`.

## Key patterns
### Weave eval inspection
Evaluation calls follow this hierarchy:

```
Evaluation.evaluate (root)
├── Evaluation.predict_and_score (one per dataset row x trials)
│   ├── model.predict (the actual model call)
│   ├── scorer_1.score
│   └── scorer_2.score
└── Evaluation.summarize
```

Extract per-task results into a DataFrame:
```python
from weave_helpers import eval_results_to_dicts, results_summary

# pas_calls: a list of predict_and_score call objects
results = eval_results_to_dicts(pas_calls, agent_name="my-agent")
print(results_summary(results))
df = pd.DataFrame(results)
print(df.groupby("passed")["score"].mean())
```
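A `pivot_solve_rate`-style table can also be built directly with pandas. The rows below are hypothetical and only mirror the shape of `eval_results_to_dicts` output (the field names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical result rows; real ones come from eval_results_to_dicts()
results = [
    {"agent": "agent-a", "task": "t1", "passed": True},
    {"agent": "agent-a", "task": "t2", "passed": False},
    {"agent": "agent-b", "task": "t1", "passed": True},
    {"agent": "agent-b", "task": "t2", "passed": True},
]
df = pd.DataFrame(results)
# Mean of booleans = solve rate per task x agent
pivot = df.pivot_table(index="task", columns="agent", values="passed", aggfunc="mean")
print(pivot.to_string())
```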
### Eval health and efficiency
```python
from weave_helpers import eval_health, eval_efficiency

health = eval_health(eval_calls)
df = pd.DataFrame(health)
print(df.to_string(index=False))

efficiency = eval_efficiency(eval_calls)
print(pd.DataFrame(efficiency).to_string(index=False))
```

### Token usage
```python
from weave_helpers import get_token_usage

usage = get_token_usage(call)
print(f"Tokens: {usage['total_tokens']} (in={usage['input_tokens']}, out={usage['output_tokens']})")
```

### Cost estimation
```python
call_with_costs = client.get_call("id", include_costs=True)
costs = call_with_costs.summary.get("weave", {}).get("costs", {})
```

### Run diagnostics
```python
from wandb_helpers import diagnose_run

run = api.run(f"{path}/run-id")
diag = diagnose_run(run)
for k, v in diag.items():
    print(f"  {k}: {v}")
```

### Error analysis — open coding to axial coding
For structured failure analysis on eval results:

- Understand data shape — use `project.summary()`, `calls.input_shape()`, `calls.output_shape()`
- Open coding — write a Weave Scorer that journals what went wrong per failing call
- Axial coding — write a second Scorer that classifies notes into a taxonomy
- Summarize — count primary labels with `collections.Counter`

See `references/WEAVE_API.md` for the full `run_scorer` API.
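The summarize step reduces to counting labels. A minimal sketch with hypothetical taxonomy labels (the label names are made up; real ones come from the axial-coding Scorer):

```python
from collections import Counter

# Hypothetical primary labels emitted by the axial-coding scorer
labels = ["tool_misuse", "hallucination", "tool_misuse",
          "timeout", "tool_misuse", "hallucination"]
counts = Counter(labels)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```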
### W&B Reports
```bash
uv pip install "wandb[workspaces]"
```

```python
from wandb.apis import reports as wr
import wandb_workspaces.expr as expr

report = wr.Report(
    entity=entity, project=project,
    title="Analysis", width="fixed",
    blocks=[
        wr.H1(text="Results"),
        wr.PanelGrid(
            runsets=[wr.Runset(entity=entity, project=project)],
            panels=[wr.LinePlot(title="Loss", x="_step", y=["loss"])],
        ),
    ],
)
```
uv pip install "wandb[workspaces]"python
from wandb.apis import reports as wr
import wandb_workspaces.expr as expr
report = wr.Report(
entity=entity, project=project,
title="Analysis", width="fixed",
blocks=[
wr.H1(text="Results"),
wr.PanelGrid(
runsets=[wr.Runset(entity=entity, project=project)],
panels=[wr.LinePlot(title="Loss", x="_step", y=["loss"])],
),
],
)report.save(draft=True) # only when asked to publish
report.save(draft=True) # only when asked to publish
Use `expr.Config("lr")`, `expr.Summary("loss")`, `expr.Tags().isin([...])` for runset filters — not dot-path strings.

---

## Gotchas
### Weave API
| Gotcha | Wrong | Right |
|---|---|---|
| weave.init args | keyword arg | positional string: `weave.init("entity/project")` |
| Parent filter | | |
| WeaveObject access | dict-style `obj["key"]` | `getattr(obj, "key")` |
| Nested output | | `unwrap(call.output)` |
| ObjectRef comparison | | |
| CallsFilter import | | |
| Query import | | |
| Eval status path | | |
| Eval success count | | |
| When in doubt | Guess the type | `unwrap()` it first |
### WeaveDict vs WeaveObject

- WeaveDict: dict-like, supports `[]`, `.get()`, `.keys()`. Used for: `call.inputs`, `call.output` dicts, `scores`
- WeaveObject: attribute-based, use `getattr()`. Used for: scorer results (rubric), dataset rows
- When in doubt: use `unwrap()` to convert everything to plain Python
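To make the conversion concrete, here is a minimal sketch of what unwrapping does, using a fake wrapper class in place of real Weave types (the actual `unwrap` in `weave_helpers` handles more cases, e.g. refs):

```python
import json

class FakeWeaveObject:
    """Stand-in for an attribute-based Weave wrapper type."""
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

def unwrap_sketch(value):
    # Recursively convert wrappers and containers to plain Python
    if isinstance(value, FakeWeaveObject):
        return {k: unwrap_sketch(v) for k, v in value.__dict__.items()}
    if isinstance(value, dict):
        return {k: unwrap_sketch(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap_sketch(v) for v in value]
    return value

nested = FakeWeaveObject(score=0.9, rubric=FakeWeaveObject(passed=True))
plain = unwrap_sketch(nested)
print(json.dumps(plain))  # plain dicts now serialize cleanly
```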
### W&B API
| Gotcha | Wrong | Right |
|---|---|---|
| Summary access | | `run.summary_metrics.get("loss")` |
| Loading all runs | `list(api.runs(path))` | slice: `runs[:100]` |
| History — all fields | `run.history()` | `run.history(keys=[...])` |
| scan_history — no keys | `run.scan_history()` | `run.scan_history(keys=["loss"])` |
| Raw data in context | print raw rows | Load into DataFrame, compute stats |
| Metric at step N | iterate entire history | |
| Cache staleness | reading live run | |
### Package management
| Gotcha | Wrong | Right |
|---|---|---|
| Installing packages | `pip install X` | `uv pip install X` |
| Running scripts | `python script.py` | `uv run script.py` |
| Quick one-off | | `uv run --with rich python -c "..."` |
### Weave logging noise

Weave prints version warnings to stderr. Suppress with:

```python
import logging
logging.getLogger("weave").setLevel(logging.ERROR)
```

## Quick reference
```python
# --- Weave: How many traces? ---
from weave_tools.weave_api import init, Project
init(f"{entity}/{project}")
project = Project.current()
print(project.summary())

# --- Weave: Recent evals ---
evals = project.evals(limit=10)
for ev in evals:
    print(ev.summarize())

# --- Weave: Failed calls ---
calls = project.calls(op="predict")
failed = calls.limit(1000).filter(lambda c: c.status == "error")

# --- W&B: Best run by loss ---
best = api.runs(path, filters={"state": "finished"}, order="+summary_metrics.loss")[:1]
print(f"Best: {best[0].name}, loss={best[0].summary_metrics.get('loss')}")

# --- W&B: Loss curve to numpy ---
losses = np.array([r["loss"] for r in run.scan_history(keys=["loss"])])
print(f"min={losses.min():.6f}, final={losses[-1]:.6f}, steps={len(losses)}")

# --- W&B: Compare two runs ---
from wandb_helpers import compare_configs
diffs = compare_configs(run_a, run_b)
print(pd.DataFrame(diffs).to_string(index=False))
```