wandb-primary


W&B Primary Skill

This skill covers everything an agent needs to work with Weights & Biases:
  • W&B SDK (`wandb`): training runs, metrics, artifacts, sweeps, system metrics
  • Weave SDK (`weave`): GenAI traces, evaluations, scorers, token usage
  • Helper libraries `wandb_helpers.py` and `weave_helpers.py` for common operations
  • High-level Weave API (`weave_tools.weave_api`): agent-friendly wrappers for Weave queries

When to use what

| I need to... | Use |
|---|---|
| Query training runs, loss curves, hyperparameters | W&B SDK (`wandb.Api()`); see `references/WANDB_SDK.md` |
| Query GenAI traces, calls, evaluations | High-level Weave API (`weave_tools.weave_api`); see `references/WEAVE_API.md` |
| Convert Weave wrapper types to plain Python | `weave_helpers.unwrap()` |
| Build a DataFrame from training runs | `wandb_helpers.runs_to_dataframe()` |
| Extract eval results for analysis | `weave_helpers.eval_results_to_dicts()` |
| Do something the high-level API doesn't cover | Raw Weave SDK (`weave.init()`, `client.get_calls()`); see `references/WEAVE_SDK_RAW.md` |

Bundled files

Helper libraries

python
import sys
sys.path.insert(0, "skills/wandb-primary/scripts")

# Weave helpers (traces, evals, GenAI)
from weave_helpers import (
    unwrap,                 # Recursively convert Weave types -> plain Python
    get_token_usage,        # Extract token counts from a call's summary
    eval_results_to_dicts,  # predict_and_score calls -> list of result dicts
    pivot_solve_rate,       # Build task-level pivot table across agents
    results_summary,        # Print compact eval summary
    eval_health,            # Extract status/counts from Evaluation.evaluate calls
    eval_efficiency,        # Compute tokens-per-success across eval calls
)

# W&B helpers (training runs, metrics)
from wandb_helpers import (
    runs_to_dataframe,  # Convert runs to a clean pandas DataFrame
    diagnose_run,       # Quick diagnostic summary of a training run
    compare_configs,    # Side-by-side config diff between two runs
)

Reference docs

Read these as needed; they contain full API surfaces and recipes:
  • `references/WEAVE_API.md`: High-level Weave API (`Project`, `Eval`, `CallsView`). Start here for Weave queries.
  • `references/WANDB_SDK.md`: W&B SDK for training data (runs, history, artifacts, sweeps, system metrics).
  • `references/WEAVE_SDK_RAW.md`: Low-level Weave SDK (`client.get_calls()`, `CallsFilter`). Use only when the high-level API isn't enough.

Critical rules

Treat traces and runs as DATA

Weave traces and W&B run histories can be enormous. Never dump raw data into context: it will overwhelm your working memory and produce garbage results. Always:
  1. Inspect structure first: look at column names, dtypes, row counts
  2. Load into pandas/numpy: compute stats programmatically
  3. Summarize, don't dump: print computed statistics and tables, not raw rows
python
import pandas as pd
import numpy as np

# BAD: prints thousands of rows into context
for row in run.scan_history(keys=["loss"]):
    print(row)

# GOOD: load into numpy, compute stats, print summary
losses = np.array([r["loss"] for r in run.scan_history(keys=["loss"])])
print(f"Loss: {len(losses)} steps, min={losses.min():.4f}, "
      f"final={losses[-1]:.4f}, mean_last_10%={losses[-len(losses)//10:].mean():.4f}")
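The "inspect structure first" step can be sketched with the standard library alone. This is a minimal illustration, assuming history rows arrive as a list of plain dicts; `describe_rows` is a hypothetical helper (it is not part of `wandb` or the bundled helper libraries) and the rows are made-up sample data.

```python
from collections import defaultdict


def describe_rows(rows):
    """Summarize a list of dict rows without printing any raw row."""
    columns = defaultdict(lambda: {"count": 0, "types": set()})
    for row in rows:
        for key, value in row.items():
            columns[key]["count"] += 1
            columns[key]["types"].add(type(value).__name__)
    return {
        "n_rows": len(rows),
        "columns": {
            name: {"count": info["count"], "types": sorted(info["types"])}
            for name, info in columns.items()
        },
    }


# Made-up rows standing in for the output of a history scan.
rows = [
    {"_step": 0, "loss": 1.25},
    {"_step": 1, "loss": 0.75},
    {"_step": 2, "loss": 0.5, "val_loss": 0.625},
]
summary = describe_rows(rows)
print(f"rows={summary['n_rows']}")
for name, info in summary["columns"].items():
    print(f"  {name}: present in {info['count']} rows, types={info['types']}")
```

Only the row count, column names, and value types reach the output, so the cost of inspecting a history stays constant regardless of its length.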

Always deliver a final answer

Do not end your work mid-analysis. Every task must conclude with a clear, structured response:
  1. Query the data (1-2 scripts max)
  2. Extract the numbers you need
  3. Present: table + key findings + direct answers to each sub-question
If you catch yourself saying "now let me build the final analysis" — stop and present what you have.
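One way to make the "table + key findings" step cheap is a small formatter that turns result rows into an aligned plain-text table. The sketch below is only an illustration: `format_table` is a hypothetical helper, and the run names and loss values are invented.

```python
def format_table(rows, columns):
    """Render dict rows as an aligned plain-text table."""
    cells = [[str(row.get(col, "")) for col in columns] for row in rows]
    widths = [
        max([len(col)] + [len(line[i]) for line in cells])
        for i, col in enumerate(columns)
    ]
    header = "  ".join(col.ljust(w) for col, w in zip(columns, widths))
    rule = "  ".join("-" * w for w in widths)
    body = ["  ".join(c.ljust(w) for c, w in zip(line, widths)) for line in cells]
    return "\n".join(line.rstrip() for line in [header, rule, *body])


# Invented results used purely to show the output shape.
results = [
    {"run": "baseline", "final_loss": 0.42},
    {"run": "lr-sweep-3", "final_loss": 0.37},
]
report = "\n".join([
    format_table(results, ["run", "final_loss"]),
    "",
    "Key findings:",
    "- lr-sweep-3 has the lower final loss (0.37 vs 0.42).",
])
print(report)
```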

Use `unwrap()` for unknown Weave data

When you encounter Weave output and aren't sure of its type (WeaveDict? WeaveObject? ObjectRef?), unwrap it first:
python
from weave_helpers import unwrap
import json

output = unwrap(call.output)
print(json.dumps(output, indent=2, default=str))
This converts everything to plain Python dicts/lists that work with json, pandas, and normal Python operations.
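To make the idea of recursive unwrapping concrete, here is a rough sketch of how such a conversion could work. It is not the implementation of `weave_helpers.unwrap()`; `to_plain` is a hypothetical function, and `SimpleNamespace` stands in for an attribute-based wrapper because real Weave types are not used here.

```python
import json
from collections.abc import Mapping
from types import SimpleNamespace


def to_plain(value):
    """Recursively convert nested containers and simple objects to plain Python."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, Mapping):
        return {str(k): to_plain(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_plain(v) for v in value]
    if hasattr(value, "__dict__"):
        # Attribute-based object: keep public attributes only.
        return {k: to_plain(v) for k, v in vars(value).items() if not k.startswith("_")}
    return str(value)  # Fallback for anything else, e.g. reference-like objects.


# Stand-in for a wrapped call output.
wrapped = {"output": SimpleNamespace(succeeded=True, scores=(0.5, 1.0)), "n": 2}
plain = to_plain(wrapped)
print(json.dumps(plain, indent=2))
```

After conversion the value contains only dicts, lists, and scalars, which is why it can be passed straight to `json.dumps` or a DataFrame constructor.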

Environment setup

The sandbox has Python 3.13, `uv`, `wandb`, `weave`, `pandas`, and `numpy` pre-installed.
python
import os
entity  = os.environ["WANDB_ENTITY"]
project = os.environ["WANDB_PROJECT"]
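A bare `os.environ[...]` lookup raises a `KeyError` that names the variable but gives no further context. As a possible refinement, the sketch below wraps the lookup in a hypothetical `require_env` helper that fails with a descriptive message. The demo passes an explicit mapping with placeholder values so it runs outside the sandbox too.

```python
import os


def require_env(name, environ=os.environ):
    """Return a required environment variable or raise a descriptive error."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder values; in the sandbox, omit the second argument to read os.environ.
demo_env = {"WANDB_ENTITY": "my-team", "WANDB_PROJECT": "my-project"}
entity = require_env("WANDB_ENTITY", demo_env)
project = require_env("WANDB_PROJECT", demo_env)
path = f"{entity}/{project}"
print(path)
```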

Installing extra packages

bash
uv pip install matplotlib seaborn rich tabulate

Running scripts

bash
uv run script.py          # always use uv run, never bare python
uv run --with rich python -c "import rich; rich.print('hello')"

Quick starts

W&B SDK — training runs

python
import wandb
import pandas as pd
api = wandb.Api()

path = f"{entity}/{project}"
runs = api.runs(path, filters={"state": "finished"}, order="-created_at")

# Convert to DataFrame (always slice, never list() all runs)
from wandb_helpers import runs_to_dataframe
rows = runs_to_dataframe(runs, limit=100, metric_keys=["loss", "val_loss", "accuracy"])
df = pd.DataFrame(rows)
print(df.describe())

For full W&B SDK reference (filters, history, artifacts, sweeps), read `references/WANDB_SDK.md`.

Weave — high-level API (preferred)

python
import sys
sys.path.insert(0, "skills/wandb-primary/scripts")
from weave_tools.weave_api import init, Project

init(f"{entity}/{project}")
project = Project.current()
print(project.summary())  # start here: shows ops, objects, evals, feedback
For full high-level API reference, read `references/WEAVE_API.md`.

Weave — raw SDK (when you need low-level access)

python
import weave
client = weave.init(f"{entity}/{project}")  # positional string, NOT keyword arg
calls = client.get_calls(limit=10)
For raw SDK patterns (CallsFilter, Query, advanced filtering), read `references/WEAVE_SDK_RAW.md`.

Key patterns

Weave eval inspection

Evaluation calls follow this hierarchy:
Evaluation.evaluate (root)
  ├── Evaluation.predict_and_score (one per dataset row x trials)
  │     ├── model.predict (the actual model call)
  │     ├── scorer_1.score
  │     └── scorer_2.score
  └── Evaluation.summarize
Extract per-task results into a DataFrame:
python
import pandas as pd
from weave_helpers import eval_results_to_dicts, results_summary

# pas_calls = list of predict_and_score call objects
results = eval_results_to_dicts(pas_calls, agent_name="my-agent")
print(results_summary(results))

df = pd.DataFrame(results)
print(df.groupby("passed")["score"].mean())
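The call hierarchy above can be rebuilt from flat records by grouping on the parent link. The sketch below assumes each record exposes `id`, `parent_id`, and `op` fields; those names and the sample records are illustrative assumptions, not the exact shape returned by the Weave API.

```python
from collections import defaultdict

# Made-up call records; real calls come from the Weave API.
calls = [
    {"id": "e1", "parent_id": None, "op": "Evaluation.evaluate"},
    {"id": "p1", "parent_id": "e1", "op": "Evaluation.predict_and_score"},
    {"id": "p2", "parent_id": "e1", "op": "Evaluation.predict_and_score"},
    {"id": "m1", "parent_id": "p1", "op": "model.predict"},
    {"id": "s1", "parent_id": "p1", "op": "scorer_1.score"},
    {"id": "z1", "parent_id": "e1", "op": "Evaluation.summarize"},
]

# Index children by parent id so each lookup is a dictionary access.
children = defaultdict(list)
for call in calls:
    children[call["parent_id"]].append(call)


def render(call, depth=0):
    """Return indented op names for a call and all of its descendants."""
    lines = ["  " * depth + call["op"]]
    for child in children[call["id"]]:
        lines.extend(render(child, depth + 1))
    return lines


root = children[None][0]
tree = render(root)
print("\n".join(tree))

# The predict_and_score children of the root are the per-row results.
pas_ids = [c["id"] for c in children[root["id"]] if c["op"].endswith("predict_and_score")]
print(pas_ids)
```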

Eval health and efficiency

python
from weave_helpers import eval_health, eval_efficiency

health = eval_health(eval_calls)
df = pd.DataFrame(health)
print(df.to_string(index=False))

efficiency = eval_efficiency(eval_calls)
print(pd.DataFrame(efficiency).to_string(index=False))
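For intuition about a tokens-per-success metric, the arithmetic can be sketched on plain dicts. This is not the implementation of `eval_efficiency`; the field names (`total_tokens`, `success`, `error`) and the numbers are assumptions chosen for illustration.

```python
# Made-up per-evaluation records; field names are assumptions.
eval_rows = [
    {"name": "agent-a", "total_tokens": 12000, "success": 8, "error": 2},
    {"name": "agent-b", "total_tokens": 9000, "success": 0, "error": 10},
]


def tokens_per_success(row):
    """Tokens spent per successful row, or None when nothing succeeded."""
    if row["success"] == 0:
        return None  # Avoid dividing by zero; "no successes" is not a cost of 0.
    return row["total_tokens"] / row["success"]


for row in eval_rows:
    row["tokens_per_success"] = tokens_per_success(row)
    print(row["name"], row["tokens_per_success"])
```

Returning `None` for an evaluation with no successes keeps it from looking artificially cheap when rows are sorted by this metric.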

Token usage

python
from weave_helpers import get_token_usage

usage = get_token_usage(call)
print(f"Tokens: {usage['total_tokens']} (in={usage['input_tokens']}, out={usage['output_tokens']})")

Cost estimation

python
call_with_costs = client.get_call("id", include_costs=True)
costs = call_with_costs.summary.get("weave", {}).get("costs", {})

Run diagnostics

python
from wandb_helpers import diagnose_run

run = api.run(f"{path}/run-id")
diag = diagnose_run(run)
for k, v in diag.items():
    print(f"  {k}: {v}")

Error analysis — open coding to axial coding

For structured failure analysis on eval results:
  1. Understand data shape: use `project.summary()`, `calls.input_shape()`, `calls.output_shape()`
  2. Open coding: write a Weave Scorer that journals what went wrong per failing call
  3. Axial coding: write a second Scorer that classifies notes into a taxonomy
  4. Summarize: count primary labels with `collections.Counter`
See `references/WEAVE_API.md` for the full `run_scorer` API.
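The final summarize step needs nothing beyond the standard library. A minimal sketch, assuming axial coding has already produced one primary label per failing call (the labels below are invented):

```python
from collections import Counter

# Invented axial-coding output: one primary label per failing call.
labels = ["timeout", "wrong_tool", "timeout", "bad_format", "timeout", "wrong_tool"]

counts = Counter(labels)
total = sum(counts.values())

# Print labels from most to least frequent with their share of failures.
for label, n in counts.most_common():
    print(f"{label:<12} {n:>3}  {n / total:.0%}")
```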

W&B Reports

bash
uv pip install "wandb[workspaces]"
python
from wandb.apis import reports as wr
import wandb_workspaces.expr as expr

report = wr.Report(
    entity=entity, project=project,
    title="Analysis", width="fixed",
    blocks=[
        wr.H1(text="Results"),
        wr.PanelGrid(
            runsets=[wr.Runset(entity=entity, project=project)],
            panels=[wr.LinePlot(title="Loss", x="_step", y=["loss"])],
        ),
    ],
)

# report.save(draft=True)  # only when asked to publish

Use `expr.Config("lr")`, `expr.Summary("loss")`, `expr.Tags().isin([...])` for runset filters, not dot-path strings.

---

Gotchas

Weave API

| Gotcha | Wrong | Right |
|---|---|---|
| weave.init args | `weave.init(project="x")` | `weave.init("x")` (positional) |
| Parent filter | `filter={'parent_id': 'x'}` | `filter={'parent_ids': ['x']}` (plural, list) |
| WeaveObject access | `rubric.get('passed')` | `getattr(rubric, 'passed', None)` |
| Nested output | `out.get('succeeded')` | `out.get('output').get('succeeded')` (output.output) |
| ObjectRef comparison | `name_ref == "foo"` | `str(name_ref) == "foo"` |
| CallsFilter import | `from weave import CallsFilter` | `from weave.trace.weave_client import CallsFilter` |
| Query import | `from weave import Query` | `from weave.trace_server.interface.query import Query` |
| Eval status path | `summary["status"]` | `summary["weave"]["status"]` |
| Eval success count | `summary["success_count"]` | `summary["weave"]["status_counts"]["success"]` |
| When in doubt | Guess the type | `unwrap()` first, then inspect |
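Several of these gotchas involve nested lookups that fail when an intermediate level is missing or `None`. A small path-following helper can make those lookups safe. `dig` is a hypothetical function, and the sample dicts merely imitate the nested shapes named in the table.

```python
def dig(data, *keys, default=None):
    """Follow keys through nested dict-like values, returning default on any miss."""
    current = data
    for key in keys:
        if not hasattr(current, "get"):
            return default
        current = current.get(key)
        if current is None:
            return default
    return current


# Sample values imitating the nested shapes; not real Weave objects.
out = {"output": {"succeeded": True}}
empty = {"output": None}
summary = {"weave": {"status": "success", "status_counts": {"success": 9, "error": 1}}}

print(dig(out, "output", "succeeded"))
print(dig(empty, "output", "succeeded", default=False))
print(dig(summary, "weave", "status_counts", "success", default=0))
```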

WeaveDict vs WeaveObject

  • WeaveDict: dict-like, supports `.get()`, `.keys()`, `[]`. Used for: `call.inputs`, `call.output`, `scores` dict
  • WeaveObject: attribute-based, use `getattr()`. Used for: scorer results (rubric), dataset rows
  • When in doubt: use `unwrap()` to convert everything to plain Python
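When the same code path may receive either kind of value, a duck-typed accessor avoids guessing. The sketch below uses a plain dict and a `SimpleNamespace` as stand-ins; whether the real Weave wrapper types pass the same checks is an assumption, and `field` is a hypothetical helper.

```python
from types import SimpleNamespace


def field(obj, name, default=None):
    """Read a field from either a dict-like or an attribute-based object."""
    if hasattr(obj, "get") and hasattr(obj, "keys"):
        return obj.get(name, default)  # dict-like access
    return getattr(obj, name, default)  # attribute-based access


scores = {"passed": True}               # dict-like stand-in
rubric = SimpleNamespace(passed=False)  # attribute-based stand-in

print(field(scores, "passed"), field(rubric, "passed"), field(rubric, "missing", "n/a"))
```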

W&B API

| Gotcha | Wrong | Right |
|---|---|---|
| Summary access | `run.summary["loss"]` | `run.summary_metrics.get("loss")` |
| Loading all runs | `list(api.runs(...))` | `runs[:200]` (always slice) |
| History, all fields | `run.history()` | `run.history(samples=500, keys=["loss"])` |
| scan_history, no keys | `scan_history()` | `scan_history(keys=["loss"])` (explicit) |
| Raw data in context | `print(run.history())` | Load into DataFrame, compute stats |
| Metric at step N | iterate entire history | `scan_history(keys=["loss"], min_step=N, max_step=N+1)` |
| Cache staleness | reading live run | `api.flush()` first |
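The "always slice" advice generalizes to any lazy sequence: bound the iteration instead of materializing everything. As an illustration with a stand-in generator (`fake_history` is hypothetical and does not call W&B), `itertools.islice` caps how many rows are pulled:

```python
from itertools import islice


def fake_history(n_steps):
    """Stand-in generator for a long history; yields one dict per step."""
    for step in range(n_steps):
        yield {"_step": step, "loss": 1.0 / (step + 1)}


# Take at most 200 rows without materializing the full sequence.
first_rows = list(islice(fake_history(1_000_000), 200))
losses = [row["loss"] for row in first_rows]
print(f"rows={len(first_rows)}, min={min(losses):.4f}, final={losses[-1]:.4f}")
```

Because the generator is consumed lazily, only the first 200 rows are ever produced, no matter how long the underlying history is.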

Package management

| Gotcha | Wrong | Right |
|---|---|---|
| Installing packages | `pip install pandas` | `uv pip install pandas` |
| Running scripts | `python script.py` | `uv run script.py` |
| Quick one-off | `pip install rich && python -c ...` | `uv run --with rich python -c ...` |

Weave logging noise

Weave prints version warnings to stderr. Suppress with:
python
import logging
logging.getLogger("weave").setLevel(logging.ERROR)

Quick reference

python
# --- Weave: How many traces? ---
from weave_tools.weave_api import init, Project
init(f"{entity}/{project}")
project = Project.current()
print(project.summary())

# --- Weave: Recent evals ---
evals = project.evals(limit=10)
for ev in evals:
    print(ev.summarize())

# --- Weave: Failed calls ---
calls = project.calls(op="predict")
failed = calls.limit(1000).filter(lambda c: c.status == "error")

# --- W&B: Best run by loss ---
best = api.runs(path, filters={"state": "finished"}, order="+summary_metrics.loss")[:1]
print(f"Best: {best[0].name}, loss={best[0].summary_metrics.get('loss')}")

# --- W&B: Loss curve to numpy ---
losses = np.array([r["loss"] for r in run.scan_history(keys=["loss"])])
print(f"min={losses.min():.6f}, final={losses[-1]:.6f}, steps={len(losses)}")

# --- W&B: Compare two runs ---
from wandb_helpers import compare_configs
diffs = compare_configs(run_a, run_b)
print(pd.DataFrame(diffs).to_string(index=False))