# Weights & Biases: ML Experiment Tracking & MLOps
## When to Use This Skill
Use Weights & Biases (W&B) when you need to:
- Track ML experiments with automatic metric logging
- Visualize training in real-time dashboards
- Compare runs across hyperparameters and configurations
- Optimize hyperparameters with automated sweeps
- Manage model registry with versioning and lineage
- Collaborate on ML projects with team workspaces
- Track artifacts (datasets, models, code) with lineage
Users: 200,000+ ML practitioners | GitHub Stars: 10.5k+ | Integrations: 100+
## Installation
```bash
# Install W&B
pip install wandb

# Login (creates an API key)
wandb login

# Or set the API key via an environment variable
export WANDB_API_KEY=your_api_key_here
```
## Quick Start
### Basic Experiment Tracking
```python
import wandb

# Initialize a run
run = wandb.init(
    project="my-project",
    config={
        "learning_rate": 0.001,
        "epochs": 10,
        "batch_size": 32,
        "architecture": "ResNet50"
    }
)

# Training loop
for epoch in range(run.config.epochs):
    # Your training code
    train_loss, train_acc = train_epoch()
    val_loss, val_acc = validate()

    # Log metrics
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "train/accuracy": train_acc,
        "val/accuracy": val_acc
    })

# Finish the run
wandb.finish()
```
### With PyTorch
```python
import torch
import wandb

# Initialize
wandb.init(project="pytorch-demo", config={
    "lr": 0.001,
    "epochs": 10
})

# Access config
config = wandb.config

# Training loop
for epoch in range(config.epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Forward pass
        output = model(data)
        loss = criterion(output, target)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Log every 100 batches
        if batch_idx % 100 == 0:
            wandb.log({
                "loss": loss.item(),
                "epoch": epoch,
                "batch": batch_idx
            })

# Save the model and upload it to W&B
torch.save(model.state_dict(), "model.pth")
wandb.save("model.pth")
wandb.finish()
```
## Core Concepts
### 1. Projects and Runs
- Project: a collection of related experiments
- Run: a single execution of your training script

```python
# Create or reuse a project
run = wandb.init(
    project="image-classification",
    name="resnet50-experiment-1",  # Optional run name
    tags=["baseline", "resnet"],   # Organize with tags
    notes="First baseline run"     # Add notes
)

# Each run has a unique ID and URL
print(f"Run ID: {run.id}")
print(f"Run URL: {run.url}")
```
### 2. Configuration Tracking
Track hyperparameters automatically:

```python
config = {
    # Model architecture
    "model": "ResNet50",
    "pretrained": True,

    # Training params
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 50,
    "optimizer": "Adam",

    # Data params
    "dataset": "ImageNet",
    "augmentation": "standard"
}

wandb.init(project="my-project", config=config)

# Access config during training
lr = wandb.config.learning_rate
batch_size = wandb.config.batch_size
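`wandb.config` exposes each config key both as an attribute and as a dict entry. A minimal pure-Python sketch of that access pattern (the `AttrConfig` class below is a hypothetical illustration, not part of the W&B API):

```python
class AttrConfig:
    """Illustrative dict-with-attribute-access wrapper, similar in
    spirit to how wandb.config exposes hyperparameters."""

    def __init__(self, values):
        self._values = dict(values)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, key):
        return self._values[key]

config = AttrConfig({"learning_rate": 0.001, "batch_size": 32})
print(config.learning_rate)   # attribute-style access -> 0.001
print(config["batch_size"])   # dict-style access -> 32
```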
### 3. Metric Logging
```python
# Log scalars
wandb.log({"loss": 0.5, "accuracy": 0.92})

# Log multiple metrics at once
wandb.log({
    "train/loss": train_loss,
    "train/accuracy": train_acc,
    "val/loss": val_loss,
    "val/accuracy": val_acc,
    "learning_rate": current_lr,
    "epoch": epoch
})

# Log with a custom x-axis
wandb.log({"loss": loss}, step=global_step)

# Log media (images, audio, video)
wandb.log({"examples": [wandb.Image(img) for img in images]})

# Log histograms
wandb.log({"gradients": wandb.Histogram(gradients)})

# Log tables
table = wandb.Table(columns=["id", "prediction", "ground_truth"])
wandb.log({"predictions": table})
```
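When logging with `step=`, the step value must increase monotonically across `wandb.log` calls. A common way to derive it is from the epoch number and batch index (the `global_step` helper below is illustrative, not a W&B API):

```python
def global_step(epoch, batch_idx, batches_per_epoch):
    """Illustrative helper: derive a monotonically increasing
    logging step from the epoch number and batch index."""
    return epoch * batches_per_epoch + batch_idx

# With 500 batches per epoch, batch 20 of epoch 3 is step 1520
print(global_step(3, 20, 500))  # -> 1520
```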
### 4. Model Checkpointing
```python
import torch
import wandb

# Save a model checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')

# Upload to W&B
wandb.save('checkpoint.pth')

# Or use Artifacts (recommended)
artifact = wandb.Artifact('model', type='model')
artifact.add_file('checkpoint.pth')
wandb.log_artifact(artifact)
```
## Hyperparameter Sweeps
Automatically search for optimal hyperparameters.
### Define Sweep Configuration
```python
sweep_config = {
    'method': 'bayes',  # or 'grid', 'random'
    'metric': {
        'name': 'val/accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'learning_rate': {
            # 'log_uniform_values' takes actual values;
            # plain 'log_uniform' interprets min/max as exponents
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        },
        'optimizer': {
            'values': ['adam', 'sgd', 'rmsprop']
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        }
    }
}

# Initialize the sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")
```
### Define Training Function
```python
def train():
    # Initialize run
    run = wandb.init()

    # Access sweep parameters
    lr = wandb.config.learning_rate
    batch_size = wandb.config.batch_size
    optimizer_name = wandb.config.optimizer

    # Build model with sweep config
    model = build_model(wandb.config)
    optimizer = get_optimizer(optimizer_name, lr)

    # Training loop
    for epoch in range(NUM_EPOCHS):
        train_loss = train_epoch(model, optimizer, batch_size)
        val_acc = validate(model)

        # Log metrics
        wandb.log({
            "train/loss": train_loss,
            "val/accuracy": val_acc
        })

# Run the sweep
wandb.agent(sweep_id, function=train, count=50)  # Run 50 trials
```
### Sweep Strategies
```python
# Grid search - exhaustive
sweep_config = {
    'method': 'grid',
    'parameters': {
        'lr': {'values': [0.001, 0.01, 0.1]},
        'batch_size': {'values': [16, 32, 64]}
    }
}

# Random search
sweep_config = {
    'method': 'random',
    'parameters': {
        'lr': {'distribution': 'uniform', 'min': 0.0001, 'max': 0.1},
        'dropout': {'distribution': 'uniform', 'min': 0.1, 'max': 0.5}
    }
}

# Bayesian optimization (recommended)
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/loss', 'goal': 'minimize'},
    'parameters': {
        'lr': {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-1}
    }
}
```
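With the `grid` method, the agent runs one trial per parameter combination, so the trial count is the product of the value-list lengths. A pure-Python sketch of that enumeration (illustrative only, not how the W&B backend is implemented):

```python
from itertools import product

parameters = {
    'lr': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
}

# Enumerate every combination, as a grid sweep would
names = list(parameters)
combos = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(combos))  # 3 x 3 = 9 trials
print(combos[0])    # {'lr': 0.001, 'batch_size': 16}
```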
## Artifacts
Track datasets, models, and other files with lineage.
### Log Artifacts
```python
# Create an artifact
artifact = wandb.Artifact(
    name='training-dataset',
    type='dataset',
    description='ImageNet training split',
    metadata={'size': '1.2M images', 'split': 'train'}
)

# Add files
artifact.add_file('data/train.csv')
artifact.add_dir('data/images/')

# Log the artifact
wandb.log_artifact(artifact)
```
### Use Artifacts
```python
# Download and use an artifact
run = wandb.init(project="my-project")

# Download the artifact
artifact = run.use_artifact('training-dataset:latest')
artifact_dir = artifact.download()

# Use the data
data = load_data(f"{artifact_dir}/train.csv")
```
## Model Registry
```python
# Log a model as an artifact
model_artifact = wandb.Artifact(
    name='resnet50-model',
    type='model',
    metadata={'architecture': 'ResNet50', 'accuracy': 0.95}
)
model_artifact.add_file('model.pth')
wandb.log_artifact(model_artifact, aliases=['best', 'production'])

# Link to the model registry
run.link_artifact(model_artifact, 'model-registry/production-models')
```
## Integration Examples
### HuggingFace Transformers
```python
from transformers import Trainer, TrainingArguments
import wandb

# Initialize W&B
wandb.init(project="hf-transformers")

# Training arguments with W&B
training_args = TrainingArguments(
    output_dir="./results",
    report_to="wandb",  # Enable W&B logging
    run_name="bert-finetuning",
    logging_steps=100,
    save_steps=500
)

# Trainer automatically logs to W&B
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
```
### PyTorch Lightning
```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Create a W&B logger
wandb_logger = WandbLogger(
    project="lightning-demo",
    log_model=True  # Log model checkpoints
)

# Use it with the Trainer
trainer = Trainer(
    logger=wandb_logger,
    max_epochs=10
)
trainer.fit(model, datamodule=dm)
```
### Keras/TensorFlow
```python
import wandb
from wandb.keras import WandbCallback

# Initialize
wandb.init(project="keras-demo")

# Add the callback
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=10,
    callbacks=[WandbCallback()]  # Auto-logs metrics
)
```
## Visualization & Analysis
### Custom Charts
```python
import matplotlib.pyplot as plt

# Log custom visualizations
fig, ax = plt.subplots()
ax.plot(x, y)
wandb.log({"custom_plot": wandb.Image(fig)})

# Log a confusion matrix
wandb.log({"conf_mat": wandb.plot.confusion_matrix(
    probs=None,
    y_true=ground_truth,
    preds=predictions,
    class_names=class_names
)})
```
### Reports
Create shareable reports in the W&B UI:
- Combine runs, charts, and text
- Markdown support
- Embeddable visualizations
- Team collaboration
## Best Practices
### 1. Organize with Tags and Groups
```python
wandb.init(
    project="my-project",
    tags=["baseline", "resnet50", "imagenet"],
    group="resnet-experiments",  # Group related runs
    job_type="train"             # Type of job
)
```

### 2. Log Everything Relevant
```python
# Log system metrics
wandb.log({
    "gpu/util": gpu_utilization,
    "gpu/memory": gpu_memory_used,
    "cpu/util": cpu_utilization
})

# Log the code version
wandb.log({"git_commit": git_commit_hash})

# Log data split sizes
wandb.log({
    "data/train_size": len(train_dataset),
    "data/val_size": len(val_dataset)
})
```
### 3. Use Descriptive Names
```python
# ✅ Good: Descriptive run names
wandb.init(
    project="nlp-classification",
    name="bert-base-lr0.001-bs32-epoch10"
)

# ❌ Bad: Generic names
wandb.init(project="nlp", name="run1")
```
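A small helper that derives such names from the config keeps them consistent across runs. A sketch (`build_run_name` is an illustrative helper, not a W&B API):

```python
def build_run_name(model, config, keys=("lr", "bs", "epoch")):
    """Illustrative helper: build a descriptive run name such as
    'bert-base-lr0.001-bs32-epoch10' from a config dict."""
    parts = [model] + [f"{k}{config[k]}" for k in keys if k in config]
    return "-".join(parts)

name = build_run_name("bert-base", {"lr": 0.001, "bs": 32, "epoch": 10})
print(name)  # bert-base-lr0.001-bs32-epoch10
```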
### 4. Save Important Artifacts
```python
# Save the final model
artifact = wandb.Artifact('final-model', type='model')
artifact.add_file('model.pth')
wandb.log_artifact(artifact)

# Save predictions for analysis
predictions_table = wandb.Table(
    columns=["id", "input", "prediction", "ground_truth"],
    data=predictions_data
)
wandb.log({"predictions": predictions_table})
```
### 5. Use Offline Mode for Unstable Connections
```python
import os
import wandb

# Enable offline mode
os.environ["WANDB_MODE"] = "offline"
wandb.init(project="my-project")

# ... your training code ...
```

Sync later from the command line:

```bash
wandb sync <run_directory>
```
## Team Collaboration
### Share Runs
```python
# Runs are automatically shareable via URL
run = wandb.init(project="team-project")
print(f"Share this URL: {run.url}")
```
### Team Projects
- Create a team account at wandb.ai
- Add team members
- Set project visibility (private/public)
- Use team-level artifacts and the model registry
## Pricing
- Free: Unlimited public projects, 100GB storage
- Academic: Free for students/researchers
- Teams: $50/seat/month, private projects, unlimited storage
- Enterprise: Custom pricing, on-prem options
## Resources
- Documentation: https://docs.wandb.ai
- GitHub: https://github.com/wandb/wandb (10.5k+ stars)
- Examples: https://github.com/wandb/examples
- Community: https://wandb.ai/community
- Discord: https://wandb.me/discord
## See Also
- references/sweeps.md - Comprehensive hyperparameter optimization guide
- references/artifacts.md - Data and model versioning patterns
- references/integrations.md - Framework-specific examples