Weights & Biases: ML Experiment Tracking & MLOps

When to Use This Skill

Use Weights & Biases (W&B) when you need to:
  • Track ML experiments with automatic metric logging
  • Visualize training in real-time dashboards
  • Compare runs across hyperparameters and configurations
  • Optimize hyperparameters with automated sweeps
  • Manage model registry with versioning and lineage
  • Collaborate on ML projects with team workspaces
  • Track artifacts (datasets, models, code) with lineage
Users: 200,000+ ML practitioners | GitHub Stars: 10.5k+ | Integrations: 100+

Installation

bash
# Install W&B
pip install wandb

# Log in (prompts for your API key and stores it locally)
wandb login

# Or provide the API key through an environment variable
export WANDB_API_KEY=your_api_key_here

Quick Start

Basic Experiment Tracking

python
import wandb

# Initialize a run
run = wandb.init(
    project="my-project",
    config={
        "learning_rate": 0.001,
        "epochs": 10,
        "batch_size": 32,
        "architecture": "ResNet50",
    },
)

# Training loop
for epoch in range(run.config.epochs):
    # Your training code; train_epoch() and validate() are placeholders
    # that each return (loss, accuracy)
    train_loss, train_acc = train_epoch()
    val_loss, val_acc = validate()

    # Log metrics
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "train/accuracy": train_acc,
        "val/accuracy": val_acc,
    })

# Finish the run
wandb.finish()

With PyTorch

python
import torch
import wandb

# Initialize
wandb.init(project="pytorch-demo", config={"lr": 0.001, "epochs": 10})

# Access config
config = wandb.config

# Training loop (model, criterion, optimizer, and train_loader are defined elsewhere)
for epoch in range(config.epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Forward pass
        output = model(data)
        loss = criterion(output, target)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Log every 100 batches
        if batch_idx % 100 == 0:
            wandb.log({
                "loss": loss.item(),
                "epoch": epoch,
                "batch": batch_idx,
            })

# Save model
torch.save(model.state_dict(), "model.pth")
wandb.save("model.pth")  # Upload to W&B
wandb.finish()

Core Concepts

1. Projects and Runs

Project: Collection of related experiments
Run: Single execution of your training script

python
# Create/use project
run = wandb.init(
    project="image-classification",
    name="resnet50-experiment-1",  # Optional run name
    tags=["baseline", "resnet"],   # Organize with tags
    notes="First baseline run",    # Add notes
)

# Each run has a unique ID
print(f"Run ID: {run.id}")
print(f"Run URL: {run.url}")

2. Configuration Tracking

Pass hyperparameters as config so they are recorded with the run:
python
config = {
    # Model architecture
    "model": "ResNet50",
    "pretrained": True,

    # Training params
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 50,
    "optimizer": "Adam",

    # Data params
    "dataset": "ImageNet",
    "augmentation": "standard"
}

wandb.init(project="my-project", config=config)

# Access config during training
lr = wandb.config.learning_rate
batch_size = wandb.config.batch_size
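
Hyperparameters often arrive as command-line flags rather than a hand-written dict. The following is a minimal sketch of turning parsed flags into a plain dict that could be passed as config. The helper name parse_config, the flag names, and the defaults are assumptions chosen for illustration.

```python
import argparse


def parse_config(argv=None):
    """Parse command-line flags into a plain dict usable as a run config.

    The flag names and defaults here are illustrative assumptions.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--optimizer", type=str, default="Adam")
    return vars(parser.parse_args(argv))


# Explicit argument list so the example does not depend on sys.argv
config = parse_config(["--learning_rate", "0.01", "--batch_size", "64"])
print(config)
```

The resulting dict can then be handed to wandb.init(project="my-project", config=config), so every run records the flags it was launched with.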

3. Metric Logging

python
# Log scalars
wandb.log({"loss": 0.5, "accuracy": 0.92})

# Log multiple metrics
wandb.log({
    "train/loss": train_loss,
    "train/accuracy": train_acc,
    "val/loss": val_loss,
    "val/accuracy": val_acc,
    "learning_rate": current_lr,
    "epoch": epoch
})

# Log with an explicit step as the x-axis
wandb.log({"loss": loss}, step=global_step)

# Log media (images, audio, video)
wandb.log({"examples": [wandb.Image(img) for img in images]})

# Log histograms
wandb.log({"gradients": wandb.Histogram(gradients)})

# Log tables
table = wandb.Table(columns=["id", "prediction", "ground_truth"])
wandb.log({"predictions": table})
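
When step is passed explicitly, its value should increase monotonically from one call to the next. The following is a minimal sketch of deriving such a counter from epoch and batch indices. The helper name global_step and the fixed number of batches per epoch are assumptions for illustration.

```python
def global_step(epoch, batch_idx, steps_per_epoch):
    """Map (epoch, batch) to a single step counter that never decreases."""
    if not 0 <= batch_idx < steps_per_epoch:
        raise ValueError("batch_idx must be in [0, steps_per_epoch)")
    return epoch * steps_per_epoch + batch_idx


# Steps for 2 epochs of 3 batches each
steps = [global_step(e, b, 3) for e in range(2) for b in range(3)]
print(steps)  # [0, 1, 2, 3, 4, 5]
```

Inside a training loop the result would be passed as the step argument, for example wandb.log({"loss": loss}, step=global_step(epoch, batch_idx, len(train_loader))).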

4. Model Checkpointing

python
import torch
import wandb

# Save model checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')

# Upload to W&B
wandb.save('checkpoint.pth')

# Or use Artifacts (recommended)
artifact = wandb.Artifact('model', type='model')
artifact.add_file('checkpoint.pth')
wandb.log_artifact(artifact)
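
Artifact aliases are a convenient way to mark which checkpoint is the best so far. The following is a minimal sketch of choosing aliases per checkpoint, assuming a lower validation loss is better. The helper name checkpoint_aliases and the alias strings are hypothetical.

```python
def checkpoint_aliases(epoch, val_loss, best_val_loss):
    """Return (aliases, new_best) for a checkpoint with the given validation loss."""
    aliases = [f"epoch-{epoch}"]
    if best_val_loss is None or val_loss < best_val_loss:
        # This checkpoint improves on everything seen so far
        aliases.append("best")
        best_val_loss = val_loss
    return aliases, best_val_loss


best = None
history = []
for epoch, loss in enumerate([0.9, 0.7, 0.8]):
    aliases, best = checkpoint_aliases(epoch, loss, best)
    history.append(aliases)

print(history)  # [['epoch-0', 'best'], ['epoch-1', 'best'], ['epoch-2']]
```

The returned list could be passed along with the checkpoint, as in wandb.log_artifact(artifact, aliases=aliases).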

Hyperparameter Sweeps

Automatically search for optimal hyperparameters.

Define Sweep Configuration

python
sweep_config = {
    'method': 'bayes',  # or 'grid', 'random'
    'metric': {
        'name': 'val/accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'learning_rate': {
            # log_uniform_values takes min and max as actual values;
            # plain log_uniform interprets them as exponents
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        },
        'optimizer': {
            'values': ['adam', 'sgd', 'rmsprop']
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        }
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")
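
A log-uniform distribution spreads samples evenly across orders of magnitude, which is why it suits learning rates. The following sketch illustrates the idea in plain Python. It is not W&B's sampler, and sample_log_uniform is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed for reproducibility
samples = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(1000)]

# 1e-3 is the midpoint of [1e-5, 1e-1] on a log scale,
# so roughly half of the samples should fall below it
below_mid = sum(s < 1e-3 for s in samples)
print(f"{below_mid} of {len(samples)} samples are below 1e-3")
```

A plain uniform distribution over the same range would place about 99% of its samples above 1e-3, leaving the small learning rates almost unexplored.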

Define Training Function

python
def train():
    # Initialize run
    run = wandb.init()

    # Access sweep parameters
    lr = wandb.config.learning_rate
    batch_size = wandb.config.batch_size
    optimizer_name = wandb.config.optimizer

    # Build model with sweep config
    model = build_model(wandb.config)
    optimizer = get_optimizer(optimizer_name, lr)

    # Training loop
    for epoch in range(NUM_EPOCHS):
        train_loss = train_epoch(model, optimizer, batch_size)
        val_acc = validate(model)

        # Log metrics
        wandb.log({
            "train/loss": train_loss,
            "val/accuracy": val_acc
        })

# Run sweep
wandb.agent(sweep_id, function=train, count=50)  # Run 50 trials

Sweep Strategies

python
# Grid search - exhaustive
sweep_config = {
    'method': 'grid',
    'parameters': {
        'lr': {'values': [0.001, 0.01, 0.1]},
        'batch_size': {'values': [16, 32, 64]}
    }
}

# Random search
sweep_config = {
    'method': 'random',
    'parameters': {
        'lr': {'distribution': 'uniform', 'min': 0.0001, 'max': 0.1},
        'dropout': {'distribution': 'uniform', 'min': 0.1, 'max': 0.5}
    }
}

# Bayesian optimization (recommended)
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/loss', 'goal': 'minimize'},
    'parameters': {
        # log_uniform_values takes min and max as actual values
        'lr': {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-1}
    }
}
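
Grid search runs every combination of the listed values, so the number of runs is the product of the list lengths. The following is a minimal sketch in plain Python, reusing the grid values from the example above, of enumerating those combinations before launching a sweep.

```python
from itertools import product

parameters = {
    "lr": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
}

# Every combination a grid search over these values would visit
names = list(parameters)
grid = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(grid))  # 9
print(grid[0])    # {'lr': 0.001, 'batch_size': 16}
```

Checking the count first helps when deciding whether an exhaustive grid is affordable or whether random or Bayesian search is the better fit.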

Artifacts

Track datasets, models, and other files with lineage.

Log Artifacts

python
# Create artifact
artifact = wandb.Artifact(
    name='training-dataset',
    type='dataset',
    description='ImageNet training split',
    metadata={'size': '1.2M images', 'split': 'train'}
)

# Add files
artifact.add_file('data/train.csv')
artifact.add_dir('data/images/')

# Log artifact (requires an active run)
wandb.log_artifact(artifact)

Use Artifacts

python
# Download and use artifact
run = wandb.init(project="my-project")

# Download artifact
artifact = run.use_artifact('training-dataset:latest')
artifact_dir = artifact.download()

# Use the data
data = load_data(f"{artifact_dir}/train.csv")
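
Artifacts are referenced by name plus a version or alias, as in 'training-dataset:latest' above. The following is a small sketch of assembling such reference strings. The helper name artifact_ref is hypothetical, and the optional entity/project prefix is shown on the assumption that both are supplied together.

```python
def artifact_ref(name, alias="latest", entity=None, project=None):
    """Build a reference of the form name:alias or entity/project/name:alias."""
    if (entity is None) != (project is None):
        raise ValueError("provide both entity and project, or neither")
    prefix = f"{entity}/{project}/" if entity else ""
    return f"{prefix}{name}:{alias}"


print(artifact_ref("training-dataset"))
# training-dataset:latest
print(artifact_ref("training-dataset", "v3", entity="my-team", project="my-project"))
# my-team/my-project/training-dataset:v3
```

Pinning a specific version such as v3 instead of latest keeps a training run reproducible after newer dataset versions are logged.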

Model Registry

python
# Log model as artifact
model_artifact = wandb.Artifact(
    name='resnet50-model',
    type='model',
    metadata={'architecture': 'ResNet50', 'accuracy': 0.95}
)
model_artifact.add_file('model.pth')
wandb.log_artifact(model_artifact, aliases=['best', 'production'])

# Link to model registry (run is the active run returned by wandb.init)
run.link_artifact(model_artifact, 'model-registry/production-models')

Integration Examples

HuggingFace Transformers

python
from transformers import Trainer, TrainingArguments
import wandb

# Initialize W&B
wandb.init(project="hf-transformers")

# Training arguments with W&B
training_args = TrainingArguments(
    output_dir="./results",
    report_to="wandb",  # Enable W&B logging
    run_name="bert-finetuning",
    logging_steps=100,
    save_steps=500
)

# Trainer automatically logs to W&B
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()

PyTorch Lightning

python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
import wandb

# Create W&B logger
wandb_logger = WandbLogger(
    project="lightning-demo",
    log_model=True  # Log model checkpoints
)

# Use with Trainer
trainer = Trainer(
    logger=wandb_logger,
    max_epochs=10
)
trainer.fit(model, datamodule=dm)

Keras/TensorFlow

python
import wandb
# Newer wandb releases expose this callback under wandb.integration.keras
from wandb.keras import WandbCallback

# Initialize
wandb.init(project="keras-demo")

# Add callback
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=10,
    callbacks=[WandbCallback()]  # Auto-logs metrics
)

Visualization & Analysis

Custom Charts

python
# Log custom visualizations
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x, y)
wandb.log({"custom_plot": wandb.Image(fig)})

# Log confusion matrix
wandb.log({"conf_mat": wandb.plot.confusion_matrix(
    probs=None,
    y_true=ground_truth,
    preds=predictions,
    class_names=class_names
)})
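
A confusion matrix tabulates, for each true class, how often each class was predicted. The following sketch computes those counts in plain Python from integer class indices, to show the kind of data the chart summarizes. The helper name confusion_counts is hypothetical and this is not W&B's implementation.

```python
def confusion_counts(y_true, preds, num_classes):
    """Return a square matrix: rows are true labels, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, preds):
        matrix[true_label][predicted_label] += 1
    return matrix


ground_truth = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
print(confusion_counts(ground_truth, predictions, 3))
# [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

Diagonal entries count correct predictions. Off-diagonal entries show which classes are being confused with each other.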

Reports

Create shareable reports in W&B UI:
  • Combine runs, charts, and text
  • Markdown support
  • Embeddable visualizations
  • Team collaboration

Best Practices

1. Organize with Tags and Groups

python
wandb.init(
    project="my-project",
    tags=["baseline", "resnet50", "imagenet"],
    group="resnet-experiments",  # Group related runs
    job_type="train"             # Type of job
)

2. Log Everything Relevant

python
# Log custom system metrics (W&B also records basic system metrics automatically)
wandb.log({
    "gpu/util": gpu_utilization,
    "gpu/memory": gpu_memory_used,
    "cpu/util": cpu_utilization
})

# Record the code version in the run config (a fixed value, not a time series)
wandb.config.update({"git_commit": git_commit_hash})

# Log data splits
wandb.log({
    "data/train_size": len(train_dataset),
    "data/val_size": len(val_dataset)
})
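
The snippet above uses git_commit_hash without showing where it comes from. One way to obtain it, sketched here under the assumption that the script runs inside a Git checkout with the git executable on the PATH, is to ask Git directly. The helper name current_git_commit is hypothetical.

```python
import subprocess


def current_git_commit():
    """Return the current commit hash, or None if git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the working directory is not a repository
        return None
    return result.stdout.strip()


git_commit_hash = current_git_commit()
print(git_commit_hash)
```

Returning None instead of raising lets training continue when the code is run from an exported archive rather than a repository.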

3. Use Descriptive Names

python
# ✅ Good: Descriptive run names
wandb.init(
    project="nlp-classification",
    name="bert-base-lr0.001-bs32-epoch10"
)

# ❌ Bad: Generic names
wandb.init(project="nlp", name="run1")
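
Descriptive names are easier to keep consistent when they are generated from the hyperparameters rather than typed by hand. The following is a minimal sketch, with make_run_name as a hypothetical helper and the key abbreviations chosen for illustration.

```python
def make_run_name(model, hparams):
    """Join a model name and key hyperparameters into a run name."""
    parts = [model] + [f"{key}{value}" for key, value in hparams.items()]
    return "-".join(parts)


name = make_run_name("bert-base", {"lr": 0.001, "bs": 32, "epoch": 10})
print(name)  # bert-base-lr0.001-bs32-epoch10
```

The result can be passed as the name argument of wandb.init, so the run name always matches the values actually used.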

4. Save Important Artifacts

python
# Save final model
artifact = wandb.Artifact('final-model', type='model')
artifact.add_file('model.pth')
wandb.log_artifact(artifact)

# Save predictions for analysis
predictions_table = wandb.Table(
    columns=["id", "input", "prediction", "ground_truth"],
    data=predictions_data
)
wandb.log({"predictions": predictions_table})

5. Use Offline Mode for Unstable Connections

python
import os
import wandb

# Enable offline mode
os.environ["WANDB_MODE"] = "offline"
wandb.init(project="my-project")

# ... your code ...

# Sync later, from a shell:
# wandb sync <run_directory>
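
After several offline runs it can be handy to list everything that still needs syncing. The sketch below builds one wandb sync command per run directory. It assumes offline runs are stored in directories named offline-run-* under a local wandb folder, which should be checked against your installation. The helper name sync_commands is hypothetical.

```python
import shlex
import tempfile
from pathlib import Path


def sync_commands(wandb_dir):
    """Return one `wandb sync` command per offline run directory found."""
    run_dirs = sorted(Path(wandb_dir).glob("offline-run-*"))
    return [
        f"wandb sync {shlex.quote(str(run_dir))}"
        for run_dir in run_dirs
        if run_dir.is_dir()
    ]


# Demonstrate on a temporary directory containing two fake offline runs
with tempfile.TemporaryDirectory() as tmp:
    for run_name in ("offline-run-20240101_120000-abc123",
                     "offline-run-20240102_090000-def456"):
        (Path(tmp) / run_name).mkdir()
    commands = sync_commands(tmp)

for command in commands:
    print(command)
```

The function only prints the commands rather than running them, so they can be reviewed before any data is uploaded.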

Team Collaboration

Share Runs

python
# Runs are automatically shareable via URL
run = wandb.init(project="team-project")
print(f"Share this URL: {run.url}")

Team Projects

  • Create team account at wandb.ai
  • Add team members
  • Set project visibility (private/public)
  • Use team-level artifacts and model registry

Pricing

  • Free: Unlimited public projects, 100GB storage
  • Academic: Free for students/researchers
  • Teams: $50/seat/month, private projects, unlimited storage
  • Enterprise: Custom pricing, on-prem options

Resources

See Also

  • references/sweeps.md
    - Comprehensive hyperparameter optimization guide
  • references/artifacts.md
    - Data and model versioning patterns
  • references/integrations.md
    - Framework-specific examples