# Weights & Biases: ML Experiment Tracking & MLOps
## When to Use This Skill
Use Weights & Biases (W&B) when you need to:
- Track ML experiments with automatic metric logging
- Visualize training in real-time dashboards
- Compare runs across hyperparameters and configurations
- Optimize hyperparameters with automated sweeps
- Manage model registry with versioning and lineage
- Collaborate on ML projects with team workspaces
- Track artifacts (datasets, models, code) with lineage
Users: 200,000+ ML practitioners | GitHub Stars: 10.5k+ | Integrations: 100+
## Installation
```bash
# Install W&B
pip install wandb

# Login (creates an API key)
wandb login

# Or set the API key via an environment variable
export WANDB_API_KEY=your_api_key_here
```
## Quick Start
### Basic Experiment Tracking
```python
import wandb

# Initialize a run
run = wandb.init(
    project="my-project",
    config={
        "learning_rate": 0.001,
        "epochs": 10,
        "batch_size": 32,
        "architecture": "ResNet50"
    }
)

# Training loop
for epoch in range(run.config.epochs):
    # Your training code
    train_loss, train_acc = train_epoch()
    val_loss, val_acc = validate()

    # Log metrics
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "train/accuracy": train_acc,
        "val/accuracy": val_acc
    })

# Finish the run
wandb.finish()
```
### With PyTorch
```python
import torch
import wandb

# Initialize
wandb.init(project="pytorch-demo", config={
    "lr": 0.001,
    "epochs": 10
})

# Access config
config = wandb.config

# Training loop
for epoch in range(config.epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Forward pass
        output = model(data)
        loss = criterion(output, target)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Log every 100 batches
        if batch_idx % 100 == 0:
            wandb.log({
                "loss": loss.item(),
                "epoch": epoch,
                "batch": batch_idx
            })

# Save the model and upload it to W&B
torch.save(model.state_dict(), "model.pth")
wandb.save("model.pth")
wandb.finish()
```
## Core Concepts
### 1. Projects and Runs
- Project: a collection of related experiments
- Run: a single execution of your training script

```python
# Create or reuse a project
run = wandb.init(
    project="image-classification",
    name="resnet50-experiment-1",  # Optional run name
    tags=["baseline", "resnet"],   # Organize with tags
    notes="First baseline run"     # Add notes
)

# Each run has a unique ID and URL
print(f"Run ID: {run.id}")
print(f"Run URL: {run.url}")
```
### 2. Configuration Tracking
Track hyperparameters automatically:

```python
config = {
    # Model architecture
    "model": "ResNet50",
    "pretrained": True,

    # Training params
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 50,
    "optimizer": "Adam",

    # Data params
    "dataset": "ImageNet",
    "augmentation": "standard"
}

wandb.init(project="my-project", config=config)

# Access config during training
lr = wandb.config.learning_rate
batch_size = wandb.config.batch_size
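`wandb.config` exposes each config key both as an attribute and as a dict entry. A minimal pure-Python sketch of that access pattern (the `AttrConfig` class below is a hypothetical illustration, not part of the W&B API):

```python
class AttrConfig:
    """Illustrative dict-with-attribute-access wrapper, similar in
    spirit to how wandb.config exposes hyperparameters."""

    def __init__(self, values):
        self._values = dict(values)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, key):
        return self._values[key]

config = AttrConfig({"learning_rate": 0.001, "batch_size": 32})
print(config.learning_rate)   # attribute-style access -> 0.001
print(config["batch_size"])   # dict-style access -> 32
```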
### 3. Metric Logging
```python
# Log scalars
wandb.log({"loss": 0.5, "accuracy": 0.92})

# Log multiple metrics at once
wandb.log({
    "train/loss": train_loss,
    "train/accuracy": train_acc,
    "val/loss": val_loss,
    "val/accuracy": val_acc,
    "learning_rate": current_lr,
    "epoch": epoch
})

# Log with a custom x-axis
wandb.log({"loss": loss}, step=global_step)

# Log media (images, audio, video)
wandb.log({"examples": [wandb.Image(img) for img in images]})

# Log histograms
wandb.log({"gradients": wandb.Histogram(gradients)})

# Log tables
table = wandb.Table(columns=["id", "prediction", "ground_truth"])
wandb.log({"predictions": table})
```
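When logging with `step=`, the step value must increase monotonically across `wandb.log` calls. A common way to derive it is from the epoch number and batch index (the `global_step` helper below is illustrative, not a W&B API):

```python
def global_step(epoch, batch_idx, batches_per_epoch):
    """Illustrative helper: derive a monotonically increasing
    logging step from the epoch number and batch index."""
    return epoch * batches_per_epoch + batch_idx

# With 500 batches per epoch, batch 20 of epoch 3 is step 1520
print(global_step(3, 20, 500))  # -> 1520
```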
### 4. Model Checkpointing
```python
import torch
import wandb

# Save a model checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')

# Upload to W&B
wandb.save('checkpoint.pth')

# Or use Artifacts (recommended)
artifact = wandb.Artifact('model', type='model')
artifact.add_file('checkpoint.pth')
wandb.log_artifact(artifact)
```
## Hyperparameter Sweeps
Automatically search for optimal hyperparameters.
### Define Sweep Configuration
```python
sweep_config = {
    'method': 'bayes',  # or 'grid', 'random'
    'metric': {
        'name': 'val/accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'learning_rate': {
            # 'log_uniform_values' takes actual values;
            # plain 'log_uniform' interprets min/max as exponents
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        },
        'optimizer': {
            'values': ['adam', 'sgd', 'rmsprop']
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        }
    }
}

# Initialize the sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")
```
### Define Training Function
```python
def train():
    # Initialize run
    run = wandb.init()

    # Access sweep parameters
    lr = wandb.config.learning_rate
    batch_size = wandb.config.batch_size
    optimizer_name = wandb.config.optimizer

    # Build model with sweep config
    model = build_model(wandb.config)
    optimizer = get_optimizer(optimizer_name, lr)

    # Training loop
    for epoch in range(NUM_EPOCHS):
        train_loss = train_epoch(model, optimizer, batch_size)
        val_acc = validate(model)

        # Log metrics
        wandb.log({
            "train/loss": train_loss,
            "val/accuracy": val_acc
        })

# Run the sweep
wandb.agent(sweep_id, function=train, count=50)  # Run 50 trials
```
### Sweep Strategies
```python
# Grid search - exhaustive
sweep_config = {
    'method': 'grid',
    'parameters': {
        'lr': {'values': [0.001, 0.01, 0.1]},
        'batch_size': {'values': [16, 32, 64]}
    }
}

# Random search
sweep_config = {
    'method': 'random',
    'parameters': {
        'lr': {'distribution': 'uniform', 'min': 0.0001, 'max': 0.1},
        'dropout': {'distribution': 'uniform', 'min': 0.1, 'max': 0.5}
    }
}

# Bayesian optimization (recommended)
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/loss', 'goal': 'minimize'},
    'parameters': {
        'lr': {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-1}
    }
}
```
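With the `grid` method, the agent runs one trial per parameter combination, so the trial count is the product of the value-list lengths. A pure-Python sketch of that enumeration (illustrative only, not how the W&B backend is implemented):

```python
from itertools import product

parameters = {
    'lr': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
}

# Enumerate every combination, as a grid sweep would
names = list(parameters)
combos = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(combos))  # 3 x 3 = 9 trials
print(combos[0])    # {'lr': 0.001, 'batch_size': 16}
```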
## Artifacts
Track datasets, models, and other files with lineage.
### Log Artifacts
```python
# Create an artifact
artifact = wandb.Artifact(
    name='training-dataset',
    type='dataset',
    description='ImageNet training split',
    metadata={'size': '1.2M images', 'split': 'train'}
)

# Add files
artifact.add_file('data/train.csv')
artifact.add_dir('data/images/')

# Log the artifact
wandb.log_artifact(artifact)
```
### Use Artifacts
```python
# Download and use an artifact
run = wandb.init(project="my-project")

# Download the artifact
artifact = run.use_artifact('training-dataset:latest')
artifact_dir = artifact.download()

# Use the data
data = load_data(f"{artifact_dir}/train.csv")
```
## Model Registry
```python
# Log a model as an artifact
model_artifact = wandb.Artifact(
    name='resnet50-model',
    type='model',
    metadata={'architecture': 'ResNet50', 'accuracy': 0.95}
)
model_artifact.add_file('model.pth')
wandb.log_artifact(model_artifact, aliases=['best', 'production'])

# Link to the model registry
run.link_artifact(model_artifact, 'model-registry/production-models')
```
## Integration Examples
### HuggingFace Transformers
```python
from transformers import Trainer, TrainingArguments
import wandb

# Initialize W&B
wandb.init(project="hf-transformers")

# Training arguments with W&B
training_args = TrainingArguments(
    output_dir="./results",
    report_to="wandb",  # Enable W&B logging
    run_name="bert-finetuning",
    logging_steps=100,
    save_steps=500
)

# Trainer automatically logs to W&B
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
```
### PyTorch Lightning
```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Create a W&B logger
wandb_logger = WandbLogger(
    project="lightning-demo",
    log_model=True  # Log model checkpoints
)

# Use it with the Trainer
trainer = Trainer(
    logger=wandb_logger,
    max_epochs=10
)
trainer.fit(model, datamodule=dm)
```
### Keras/TensorFlow
```python
import wandb
from wandb.keras import WandbCallback

# Initialize
wandb.init(project="keras-demo")

# Add the callback
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=10,
    callbacks=[WandbCallback()]  # Auto-logs metrics
)
```
## Visualization & Analysis
### Custom Charts
```python
import matplotlib.pyplot as plt

# Log custom visualizations
fig, ax = plt.subplots()
ax.plot(x, y)
wandb.log({"custom_plot": wandb.Image(fig)})

# Log a confusion matrix
wandb.log({"conf_mat": wandb.plot.confusion_matrix(
    probs=None,
    y_true=ground_truth,
    preds=predictions,
    class_names=class_names
)})
```
### Reports
Create shareable reports in the W&B UI:
- Combine runs, charts, and text
- Markdown support
- Embeddable visualizations
- Team collaboration
## Best Practices
### 1. Organize with Tags and Groups
```python
wandb.init(
    project="my-project",
    tags=["baseline", "resnet50", "imagenet"],
    group="resnet-experiments",  # Group related runs
    job_type="train"             # Type of job
)
```

### 2. Log Everything Relevant
```python
# Log system metrics
wandb.log({
    "gpu/util": gpu_utilization,
    "gpu/memory": gpu_memory_used,
    "cpu/util": cpu_utilization
})

# Log the code version
wandb.log({"git_commit": git_commit_hash})

# Log data split sizes
wandb.log({
    "data/train_size": len(train_dataset),
    "data/val_size": len(val_dataset)
})
```
### 3. Use Descriptive Names
```python
# ✅ Good: Descriptive run names
wandb.init(
    project="nlp-classification",
    name="bert-base-lr0.001-bs32-epoch10"
)

# ❌ Bad: Generic names
wandb.init(project="nlp", name="run1")
```
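A small helper that derives such names from the config keeps them consistent across runs. A sketch (`build_run_name` is an illustrative helper, not a W&B API):

```python
def build_run_name(model, config, keys=("lr", "bs", "epoch")):
    """Illustrative helper: build a descriptive run name such as
    'bert-base-lr0.001-bs32-epoch10' from a config dict."""
    parts = [model] + [f"{k}{config[k]}" for k in keys if k in config]
    return "-".join(parts)

name = build_run_name("bert-base", {"lr": 0.001, "bs": 32, "epoch": 10})
print(name)  # bert-base-lr0.001-bs32-epoch10
```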
### 4. Save Important Artifacts
```python
# Save the final model
artifact = wandb.Artifact('final-model', type='model')
artifact.add_file('model.pth')
wandb.log_artifact(artifact)

# Save predictions for analysis
predictions_table = wandb.Table(
    columns=["id", "input", "prediction", "ground_truth"],
    data=predictions_data
)
wandb.log({"predictions": predictions_table})
```
### 5. Use Offline Mode for Unstable Connections
```python
import os
import wandb

# Enable offline mode
os.environ["WANDB_MODE"] = "offline"
wandb.init(project="my-project")

# ... your training code ...
```

Sync later from the command line:

```bash
wandb sync <run_directory>
```
## Team Collaboration
### Share Runs
```python
# Runs are automatically shareable via URL
run = wandb.init(project="team-project")
print(f"Share this URL: {run.url}")
```
### Team Projects
- Create a team account at wandb.ai
- Add team members
- Set project visibility (private/public)
- Use team-level artifacts and the model registry
## Pricing
- Free: Unlimited public projects, 100GB storage
- Academic: Free for students/researchers
- Teams: $50/seat/month, private projects, unlimited storage
- Enterprise: Custom pricing, on-prem options
## Resources
- Documentation: https://docs.wandb.ai
- GitHub: https://github.com/wandb/wandb (10.5k+ stars)
- Examples: https://github.com/wandb/examples
- Community: https://wandb.ai/community
- Discord: https://wandb.me/discord
## See Also
- references/sweeps.md - Comprehensive hyperparameter optimization guide
- references/artifacts.md - Data and model versioning patterns
- references/integrations.md - Framework-specific examples