# TensorBoard: Visualization Toolkit for ML

## When to Use This Skill

Use TensorBoard when you need to:
  • Visualize training metrics like loss and accuracy over time
  • Debug models with histograms and distributions
  • Compare experiments across multiple runs
  • Visualize model graphs and architecture
  • Project embeddings to lower dimensions (t-SNE, PCA)
  • Track hyperparameter experiments
  • Profile performance and identify bottlenecks
  • Visualize images and text during training

Users: 20M+ downloads/year | GitHub Stars: 27k+ | License: Apache 2.0

## Installation

```bash
# Install TensorBoard
pip install tensorboard

# PyTorch integration
pip install torch torchvision tensorboard

# TensorFlow integration (TensorBoard included)
pip install tensorflow

# Launch TensorBoard
tensorboard --logdir=runs
```

## Quick Start

### PyTorch

```python
from torch.utils.tensorboard import SummaryWriter

# Create writer
writer = SummaryWriter('runs/experiment_1')

# Training loop
for epoch in range(10):
    train_loss = train_epoch()
    val_acc = validate()

    # Log metrics
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Accuracy/val', val_acc, epoch)

# Close writer
writer.close()
```

Launch: `tensorboard --logdir=runs`

### TensorFlow/Keras

```python
import tensorflow as tf

# Create callback
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs/fit',
    histogram_freq=1
)

# Train model
model.fit(
    x_train, y_train,
    epochs=10,
    validation_data=(x_val, y_val),
    callbacks=[tensorboard_callback]
)
```

Launch: `tensorboard --logdir=logs`

## Core Concepts

### 1. SummaryWriter (PyTorch)

```python
from torch.utils.tensorboard import SummaryWriter

# Default directory: runs/CURRENT_DATETIME
writer = SummaryWriter()

# Custom directory
writer = SummaryWriter('runs/experiment_1')

# Custom comment (appended to default directory)
writer = SummaryWriter(comment='baseline')

# Log data: add_scalar(tag, value, global_step)
writer.add_scalar('Loss/train', 0.5, 0)
writer.add_scalar('Loss/train', 0.3, 1)

# Flush and close
writer.flush()
writer.close()
```
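
The writer pattern above boils down to appending (tag, value, step) records to a stream, one series per tag. A minimal in-memory stand-in (pure Python, no TensorBoard dependency; `RecordingWriter` is a hypothetical name for illustration only) makes the call shape concrete:

```python
class RecordingWriter:
    """Toy stand-in for SummaryWriter: records (tag, value, step) tuples."""

    def __init__(self):
        self.records = []

    def add_scalar(self, tag, value, global_step):
        self.records.append((tag, value, global_step))

    def close(self):
        pass


writer = RecordingWriter()
writer.add_scalar('Loss/train', 0.5, 0)
writer.add_scalar('Loss/train', 0.3, 1)
writer.close()

# Each tag accumulates an ordered series of (step, value) points
series = [(s, v) for t, v, s in writer.records if t == 'Loss/train']
print(series)  # [(0, 0.5), (1, 0.3)]
```

The real `SummaryWriter` serializes these records into event files under the log directory; the dashboard then plots each tag's series over its steps.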

### 2. Logging Scalars

**PyTorch:**

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

for epoch in range(100):
    train_loss = train()
    val_loss = validate()

    # Log individual metrics
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/val', val_loss, epoch)
    writer.add_scalar('Accuracy/train', train_acc, epoch)
    writer.add_scalar('Accuracy/val', val_acc, epoch)

    # Learning rate
    lr = optimizer.param_groups[0]['lr']
    writer.add_scalar('Learning_rate', lr, epoch)

writer.close()
```

**TensorFlow:**

```python
import tensorflow as tf

train_summary_writer = tf.summary.create_file_writer('logs/train')
val_summary_writer = tf.summary.create_file_writer('logs/val')

for epoch in range(100):
    with train_summary_writer.as_default():
        tf.summary.scalar('loss', train_loss, step=epoch)
        tf.summary.scalar('accuracy', train_acc, step=epoch)

    with val_summary_writer.as_default():
        tf.summary.scalar('loss', val_loss, step=epoch)
        tf.summary.scalar('accuracy', val_acc, step=epoch)
```

### 3. Logging Multiple Scalars

```python
# PyTorch: Group related metrics on one chart
writer.add_scalars('Loss', {
    'train': train_loss,
    'validation': val_loss,
    'test': test_loss
}, epoch)

writer.add_scalars('Metrics', {
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'f1': f1_score
}, epoch)
```

### 4. Logging Images

**PyTorch:**

```python
import torch
from torchvision.utils import make_grid

# Single image
writer.add_image('Input/sample', img_tensor, epoch)

# Multiple images as a grid
img_grid = make_grid(images[:64], nrow=8)
writer.add_image('Batch/inputs', img_grid, epoch)

# Predictions visualization
pred_grid = make_grid(predictions[:16], nrow=4)
writer.add_image('Predictions', pred_grid, epoch)
```

**TensorFlow:**

```python
import tensorflow as tf

with file_writer.as_default():
    # Images are encoded as PNG
    tf.summary.image('Training samples', images, step=epoch, max_outputs=25)
```
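
As a rough aid for choosing `nrow`: `make_grid` lays `n` images into `ceil(n / nrow)` rows of up to `nrow` columns. A small sketch of that arithmetic (`grid_shape` is an illustrative helper, not a torchvision API; padding between tiles is ignored):

```python
import math


def grid_shape(n_images, nrow):
    """Rows and columns of the grid make_grid would lay out."""
    cols = min(n_images, nrow)
    rows = math.ceil(n_images / nrow)
    return rows, cols


print(grid_shape(64, 8))  # 64 images, 8 per row -> (8, 8)
print(grid_shape(16, 4))  # -> (4, 4)
print(grid_shape(10, 4))  # last row is partial -> (3, 4)
```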

### 5. Logging Histograms

**PyTorch: Track weight distributions**

```python
for name, param in model.named_parameters():
    writer.add_histogram(name, param, epoch)

    # Track gradients
    if param.grad is not None:
        writer.add_histogram(f'{name}.grad', param.grad, epoch)

# Track activations
writer.add_histogram('Activations/relu1', activations, epoch)
```

**TensorFlow:**

```python
with file_writer.as_default():
    tf.summary.histogram('weights/layer1', layer1.kernel, step=epoch)
    tf.summary.histogram('activations/relu1', activations, step=epoch)
```
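
Conceptually, `add_histogram` buckets the tensor's values into bins at each step and plots how the distribution shifts over training. The bucketing itself is just this (a pure-Python sketch with equal-width bins; TensorBoard's actual binning scheme differs in detail):

```python
def bucket(values, n_bins=4):
    """Count values into n_bins equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against all-equal values
    counts = [0] * n_bins
    for v in values:
        # Clamp the top edge into the last bin
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts


weights = [0.0, 0.1, 0.2, 0.5, 0.9, 1.0]
print(bucket(weights))  # [3, 0, 1, 2]
```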

### 6. Logging Model Graph

**PyTorch:**

```python
import torch

model = MyModel()
dummy_input = torch.randn(1, 3, 224, 224)

writer.add_graph(model, dummy_input)
writer.close()
```

**TensorFlow (automatic with Keras):**

```python
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs',
    write_graph=True
)
model.fit(x, y, callbacks=[tensorboard_callback])
```

## Advanced Features

### Embedding Projector

Visualize high-dimensional data (embeddings, features) in 2D/3D.

```python
import torch
from torch.utils.tensorboard import SummaryWriter

# Get embeddings (e.g., word embeddings, image features)
embeddings = model.get_embeddings(data)  # Shape: (N, embedding_dim)

# Metadata (labels for each point)
metadata = ['class_1', 'class_2', 'class_1', ...]

# Images (optional, for image embeddings)
label_images = torch.stack([img1, img2, img3, ...])

# Log to TensorBoard
writer.add_embedding(
    embeddings,
    metadata=metadata,
    label_img=label_images,
    global_step=epoch
)
```

**In TensorBoard:**
- Navigate to the "Projector" tab
- Choose PCA, t-SNE, or UMAP visualization
- Search, filter, and explore clusters

### Hyperparameter Tuning

```python
from torch.utils.tensorboard import SummaryWriter

# Try different hyperparameters
for lr in [0.001, 0.01, 0.1]:
    for batch_size in [16, 32, 64]:
        # Create unique run directory
        writer = SummaryWriter(f'runs/lr{lr}_bs{batch_size}')

        # Train and log
        for epoch in range(10):
            loss = train(lr, batch_size)
            writer.add_scalar('Loss/train', loss, epoch)

        # Log hyperparameters alongside the final metrics
        writer.add_hparams(
            {'lr': lr, 'batch_size': batch_size},
            {'hparam/accuracy': final_acc, 'hparam/loss': final_loss}
        )

        writer.close()
```

Compare in TensorBoard's "HParams" tab.
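
The grid above produces one run directory per (lr, batch_size) combination, which is what lets the HParams tab compare them. A quick sketch of the naming (illustrative helper, not part of TensorBoard):

```python
def hparam_run_dirs(lrs, batch_sizes):
    """One run directory per (lr, batch_size) combination."""
    return [f'runs/lr{lr}_bs{bs}' for lr in lrs for bs in batch_sizes]


dirs = hparam_run_dirs([0.001, 0.01, 0.1], [16, 32, 64])
print(len(dirs))   # 9 runs in the 3x3 grid
print(dirs[0])     # runs/lr0.001_bs16
print(dirs[-1])    # runs/lr0.1_bs64
```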

### Text Logging

```python
# PyTorch: Log text (e.g., model predictions, summaries)
writer.add_text('Predictions', f'Epoch {epoch}: {predictions}', epoch)
writer.add_text('Config', str(config), 0)

# Log Markdown tables
markdown_table = """
| Metric   | Value |
|----------|-------|
| Accuracy | 0.95  |
| F1 Score | 0.93  |
"""
writer.add_text('Results', markdown_table, epoch)
```
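
Since `add_text` renders Markdown, a small helper can format a metrics dict into a table string before logging (a hypothetical helper for illustration, not a TensorBoard API):

```python
def metrics_table(metrics):
    """Render a {name: value} dict as a Markdown table for add_text."""
    lines = ['| Metric | Value |', '|--------|-------|']
    for name, value in metrics.items():
        lines.append(f'| {name} | {value:.2f} |')
    return '\n'.join(lines)


table = metrics_table({'Accuracy': 0.95, 'F1 Score': 0.93})
print(table)
```

Passing the result to `writer.add_text('Results', table, epoch)` would then render as a real table in the Text dashboard.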

### PR Curves

Precision-recall curves for classification.

```python
from torch.utils.tensorboard import SummaryWriter

# Get predictions and labels
predictions = model(test_data)  # Shape: (N, num_classes)
labels = test_labels            # Shape: (N,)

# Log a PR curve for each class
for i in range(num_classes):
    writer.add_pr_curve(
        f'PR_curve/class_{i}',
        labels == i,
        predictions[:, i],
        global_step=epoch
    )
```
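
Behind the curve, each score threshold yields one (precision, recall) point; `add_pr_curve` sweeps many thresholds and plots the resulting trade-off. The arithmetic at a single threshold, as a pure-Python sketch:

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall for binary labels at one score threshold."""
    tp = sum(1 for l, s in zip(labels, scores) if l and s >= threshold)
    fp = sum(1 for l, s in zip(labels, scores) if not l and s >= threshold)
    fn = sum(1 for l, s in zip(labels, scores) if l and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(precision_recall(labels, scores, 0.5))   # (0.5, 0.5)
print(precision_recall(labels, scores, 0.95))  # (1.0, 0.0): nothing predicted positive
```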

## Integration Examples

### PyTorch Training Loop

```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
from torchvision.utils import make_grid

# Setup
writer = SummaryWriter('runs/resnet_experiment')
model = ResNet50()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Log model graph
dummy_input = torch.randn(1, 3, 224, 224)
writer.add_graph(model, dummy_input)

# Training loop
for epoch in range(50):
    model.train()
    train_loss = 0.0
    train_correct = 0

    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        pred = output.argmax(dim=1)
        train_correct += pred.eq(target).sum().item()

        # Log batch metrics (every 100 batches)
        if batch_idx % 100 == 0:
            global_step = epoch * len(train_loader) + batch_idx
            writer.add_scalar('Loss/train_batch', loss.item(), global_step)

    # Epoch metrics
    train_loss /= len(train_loader)
    train_acc = train_correct / len(train_loader.dataset)

    # Validation
    model.eval()
    val_loss = 0.0
    val_correct = 0

    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            val_loss += criterion(output, target).item()
            pred = output.argmax(dim=1)
            val_correct += pred.eq(target).sum().item()

    val_loss /= len(val_loader)
    val_acc = val_correct / len(val_loader.dataset)

    # Log epoch metrics
    writer.add_scalars('Loss', {'train': train_loss, 'val': val_loss}, epoch)
    writer.add_scalars('Accuracy', {'train': train_acc, 'val': val_acc}, epoch)

    # Log learning rate
    writer.add_scalar('Learning_rate', optimizer.param_groups[0]['lr'], epoch)

    # Log histograms (every 5 epochs)
    if epoch % 5 == 0:
        for name, param in model.named_parameters():
            writer.add_histogram(name, param, epoch)

    # Log sample predictions (every 10 epochs)
    if epoch % 10 == 0:
        sample_images = data[:8]
        writer.add_image('Sample_inputs', make_grid(sample_images), epoch)

writer.close()
```
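
The `global_step` used for batch-level logging in the loop above is just the batch index flattened across epochs, so steps keep increasing across epoch boundaries. Isolated as a standalone function (illustrative):

```python
def global_step(epoch, batch_idx, batches_per_epoch):
    """Flatten (epoch, batch) into a single monotonically increasing step."""
    return epoch * batches_per_epoch + batch_idx


# With 500 batches per epoch:
print(global_step(0, 0, 500))    # 0
print(global_step(0, 499, 500))  # 499
print(global_step(1, 0, 500))    # 500, continuing where epoch 0 ended
```

Without this flattening, each epoch would restart at step 0 and the batch-loss chart would fold back on itself.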

### TensorFlow/Keras Training

```python
import tensorflow as tf

# Define model
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# TensorBoard callback
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs/fit',
    histogram_freq=1,         # Log histograms every epoch
    write_graph=True,         # Visualize model graph
    write_images=True,        # Visualize weights as images
    update_freq='epoch',      # Log metrics every epoch
    profile_batch='500,520',  # Profile batches 500-520
    embeddings_freq=1         # Log embeddings every epoch
)

# Train
model.fit(
    x_train, y_train,
    epochs=10,
    validation_data=(x_val, y_val),
    callbacks=[tensorboard_callback]
)
```

## Comparing Experiments

### Multiple Runs

```bash
# Run experiments with different configs
python train.py --lr 0.001 --logdir runs/exp1
python train.py --lr 0.01 --logdir runs/exp2
python train.py --lr 0.1 --logdir runs/exp3

# View all runs together
tensorboard --logdir=runs
```

**In TensorBoard:**
- All runs appear in the same dashboard
- Toggle runs on/off for comparison
- Use regex to filter run names
- Overlay charts to compare metrics

### Organizing Experiments

```python
# Hierarchical organization:
#
# runs/
# ├── baseline/
# │   ├── run_1/
# │   └── run_2/
# ├── improved/
# │   ├── run_1/
# │   └── run_2/
# └── final/
#     └── run_1/

# Log with hierarchy
writer = SummaryWriter('runs/baseline/run_1')
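
This nesting works because TensorBoard walks everything under `--logdir` and treats each subdirectory containing event files as a separate run named by its relative path. A sketch of that discovery over a mock tree (stdlib only; the real scan also checks for event files rather than taking every leaf directory):

```python
import os
import tempfile

# Build a mock runs/ hierarchy in a temp directory
root = tempfile.mkdtemp()
for run in ['baseline/run_1', 'baseline/run_2', 'improved/run_1']:
    os.makedirs(os.path.join(root, 'runs', run))

logdir = os.path.join(root, 'runs')
runs = sorted(
    os.path.relpath(dirpath, logdir)
    for dirpath, dirnames, _ in os.walk(logdir)
    if not dirnames  # leaf directories correspond to individual runs
)
print(runs)  # ['baseline/run_1', 'baseline/run_2', 'improved/run_1']
```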

## Best Practices

### 1. Use Descriptive Run Names

```python
# ✅ Good: Descriptive names
from datetime import datetime

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
writer = SummaryWriter(f'runs/resnet50_lr0.001_bs32_{timestamp}')

# ❌ Bad: Auto-generated names
writer = SummaryWriter()  # Creates runs/Jan01_12-34-56_hostname
```
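
The naming convention above can be wrapped in a tiny helper so every experiment follows the same pattern (a hypothetical helper, not a TensorBoard API; the `now` parameter exists only to make the output reproducible here):

```python
from datetime import datetime


def run_name(model, lr, batch_size, now=None):
    """Build a descriptive, timestamped run directory name."""
    now = now or datetime.now()
    return f"runs/{model}_lr{lr}_bs{batch_size}_{now.strftime('%Y%m%d_%H%M%S')}"


fixed = datetime(2024, 1, 1, 12, 0, 0)
print(run_name('resnet50', 0.001, 32, now=fixed))
# runs/resnet50_lr0.001_bs32_20240101_120000
```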

### 2. Group Related Metrics

```python
# ✅ Good: Grouped metrics (each prefix gets its own dashboard section)
writer.add_scalar('Loss/train', train_loss, step)
writer.add_scalar('Loss/val', val_loss, step)
writer.add_scalar('Accuracy/train', train_acc, step)
writer.add_scalar('Accuracy/val', val_acc, step)

# ❌ Bad: Flat namespace
writer.add_scalar('train_loss', train_loss, step)
writer.add_scalar('val_loss', val_loss, step)
```

### 3. Log Regularly but Not Too Often

```python
# ✅ Good: Always log epoch metrics; log batch metrics occasionally
for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(data, target)

        # Log every 100 batches
        if batch_idx % 100 == 0:
            writer.add_scalar('Loss/batch', loss, global_step)

    # Always log epoch metrics
    writer.add_scalar('Loss/epoch', epoch_loss, epoch)

# ❌ Bad: Log every batch (creates huge log files)
for batch in train_loader:
    writer.add_scalar('Loss', loss, step)  # Too frequent
```
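
The difference in log volume is easy to quantify. Counting logged scalar points for 100 epochs of 500 batches (a quick sketch):

```python
def logged_points(n_epochs, batches_per_epoch, every=100):
    """Scalar points written with batch logging gated to every `every` batches."""
    batch_logs = sum(1 for e in range(n_epochs)
                     for b in range(batches_per_epoch) if b % every == 0)
    epoch_logs = n_epochs  # epoch metrics are always logged
    return batch_logs + epoch_logs


print(logged_points(100, 500))           # 600 points with the gate
print(logged_points(100, 500, every=1))  # 50100 points logging every batch
```

Roughly two orders of magnitude fewer points, with the epoch-level curves intact.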

### 4. Close Writer When Done

```python
# ✅ Good: Use a context manager (closes automatically)
with SummaryWriter('runs/exp1') as writer:
    for epoch in range(10):
        writer.add_scalar('Loss', loss, epoch)

# Or close manually
writer = SummaryWriter('runs/exp1')
# ... logging ...
writer.close()
```

### 5. Use Separate Writers for Train/Val

```python
# ✅ Good: Separate log directories
train_writer = SummaryWriter('runs/exp1/train')
val_writer = SummaryWriter('runs/exp1/val')

train_writer.add_scalar('loss', train_loss, epoch)
val_writer.add_scalar('loss', val_loss, epoch)
```

## Performance Profiling

### TensorFlow Profiler

```python
# Enable profiling
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs',
    profile_batch='10,20'  # Profile batches 10-20
)
model.fit(x, y, callbacks=[tensorboard_callback])
```

View in the TensorBoard "Profile" tab. Shows GPU utilization, kernel stats, memory usage, and bottlenecks.

### PyTorch Profiler

```python
import torch.profiler as profiler

with profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA
    ],
    on_trace_ready=profiler.tensorboard_trace_handler('./runs/profiler'),
    record_shapes=True,
    with_stack=True
) as prof:
    for batch in train_loader:
        loss = train_step(batch)
        prof.step()
```

View in the TensorBoard "Profile" tab.

## Resources

### See Also

  • references/visualization.md - Comprehensive visualization guide
  • references/profiling.md - Performance profiling patterns
  • references/integrations.md - Framework-specific integration examples