
Lambda Labs GPU Cloud

Comprehensive guide to running ML workloads on Lambda Labs GPU cloud with on-demand instances and 1-Click Clusters.

When to use Lambda Labs

Use Lambda Labs when:
  • Need dedicated GPU instances with full SSH access
  • Running long training jobs (hours to days)
  • Want simple pricing with no egress fees
  • Need persistent storage across sessions
  • Require high-performance multi-node clusters (16-512 GPUs)
  • Want pre-installed ML stack (Lambda Stack with PyTorch, CUDA, NCCL)
Key features:
  • GPU variety: B200, H100, GH200, A100, A10, A6000, V100
  • Lambda Stack: Pre-installed PyTorch, TensorFlow, CUDA, cuDNN, NCCL
  • Persistent filesystems: Keep data across instance restarts
  • 1-Click Clusters: 16-512 GPU Slurm clusters with InfiniBand
  • Simple pricing: Pay-per-minute, no egress fees
  • Global regions: 12+ regions worldwide
Use alternatives instead:
  • Modal: For serverless, auto-scaling workloads
  • SkyPilot: For multi-cloud orchestration and cost optimization
  • RunPod: For cheaper spot instances and serverless endpoints
  • Vast.ai: For GPU marketplace with lowest prices

Quick start

Account setup

  1. Create account at https://lambda.ai
  2. Add payment method
  3. Generate API key from dashboard
  4. Add SSH key (required before launching instances)

Launch via console

  1. Go to https://cloud.lambda.ai/instances
  2. Click "Launch instance"
  3. Select GPU type and region
  4. Choose SSH key
  5. Optionally attach filesystem
  6. Launch and wait 3-15 minutes
Connect via SSH

```bash
# Get instance IP from console
ssh ubuntu@<INSTANCE-IP>

# Or with specific key
ssh -i ~/.ssh/lambda_key ubuntu@<INSTANCE-IP>
```


GPU instances

Available GPUs

| GPU | VRAM | Price/GPU/hr | Best For |
| --- | --- | --- | --- |
| B200 SXM6 | 180 GB | $4.99 | Largest models, fastest training |
| H100 SXM | 80 GB | $2.99-3.29 | Large model training |
| H100 PCIe | 80 GB | $2.49 | Cost-effective H100 |
| GH200 | 96 GB | $1.49 | Single-GPU large models |
| A100 80GB | 80 GB | $1.79 | Production training |
| A100 40GB | 40 GB | $1.29 | Standard training |
| A10 | 24 GB | $0.75 | Inference, fine-tuning |
| A6000 | 48 GB | $0.80 | Good VRAM/price ratio |
| V100 | 16 GB | $0.55 | Budget training |

Instance configurations

8x GPU: Best for distributed training (DDP, FSDP)
4x GPU: Large models, multi-GPU training
2x GPU: Medium workloads
1x GPU: Fine-tuning, inference, development

Launch times

  • Single-GPU: 3-5 minutes
  • Multi-GPU: 10-15 minutes
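Because launches can take several minutes, scripts that provision via the API usually poll until the instance reports active before trying to SSH. A minimal polling sketch; the injected `get_status` callable stands in for a real API call (an assumption, not the SDK's interface):

```python
import time

def wait_until_active(get_status, timeout_s=1200, poll_s=30):
    """Poll get_status() until it returns 'active'; give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_status() == "active":
            return True
        time.sleep(poll_s)
    return False

# Demo with a stub that becomes active on the third poll
states = iter(["booting", "booting", "active"])
print(wait_until_active(lambda: next(states), poll_s=0))  # True
```

In real use, `get_status` would wrap the instance-status field returned by the API's list-instances call, and `poll_s` should stay coarse (say 30s) to avoid hammering the endpoint.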

Lambda Stack

All instances come with Lambda Stack pre-installed:

Included software

  • Ubuntu 22.04 LTS
  • NVIDIA drivers (latest)
  • CUDA 12.x
  • cuDNN 8.x
  • NCCL (for multi-GPU)
  • PyTorch (latest)
  • TensorFlow (latest)
  • JAX
  • JupyterLab
Verify installation

```bash
# Check GPU
nvidia-smi

# Check PyTorch
python -c "import torch; print(torch.cuda.is_available())"

# Check CUDA version
nvcc --version
```

Python API

Installation

```bash
pip install lambda-cloud-client
```

Authentication

Configure with API key

```python
import os
import lambda_cloud_client

configuration = lambda_cloud_client.Configuration(
    host="https://cloud.lambdalabs.com/api/v1",
    access_token=os.environ["LAMBDA_API_KEY"]
)
```
List available instances

```python
with lambda_cloud_client.ApiClient(configuration) as api_client:
    api = lambda_cloud_client.DefaultApi(api_client)

    # Get available instance types
    types = api.instance_types()
    for name, info in types.data.items():
        print(f"{name}: {info.instance_type.description}")
```
Launch instance

```python
from lambda_cloud_client.models import LaunchInstanceRequest

request = LaunchInstanceRequest(
    region_name="us-west-1",
    instance_type_name="gpu_1x_h100_sxm5",
    ssh_key_names=["my-ssh-key"],
    file_system_names=["my-filesystem"],  # Optional
    name="training-job"
)

response = api.launch_instance(request)
instance_id = response.data.instance_ids[0]
print(f"Launched: {instance_id}")
```
List running instances

```python
instances = api.list_instances()
for instance in instances.data:
    print(f"{instance.name}: {instance.ip} ({instance.status})")
```
Terminate instance

```python
from lambda_cloud_client.models import TerminateInstanceRequest

request = TerminateInstanceRequest(
    instance_ids=[instance_id]
)
api.terminate_instance(request)
```

SSH key management

Add SSH key

```python
from lambda_cloud_client.models import AddSshKeyRequest

request = AddSshKeyRequest(
    name="my-key",
    public_key="ssh-rsa AAAA..."
)
api.add_ssh_key(request)
```

List keys

```python
keys = api.list_ssh_keys()
```

Delete key

```python
api.delete_ssh_key(key_id)
```
CLI with curl

List instance types

```bash
curl -u $LAMBDA_API_KEY: \
  https://cloud.lambdalabs.com/api/v1/instance-types | jq
```

Launch instance

```bash
curl -u $LAMBDA_API_KEY: \
  -X POST https://cloud.lambdalabs.com/api/v1/instance-operations/launch \
  -H "Content-Type: application/json" \
  -d '{
    "region_name": "us-west-1",
    "instance_type_name": "gpu_1x_h100_sxm5",
    "ssh_key_names": ["my-key"]
  }' | jq
```

Terminate instance

```bash
curl -u $LAMBDA_API_KEY: \
  -X POST https://cloud.lambdalabs.com/api/v1/instance-operations/terminate \
  -H "Content-Type: application/json" \
  -d '{"instance_ids": ["<INSTANCE-ID>"]}' | jq
```
Persistent storage

Filesystems

Filesystems persist data across instance restarts:

Mount location

```
/lambda/nfs/<FILESYSTEM_NAME>
```

Example: save checkpoints

```bash
python train.py --checkpoint-dir /lambda/nfs/my-storage/checkpoints
```
Create filesystem

  1. Go to Storage in Lambda console
  2. Click "Create filesystem"
  3. Select region (must match instance region)
  4. Name and create

Attach to instance

Filesystems must be attached at instance launch time:
  • Via console: Select filesystem when launching
  • Via API: Include file_system_names in the launch request

Best practices

Store on filesystem (persists)

```
/lambda/nfs/storage/
├── datasets/
├── checkpoints/
├── models/
└── outputs/
```

Local SSD (faster, ephemeral)

```
/home/ubuntu/
└── working/  # Temporary files
```

SSH configuration

Add SSH key

```bash
# Generate key locally
ssh-keygen -t ed25519 -f ~/.ssh/lambda_key

# Add public key to Lambda console
# Or via API
```

Multiple keys

```bash
# On instance, add more keys
echo 'ssh-rsa AAAA...' >> ~/.ssh/authorized_keys
```

Import from GitHub

```bash
# On instance
ssh-import-id gh:username
```

SSH tunneling

```bash
# Forward Jupyter
ssh -L 8888:localhost:8888 ubuntu@<IP>

# Forward TensorBoard
ssh -L 6006:localhost:6006 ubuntu@<IP>

# Multiple ports
ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>
```

JupyterLab

Launch from console

  1. Go to Instances page
  2. Click "Launch" in Cloud IDE column
  3. JupyterLab opens in browser
Manual access

```bash
# On instance
jupyter lab --ip=0.0.0.0 --port=8888

# From local machine with tunnel
ssh -L 8888:localhost:8888 ubuntu@<IP>
```

Training workflows

Single-GPU training

```bash
# SSH to instance
ssh ubuntu@<IP>

# Clone repo
git clone https://github.com/user/project
cd project

# Install dependencies
pip install -r requirements.txt

# Train
python train.py --epochs 100 --checkpoint-dir /lambda/nfs/storage/checkpoints
```

Multi-GPU training (single node)

train_ddp.py

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()

    model = MyModel().to(device)
    model = DDP(model, device_ids=[device])

    # Training loop...

if __name__ == "__main__":
    main()
```

Launch with torchrun (8 GPUs)

```bash
torchrun --nproc_per_node=8 train_ddp.py
```

Checkpoint to filesystem

```python
import os

checkpoint_dir = "/lambda/nfs/my-storage/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Save checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, f"{checkpoint_dir}/checkpoint_{epoch}.pt")
```
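To resume an interrupted run, pick the newest checkpoint on the filesystem. A stdlib-only sketch that matches the `checkpoint_<epoch>.pt` naming used above; in real training the selected path would be handed to `torch.load` (the demo files below are empty stand-ins):

```python
import os
import re
import tempfile

def latest_checkpoint(checkpoint_dir):
    """Return the checkpoint path with the highest epoch number, or None."""
    best, best_epoch = None, -1
    for name in os.listdir(checkpoint_dir):
        m = re.fullmatch(r"checkpoint_(\d+)\.pt", name)
        if m and int(m.group(1)) > best_epoch:
            best_epoch = int(m.group(1))
            best = os.path.join(checkpoint_dir, name)
    return best

# Demo with stand-in files (real ones come from torch.save above)
with tempfile.TemporaryDirectory() as d:
    for epoch in (1, 5, 12):
        open(os.path.join(d, f"checkpoint_{epoch}.pt"), "w").close()
    print(latest_checkpoint(d))  # ends with checkpoint_12.pt
```

Numeric comparison matters here: lexicographic sorting would rank `checkpoint_9.pt` above `checkpoint_12.pt`.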

1-Click Clusters

Overview

High-performance Slurm clusters with:
  • 16-512 NVIDIA H100 or B200 GPUs
  • NVIDIA Quantum-2 400 Gb/s InfiniBand
  • GPUDirect RDMA at 3200 Gb/s
  • Pre-installed distributed ML stack

Included software

  • Ubuntu 22.04 LTS + Lambda Stack
  • NCCL, Open MPI
  • PyTorch with DDP and FSDP
  • TensorFlow
  • OFED drivers

Storage

  • 24 TB NVMe per compute node (ephemeral)
  • Lambda filesystems for persistent data
Multi-node training

On Slurm cluster

```bash
srun --nodes=4 --ntasks-per-node=8 --gpus-per-node=8 \
  torchrun --nnodes=4 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
  train.py
```

Networking

Bandwidth

  • Inter-instance (same region): up to 200 Gbps
  • Internet outbound: 20 Gbps max

Firewall

  • Default: Only port 22 (SSH) open
  • Configure additional ports in Lambda console
  • ICMP traffic allowed by default
Private IPs

```bash
# Find private IP
ip addr show | grep 'inet '
```

Common workflows

Workflow 1: Fine-tuning LLM

```bash
# 1. Launch 8x H100 instance with filesystem

# 2. SSH and setup
ssh ubuntu@<IP>
pip install transformers accelerate peft

# 3. Download model to filesystem
python -c "
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
model.save_pretrained('/lambda/nfs/storage/models/llama-2-7b')
"

# 4. Fine-tune with checkpoints on filesystem
accelerate launch --num_processes 8 train.py \
  --model_path /lambda/nfs/storage/models/llama-2-7b \
  --output_dir /lambda/nfs/storage/outputs \
  --checkpoint_dir /lambda/nfs/storage/checkpoints
```

Workflow 2: Batch inference

```bash
# 1. Launch A10 instance (cost-effective for inference)

# 2. Run inference
python inference.py \
  --model /lambda/nfs/storage/models/fine-tuned \
  --input /lambda/nfs/storage/data/inputs.jsonl \
  --output /lambda/nfs/storage/data/outputs.jsonl
```
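The inference step above implies a JSONL-in, JSONL-out contract. A minimal skeleton of such an inference.py, with a stub `predict` function standing in for real model code (the function and field names are illustrative assumptions):

```python
import json
import os
import tempfile

def run_batch(in_path, out_path, predict):
    """Read one JSON object per line, attach the model output, write one per line."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            record["output"] = predict(record["input"])
            fout.write(json.dumps(record) + "\n")

# Demo with a stub model
with tempfile.TemporaryDirectory() as d:
    inp = os.path.join(d, "inputs.jsonl")
    outp = os.path.join(d, "outputs.jsonl")
    with open(inp, "w") as f:
        f.write(json.dumps({"input": "hello"}) + "\n")
    run_batch(inp, outp, predict=str.upper)
    print(open(outp).read())  # {"input": "hello", "output": "HELLO"}
```

Streaming line by line keeps memory flat regardless of dataset size, which suits the large files typically parked on the shared filesystem.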

Cost optimization

Choose right GPU

| Task | Recommended GPU |
| --- | --- |
| LLM fine-tuning (7B) | A100 40GB |
| LLM fine-tuning (70B) | 8x H100 |
| Inference | A10, A6000 |
| Development | V100, A10 |
| Maximum performance | B200 |

Reduce costs

监控使用情况

  1. Use filesystems: Avoid re-downloading data
  2. Checkpoint frequently: Resume interrupted training
  3. Right-size: Don't over-provision GPUs
  4. Terminate idle: No auto-stop, manually terminate
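Since there is no auto-stop, it pays to estimate a job's cost before launching. Pricing is simply price per GPU-hour times GPU count times duration (per-minute billing prorates partial hours); a tiny helper, using a rate from the pricing table above:

```python
def job_cost(price_per_gpu_hr, num_gpus, hours):
    """Total on-demand cost in dollars for a single instance run."""
    return price_per_gpu_hr * num_gpus * hours

# 24 h on an 8x H100 SXM instance at $3.29/GPU/hr
print(f"${job_cost(3.29, 8, 24):.2f}")  # $631.68
```

Running the numbers like this before a multi-day job also makes the "right-size" advice concrete: the same 24 h on 8x A100 40GB would be $1.29 x 8 x 24 instead.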

Monitor usage

  • Dashboard shows real-time GPU utilization
  • API for programmatic monitoring
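One programmatic use is tallying the hourly burn of whatever is still running, built on the list-instances call from the Python API section. A sketch; the stub records below only stand in for a live client's response, and the field names are assumptions patterned on that section:

```python
def running_cost_per_hour(instances, prices):
    """Sum hourly cost of active instances; prices maps instance type -> $/hr."""
    return sum(prices[i["instance_type"]]
               for i in instances if i["status"] == "active")

# Stub data standing in for api.list_instances() output (assumed shape)
instances = [
    {"name": "train", "instance_type": "gpu_8x_h100_sxm5", "status": "active"},
    {"name": "dev", "instance_type": "gpu_1x_a10", "status": "terminated"},
]
prices = {"gpu_8x_h100_sxm5": 8 * 3.29, "gpu_1x_a10": 0.75}
print(running_cost_per_hour(instances, prices))  # 26.32
```

Run on a schedule, a check like this catches the forgotten idle instance that per-minute billing would otherwise keep charging.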

Common issues

| Issue | Solution |
| --- | --- |
| Instance won't launch | Check region availability, try different GPU |
| SSH connection refused | Wait for instance to initialize (3-15 min) |
| Data lost after terminate | Use persistent filesystems |
| Slow data transfer | Use filesystem in same region |
| GPU not detected | Reboot instance, check drivers |

References


  • Advanced Usage - Multi-node training, API automation
  • Troubleshooting - Common issues and solutions

Resources