# Lambda Labs GPU Cloud

Comprehensive guide to running ML workloads on Lambda Labs GPU cloud with on-demand instances and 1-Click Clusters.

## When to use Lambda Labs

Use Lambda Labs when:
- Need dedicated GPU instances with full SSH access
- Running long training jobs (hours to days)
- Want simple pricing with no egress fees
- Need persistent storage across sessions
- Require high-performance multi-node clusters (16-512 GPUs)
- Want pre-installed ML stack (Lambda Stack with PyTorch, CUDA, NCCL)
Key features:
- GPU variety: B200, H100, GH200, A100, A10, A6000, V100
- Lambda Stack: Pre-installed PyTorch, TensorFlow, CUDA, cuDNN, NCCL
- Persistent filesystems: Keep data across instance restarts
- 1-Click Clusters: 16-512 GPU Slurm clusters with InfiniBand
- Simple pricing: Pay-per-minute, no egress fees
- Global regions: 12+ regions worldwide
Use alternatives instead:
- Modal: For serverless, auto-scaling workloads
- SkyPilot: For multi-cloud orchestration and cost optimization
- RunPod: For cheaper spot instances and serverless endpoints
- Vast.ai: For GPU marketplace with lowest prices

## Quick start

### Account setup

- Create account at https://lambda.ai
- Add payment method
- Generate API key from dashboard
- Add SSH key (required before launching instances)

### Launch via console

- Go to https://cloud.lambda.ai/instances
- Click "Launch instance"
- Select GPU type and region
- Choose SSH key
- Optionally attach filesystem
- Launch and wait 3-15 minutes

### Connect via SSH

```bash
# Get instance IP from console
ssh ubuntu@<INSTANCE-IP>

# Or with specific key
ssh -i ~/.ssh/lambda_key ubuntu@<INSTANCE-IP>
```

## GPU instances

### Available GPUs

| GPU | VRAM | Price/GPU/hr | Best For |
|---|---|---|---|
| B200 SXM6 | 180 GB | $4.99 | Largest models, fastest training |
| H100 SXM | 80 GB | $2.99-3.29 | Large model training |
| H100 PCIe | 80 GB | $2.49 | Cost-effective H100 |
| GH200 | 96 GB | $1.49 | Single-GPU large models |
| A100 80GB | 80 GB | $1.79 | Production training |
| A100 40GB | 40 GB | $1.29 | Standard training |
| A10 | 24 GB | $0.75 | Inference, fine-tuning |
| A6000 | 48 GB | $0.80 | Good VRAM/price ratio |
| V100 | 16 GB | $0.55 | Budget training |
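
A rough way to read the VRAM column against model size (a back-of-the-envelope sketch, not official Lambda sizing guidance): full fine-tuning with Adam in mixed precision needs on the order of 16 bytes per parameter for weights, gradients, and optimizer state, before activations.

```python
# Rule-of-thumb VRAM estimate for full fine-tuning with Adam in mixed precision:
# ~2 bytes (bf16 weights) + 2 (grads) + 12 (fp32 master weights + Adam m, v) per parameter.
# Activations are excluded, so treat the result as a lower bound.
def training_vram_gb(params_billions: float, bytes_per_param: int = 16) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

for size in (7, 70):
    print(f"{size}B model: ~{training_vram_gb(size):.0f} GB before activations")
# 7B  -> ~112 GB: multi-GPU, or memory savers such as LoRA/ZeRO on a single A100/H100
# 70B -> ~1120 GB: a multi-GPU node such as 8x H100
```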

### Instance configurations

- 8x GPU: Best for distributed training (DDP, FSDP)
- 4x GPU: Large models, multi-GPU training
- 2x GPU: Medium workloads
- 1x GPU: Fine-tuning, inference, development

### Launch times

- Single-GPU: 3-5 minutes
- Multi-GPU: 10-15 minutes

## Lambda Stack

All instances come with Lambda Stack pre-installed:

### Included software

- Ubuntu 22.04 LTS
- NVIDIA drivers (latest)
- CUDA 12.x
- cuDNN 8.x
- NCCL (for multi-GPU)
- PyTorch (latest)
- TensorFlow (latest)
- JAX
- JupyterLab

### Verify installation

```bash
# Check GPU
nvidia-smi

# Check PyTorch
python -c "import torch; print(torch.cuda.is_available())"

# Check CUDA version
nvcc --version
```
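
For a more detailed look at what Lambda Stack installed, a short Python sketch (standard PyTorch calls only) prints the framework, CUDA, and GPU details:

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
print("GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  [{i}] {torch.cuda.get_device_name(i)}")
```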

## Python API

### Installation

```bash
pip install lambda-cloud-client
```

### Authentication

```python
import os
import lambda_cloud_client

# Configure with API key
configuration = lambda_cloud_client.Configuration(
    host="https://cloud.lambdalabs.com/api/v1",
    access_token=os.environ["LAMBDA_API_KEY"]
)
```

### List available instances

```python
with lambda_cloud_client.ApiClient(configuration) as api_client:
    api = lambda_cloud_client.DefaultApi(api_client)

    # Get available instance types
    types = api.instance_types()
    for name, info in types.data.items():
        print(f"{name}: {info.instance_type.description}")
```

### Launch instance

```python
from lambda_cloud_client.models import LaunchInstanceRequest

request = LaunchInstanceRequest(
    region_name="us-west-1",
    instance_type_name="gpu_1x_h100_sxm5",
    ssh_key_names=["my-ssh-key"],
    file_system_names=["my-filesystem"],  # Optional
    name="training-job"
)
response = api.launch_instance(request)
instance_id = response.data.instance_ids[0]
print(f"Launched: {instance_id}")
```
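
`launch_instance` returns before the machine is reachable over SSH. A small polling loop can wait for it; this is a sketch built on the `list_instances` call shown in the next subsection, and the `id` field and the `"active"` status string are assumptions to confirm against your own API responses:

```python
import time

def wait_until_active(api, instance_id, poll_seconds=30):
    """Poll the API until the launched instance reports an active status."""
    while True:
        for instance in api.list_instances().data:
            # `instance.id` and the "active" status value are assumptions;
            # inspect one list_instances() response to confirm the field names.
            if instance.id == instance_id and instance.status == "active":
                return instance
        time.sleep(poll_seconds)

instance = wait_until_active(api, instance_id)
print(f"Ready at {instance.ip}")
```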

### List running instances

```python
instances = api.list_instances()
for instance in instances.data:
    print(f"{instance.name}: {instance.ip} ({instance.status})")
```

### Terminate instance

```python
from lambda_cloud_client.models import TerminateInstanceRequest

request = TerminateInstanceRequest(
    instance_ids=[instance_id]
)
api.terminate_instance(request)
```

### SSH key management

```python
from lambda_cloud_client.models import AddSshKeyRequest

# Add SSH key
request = AddSshKeyRequest(
    name="my-key",
    public_key="ssh-rsa AAAA..."
)
api.add_ssh_key(request)

# List keys
keys = api.list_ssh_keys()

# Delete key
api.delete_ssh_key(key_id)
```
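
If the key pair already exists locally (see the `ssh-keygen` command under SSH configuration below), a small sketch can read the public half and register it with the same `AddSshKeyRequest` call; the `~/.ssh/lambda_key.pub` path assumes the key name used later in this guide:

```python
from pathlib import Path
from lambda_cloud_client.models import AddSshKeyRequest

# Read the public key generated with: ssh-keygen -t ed25519 -f ~/.ssh/lambda_key
public_key = (Path.home() / ".ssh" / "lambda_key.pub").read_text().strip()
api.add_ssh_key(AddSshKeyRequest(name="my-key", public_key=public_key))
```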

## CLI with curl

### List instance types

```bash
curl -u $LAMBDA_API_KEY: \
  https://cloud.lambdalabs.com/api/v1/instance-types | jq
```

### Launch instance

```bash
curl -u $LAMBDA_API_KEY: \
  -X POST https://cloud.lambdalabs.com/api/v1/instance-operations/launch \
  -H "Content-Type: application/json" \
  -d '{
    "region_name": "us-west-1",
    "instance_type_name": "gpu_1x_h100_sxm5",
    "ssh_key_names": ["my-key"]
  }' | jq
```

### Terminate instance

```bash
curl -u $LAMBDA_API_KEY: \
  -X POST https://cloud.lambdalabs.com/api/v1/instance-operations/terminate \
  -H "Content-Type: application/json" \
  -d '{"instance_ids": ["<INSTANCE-ID>"]}' | jq
```
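
The same endpoints can also be called from Python without the client library. A minimal sketch with `requests`, mirroring the `curl -u $LAMBDA_API_KEY:` basic-auth scheme above and assuming the JSON mirrors the `data` field used by the Python client:

```python
import os
import requests

API = "https://cloud.lambdalabs.com/api/v1"
auth = (os.environ["LAMBDA_API_KEY"], "")  # equivalent to `curl -u $LAMBDA_API_KEY:`

# List instance types
types = requests.get(f"{API}/instance-types", auth=auth).json()
print(list(types["data"].keys()))

# Launch an instance (same payload as the curl example above)
payload = {
    "region_name": "us-west-1",
    "instance_type_name": "gpu_1x_h100_sxm5",
    "ssh_key_names": ["my-key"],
}
print(requests.post(f"{API}/instance-operations/launch", auth=auth, json=payload).json())
```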

## Persistent storage

### Filesystems

Filesystems persist data across instance restarts:

```bash
# Mount location
/lambda/nfs/<FILESYSTEM_NAME>

# Example: save checkpoints
python train.py --checkpoint-dir /lambda/nfs/my-storage/checkpoints
```

### Create filesystem

- Go to Storage in Lambda console
- Click "Create filesystem"
- Select region (must match instance region)
- Name and create

### Attach to instance

Filesystems must be attached at instance launch time:
- Via console: Select filesystem when launching
- Via API: Include `file_system_names` in the launch request

### Best practices

```bash
# Store on filesystem (persists)
/lambda/nfs/storage/
├── datasets/
├── checkpoints/
├── models/
└── outputs/

# Local SSD (faster, ephemeral)
/home/ubuntu/
└── working/  # Temporary files
```
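
Before a long run, it is worth checking that the filesystem is actually mounted and creating the layout above. A short sketch; the filesystem name `my-storage` is just an example:

```python
import os

FS = "/lambda/nfs/my-storage"  # replace with your filesystem name

# Fail fast if the filesystem was not attached at launch time
if not os.path.isdir(FS):
    raise RuntimeError(f"{FS} is not mounted; attach the filesystem when launching the instance")

# Create the directory layout from the tree above
for sub in ("datasets", "checkpoints", "models", "outputs"):
    os.makedirs(os.path.join(FS, sub), exist_ok=True)
```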

## SSH configuration

### Add SSH key

```bash
# Generate key locally
ssh-keygen -t ed25519 -f ~/.ssh/lambda_key

# Add public key to Lambda console
# Or via API
```

### Multiple keys

```bash
# On instance, add more keys
echo 'ssh-rsa AAAA...' >> ~/.ssh/authorized_keys
```

### Import from GitHub

```bash
# On instance
ssh-import-id gh:username
```

### SSH tunneling

```bash
# Forward Jupyter
ssh -L 8888:localhost:8888 ubuntu@<IP>

# Forward TensorBoard
ssh -L 6006:localhost:6006 ubuntu@<IP>

# Multiple ports
ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>
```

## JupyterLab

### Launch from console

- Go to Instances page
- Click "Launch" in Cloud IDE column
- JupyterLab opens in browser

### Manual access

```bash
# On instance
jupyter lab --ip=0.0.0.0 --port=8888

# From local machine with tunnel
ssh -L 8888:localhost:8888 ubuntu@<IP>
```

## Training workflows

### Single-GPU training

```bash
# SSH to instance
ssh ubuntu@<IP>

# Clone repo
git clone https://github.com/user/project
cd project

# Install dependencies
pip install -r requirements.txt

# Train
python train.py --epochs 100 --checkpoint-dir /lambda/nfs/storage/checkpoints
```

### Multi-GPU training (single node)

```python
# train_ddp.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()
    model = MyModel().to(device)  # MyModel: your model class
    model = DDP(model, device_ids=[device])
    # Training loop...

if __name__ == "__main__":
    main()
```

```bash
# Launch with torchrun (8 GPUs)
torchrun --nproc_per_node=8 train_ddp.py
```

### Checkpoint to filesystem

```python
import os

checkpoint_dir = "/lambda/nfs/my-storage/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Save checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, f"{checkpoint_dir}/checkpoint_{epoch}.pt")
```
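
Since an instance can be terminated mid-run, resuming from the newest checkpoint on the filesystem is the natural counterpart. A sketch continuing the block above, assuming the `checkpoint_<epoch>.pt` naming it uses:

```python
import glob

# Find the newest checkpoint by epoch number
ckpts = sorted(
    glob.glob(f"{checkpoint_dir}/checkpoint_*.pt"),
    key=lambda p: int(p.rsplit("_", 1)[1].removesuffix(".pt")),
)
start_epoch = 0
if ckpts:
    state = torch.load(ckpts[-1], map_location="cpu")
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    start_epoch = state["epoch"] + 1
```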

## 1-Click Clusters

### Overview

High-performance Slurm clusters with:
- 16-512 NVIDIA H100 or B200 GPUs
- NVIDIA Quantum-2 400 Gb/s InfiniBand
- GPUDirect RDMA at 3200 Gb/s
- Pre-installed distributed ML stack

### Included software

- Ubuntu 22.04 LTS + Lambda Stack
- NCCL, Open MPI
- PyTorch with DDP and FSDP
- TensorFlow
- OFED drivers

### Storage

- 24 TB NVMe per compute node (ephemeral)
- Lambda filesystems for persistent data

### Multi-node training

```bash
# On Slurm cluster
srun --nodes=4 --ntasks-per-node=8 --gpus-per-node=8 \
  torchrun --nnodes=4 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    train.py
```
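
Before a long multi-node run, a quick rendezvous check helps confirm NCCL and the interconnect are wired up. A sketch to launch with the same `srun`/`torchrun` command, swapping `train.py` for this file (the filename is illustrative):

```python
# check_dist.py - print rank/world size and run one all_reduce as a smoke test
import socket

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
world = dist.get_world_size()
device = rank % torch.cuda.device_count()
torch.cuda.set_device(device)

x = torch.ones(1, device=device)
dist.all_reduce(x)  # sums across all ranks; expect x == world size
print(f"{socket.gethostname()} rank {rank}/{world} local GPU {device}: all_reduce -> {x.item()}")

dist.destroy_process_group()
```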

## Networking

### Bandwidth

- Inter-instance (same region): up to 200 Gbps
- Internet outbound: 20 Gbps max

### Firewall

- Default: Only port 22 (SSH) open
- Configure additional ports in Lambda console
- ICMP traffic allowed by default

### Private IPs

```bash
# Find private IP
ip addr show | grep 'inet '
```

## Common workflows

### Workflow 1: Fine-tuning an LLM

```bash
# 1. Launch 8x H100 instance with filesystem

# 2. SSH and setup
ssh ubuntu@<IP>
pip install transformers accelerate peft

# 3. Download model to filesystem
python -c "
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
model.save_pretrained('/lambda/nfs/storage/models/llama-2-7b')
"

# 4. Fine-tune with checkpoints on filesystem
accelerate launch --num_processes 8 train.py \
  --model_path /lambda/nfs/storage/models/llama-2-7b \
  --output_dir /lambda/nfs/storage/outputs \
  --checkpoint_dir /lambda/nfs/storage/checkpoints
```

### Workflow 2: Batch inference

```bash
# 1. Launch A10 instance (cost-effective for inference)

# 2. Run inference
python inference.py \
  --model /lambda/nfs/storage/models/fine-tuned \
  --input /lambda/nfs/storage/data/inputs.jsonl \
  --output /lambda/nfs/storage/data/outputs.jsonl
```
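
The `inference.py` above is not something Lambda provides. A minimal sketch of what it might look like with the Hugging Face `pipeline` API, assuming each input line is JSON with a `prompt` field:

```python
# inference.py - minimal batch-inference sketch (hypothetical script)
import argparse
import json

from transformers import pipeline

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--input", required=True)
parser.add_argument("--output", required=True)
args = parser.parse_args()

pipe = pipeline("text-generation", model=args.model, device=0)

with open(args.input) as fin, open(args.output, "w") as fout:
    for line in fin:
        prompt = json.loads(line)["prompt"]
        text = pipe(prompt, max_new_tokens=128)[0]["generated_text"]
        fout.write(json.dumps({"prompt": prompt, "output": text}) + "\n")
```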

## Cost optimization

### Choose the right GPU

| Task | Recommended GPU |
|---|---|
| LLM fine-tuning (7B) | A100 40GB |
| LLM fine-tuning (70B) | 8x H100 |
| Inference | A10, A6000 |
| Development | V100, A10 |
| Maximum performance | B200 |

### Reduce costs

- Use filesystems: Avoid re-downloading data
- Checkpoint frequently: Resume interrupted training
- Right-size: Don't over-provision GPUs
- Terminate idle instances: There is no auto-stop, so terminate manually (see the cleanup sketch below)
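
A cleanup sketch that terminates everything still running on the account, built from the Python API calls shown earlier; run it only when nothing should stay up, and note that the `id` field is an assumption to confirm against your `list_instances()` output:

```python
import os

import lambda_cloud_client
from lambda_cloud_client.models import TerminateInstanceRequest

configuration = lambda_cloud_client.Configuration(
    host="https://cloud.lambdalabs.com/api/v1",
    access_token=os.environ["LAMBDA_API_KEY"],
)

with lambda_cloud_client.ApiClient(configuration) as api_client:
    api = lambda_cloud_client.DefaultApi(api_client)
    ids = [instance.id for instance in api.list_instances().data]  # `id` field assumed
    if ids:
        api.terminate_instance(TerminateInstanceRequest(instance_ids=ids))
        print(f"Terminated: {ids}")
    else:
        print("No running instances")
```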

### Monitor usage

- Dashboard shows real-time GPU utilization
- API for programmatic monitoring

## Common issues

| Issue | Solution |
|---|---|
| Instance won't launch | Check region availability, try different GPU |
| SSH connection refused | Wait for instance to initialize (3-15 min) |
| Data lost after terminate | Use persistent filesystems |
| Slow data transfer | Use filesystem in same region |
| GPU not detected | Reboot instance, check drivers |

## References

- Advanced Usage - Multi-node training, API automation
- Troubleshooting - Common issues and solutions

## Resources

- Documentation: https://docs.lambda.ai
- Console: https://cloud.lambda.ai
- Pricing: https://lambda.ai/instances
- Support: https://support.lambdalabs.com
- Blog: https://lambda.ai/blog