compute-management

OCI Compute Management - Expert Knowledge

🏗️ Use OCI Landing Zone Terraform Modules

Don't reinvent the wheel. Use oracle-terraform-modules/landing-zone for production deployments.
Landing Zone solves:
  • ❌ Bad Practice #5: Internet-wide open ports (0.0.0.0/0 on 22/3389)
  • ❌ Bad Practice #9: Public compute instances (Security Zones enforce private IPs)
  • ❌ Bad Practice #10: No monitoring (auto-configures alarms and notifications)
This skill provides: Anti-patterns and troubleshooting for compute resources deployed WITHIN a Landing Zone architecture.

⚠️ OCI CLI/API Knowledge Gap

You don't know OCI CLI commands or OCI API structure.
Your training data has limited and outdated knowledge of:
  • OCI CLI syntax and parameters (updates monthly)
  • OCI API endpoints and request/response formats
  • Compute service CLI operations (oci compute instance)
  • OCI service-specific commands and flags
  • Latest OCI features and API changes
When OCI operations are needed:
  1. Use exact CLI commands from this skill's references
  2. Do NOT guess OCI CLI syntax or parameters
  3. Do NOT assume API endpoint structures
  4. Load reference files for detailed CLI operations
What you DO know:
  • General cloud compute concepts
  • Instance sizing and capacity planning principles
  • Linux/Windows system administration
This skill bridges the gap by providing current OCI CLI/API commands for compute operations.

You are an OCI compute expert. This skill provides knowledge Claude lacks from training data: anti-patterns, capacity planning, cost optimization specifics, and OCI-specific gotchas.

NEVER Do This

NEVER launch instances without checking service limits first
bash
oci limits resource-availability get \
  --service-name compute \
  --limit-name "standard-e4-core-count" \
  --compartment-id <ocid> \
  --availability-domain <ad>
87% of "out of capacity" errors are actually quota limits, not infrastructure capacity. Check limits BEFORE launching to get accurate error messages.
NEVER use console serial connection as primary access
  • Creates security audit findings (bypasses SSH key controls)
  • Use only for boot troubleshooting when SSH fails
  • Delete connection immediately after troubleshooting
NEVER mix regional and AD-specific resources in templates
  • Breaks portability when moving between regions
  • Use AD-agnostic designs: spread via fault domains, not hardcoded ADs
NEVER use default security lists in production
  • Default allows SSH (TCP 22) ingress from 0.0.0.0/0 and all egress to 0.0.0.0/0
  • Fails security audits, creates compliance violations
  • Always create custom security lists or NSGs
NEVER forget the --preserve-boot-volume setting when terminating dev/test instances
bash
# When terminating test instances, add:
oci compute instance terminate --instance-id <id> --preserve-boot-volume false
# If the boot volume is preserved instead: $50+/month per deleted instance (orphaned boot volumes)
NEVER enable public IP on production instances
  • Use bastion service or private endpoints for access
  • Cost impact: $500-5000+ per security incident from exposed instances
  • Landing Zone Security Zones automatically block this pattern
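
The security-list rules above (no default lists, no 0.0.0.0/0 on ports 22/3389) can be checked mechanically before a deployment. The following is a minimal sketch, not an OCI API call: it assumes ingress rules have already been exported into plain dictionaries, and the keys `source`, `port_min`, and `port_max` are hypothetical names chosen for this example.

```python
from typing import Dict, List

# CIDRs that mean "the whole internet" (IPv4 and IPv6).
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}
# Administrative ports called out in this skill: SSH and RDP.
ADMIN_PORTS = {22, 3389}


def find_open_admin_rules(ingress_rules: List[Dict]) -> List[Dict]:
    """Return ingress rules that expose SSH/RDP (or every port) to the internet.

    Each rule is a dict with hypothetical keys: 'source', 'port_min', 'port_max'.
    A rule without a port range is treated as covering all ports.
    """
    flagged = []
    for rule in ingress_rules:
        if rule.get("source") not in OPEN_CIDRS:
            continue
        port_min = rule.get("port_min")
        port_max = rule.get("port_max")
        if port_min is None or port_max is None:
            flagged.append(rule)
        elif any(port_min <= port <= port_max for port in ADMIN_PORTS):
            flagged.append(rule)
    return flagged


rules = [
    {"source": "0.0.0.0/0", "port_min": 22, "port_max": 22},    # flagged
    {"source": "10.0.0.0/16", "port_min": 22, "port_max": 22},  # private source, ok
    {"source": "0.0.0.0/0", "port_min": 443, "port_max": 443},  # public HTTPS, ok
]
print(find_open_admin_rules(rules))
```

A check like this fits in a CI step that runs against the planned rule set, so the violation is caught before a security audit finds it.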

Capacity Error Decision Tree

"Out of host capacity for shape X"?
├─ Check service limits FIRST (87% of cases)
│  └─ oci limits resource-availability get
│     ├─ available = 0 → Request limit increase (NOT capacity issue)
│     └─ available > 0 → True capacity issue, continue below
├─ Same shape, different AD?
│  └─ Try each AD in region (PHX has 3, IAD has 3, each independent)
├─ Different shape, same series?
│  ├─ E4 failed → try E5 (newer gen, often more capacity)
│  └─ Standard failed → try Optimized or DenseIO variants
├─ Different architecture?
│  └─ AMD → ARM (A1.Flex often has capacity when Intel/AMD full)
└─ All ADs exhausted?
   └─ Create capacity reservation (guarantees future launches)
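
The decision tree above can be sketched as a small pure function. This is an illustration under stated assumptions, not an official tool: it assumes the CLI prints the limits response as JSON shaped like `{"data": {"available": N, ...}}`, and the function names and returned action strings are hypothetical.

```python
import json
from typing import List


def available_from_cli_output(raw_json: str) -> int:
    """Extract 'available' from `oci limits resource-availability get` output.

    Assumption: the CLI wraps the response as {"data": {"available": N, ...}}.
    """
    return int(json.loads(raw_json)["data"]["available"])


def next_step(available: int, untried_ads: List[str], untried_shapes: List[str]) -> str:
    """Walk the capacity-error decision tree and return the next action."""
    if available <= 0:
        # Quota is exhausted: this is a limit problem, not host capacity.
        return "request-limit-increase"
    if untried_ads:
        # Same shape, different availability domain.
        return f"retry-in-ad:{untried_ads[0]}"
    if untried_shapes:
        # Different shape or architecture (e.g. E4 -> E5, AMD -> A1.Flex).
        return f"retry-with-shape:{untried_shapes[0]}"
    # Every AD and alternative shape has been tried.
    return "create-capacity-reservation"


sample = '{"data": {"available": 0, "used": 8}}'  # hypothetical CLI output
print(next_step(available_from_cli_output(sample), ["AD-2", "AD-3"], []))
```

Checking the limit first keeps the order of the tree: a zero `available` value short-circuits everything else, because retrying in another AD cannot fix a quota problem.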

Shape Selection: Cost vs Performance

Budget-Critical (save 50%):
  • VM.Standard.A1.Flex (ARM) if app supports: $0.01/OCPU/hr vs $0.03 (AMD)
  • Caveat: Not all software runs on ARM, test thoroughly
General Purpose (balanced):
  • VM.Standard.E4.Flex: 2:16 CPU:RAM ratio, $0.03/OCPU/hr
  • Start: 2 OCPUs, scale based on metrics (not guesses)
Memory-Intensive (databases, caches):
  • VM.Standard.E4.Flex with custom ratio: up to 1:64 CPU:RAM
  • Cost: $0.03/OCPU + $0.0015/GB RAM
Cost Trap: Fixed shapes (e.g., VM.Standard2.1) often MORE expensive than Flex with same resources. Always compare Flex pricing first.

Instance Principal Authentication (Production)

When an instance needs to call OCI APIs (Object Storage, Vault, etc.):

**WRONG** (user credentials on instance):
```bash
# Don't do this - credential management nightmare
export OCI_USER_OCID="ocid1.user..."
```

**RIGHT** (instance principal):
```bash
# 1. Create dynamic group (--description is a required parameter)
oci iam dynamic-group create \
  --name "app-instances" \
  --description "Application instances" \
  --matching-rule "instance.compartment.id = '<compartment-ocid>'"

# 2. Grant permissions (policy statement)
# "Allow dynamic-group app-instances to read object-family in compartment X"
```

```python
# 3. Code uses instance principal (no credentials needed):
import oci

signer = oci.auth.signers.InstancePrincipalsSecurityTokenSigner()
client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
```

Benefits: No credential rotation, no secrets to manage, automatic token refresh.
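
The matching rule and policy statement from steps 1 and 2 are plain strings, so they can be generated instead of hand-typed. A minimal sketch follows; the helper names and the placeholder OCID are hypothetical, and the output only mirrors the two strings shown in this skill.

```python
def matching_rule(compartment_ocid: str) -> str:
    """Build the dynamic-group matching rule for all instances in a compartment."""
    return f"instance.compartment.id = '{compartment_ocid}'"


def read_objects_policy(dynamic_group: str, compartment_name: str) -> str:
    """Build the policy statement that lets the dynamic group read Object Storage."""
    return (
        f"Allow dynamic-group {dynamic_group} "
        f"to read object-family in compartment {compartment_name}"
    )


# Placeholder OCID for illustration only; substitute the real compartment OCID.
print(matching_rule("ocid1.compartment.oc1..example"))
print(read_objects_policy("app-instances", "X"))
```

Generating both strings from the same inputs keeps the dynamic group name in the policy in sync with the one passed to the CLI.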

OCI-Specific Gotchas

Availability Domain Names Are Tenant-Specific
  • Your AD: "fMgC:US-ASHBURN-AD-1"
  • Another tenant: "ErKW:US-ASHBURN-AD-1"
  • MUST query your tenant: oci iam availability-domain list
Boot Volume Backups Don't Include Instance Config
  • Backup captures disk only, NOT shape/networking/metadata
  • For DR: Use custom images (captures everything) or Terraform for infrastructure
Instance Metadata Service Has 3 Versions
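
For the availability domain gotcha above, templates can carry only the AD number and resolve the tenant-specific name at run time. The sketch below assumes the AD names have already been fetched for the tenancy (for example from `oci iam availability-domain list`); `resolve_ad` is a hypothetical helper, not part of the OCI SDK.

```python
from typing import List


def resolve_ad(ad_number: int, tenant_ad_names: List[str]) -> str:
    """Map an AD number (1, 2, 3) to this tenancy's full AD name.

    tenant_ad_names must come from a query against your own tenancy,
    because the prefix before ':' differs between tenancies.
    """
    suffix = f"-AD-{ad_number}"
    matches = [name for name in tenant_ad_names if name.endswith(suffix)]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one AD ending in {suffix}, found {matches}")
    return matches[0]


# Example names in the format shown in this skill.
my_ads = ["fMgC:US-ASHBURN-AD-1", "fMgC:US-ASHBURN-AD-2", "fMgC:US-ASHBURN-AD-3"]
print(resolve_ad(2, my_ads))
```

Raising on zero or multiple matches is deliberate: a silently wrong AD name fails later with a less obvious error.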

Quick Cost Reference

| Shape Family | $/OCPU/hr | $/GB RAM/hr | Best For |
|---|---|---|---|
| A1.Flex (ARM) | $0.01 | $0.0015 | Cost-critical, ARM-compatible |
| E4.Flex (AMD) | $0.03 | $0.0015 | General purpose |
| E5.Flex (AMD) | $0.035 | $0.0015 | Latest gen, premium perf |
| Optimized3.Flex | $0.025 | $0.0015 | Network-intensive |

Free Tier: 2x AMD VM (1/8 OCPU, 1GB) + 4 ARM cores (24GB total) - always free
Calculation: (OCPUs × $0.03 + GB × $0.0015) × 730 hours/month
Example: 2 OCPU, 16GB = (2×$0.03 + 16×$0.0015) × 730 = $61.32/month
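
The calculation above, written as a small function for comparing shapes. This is a sketch that uses the rates quoted in this skill's table as defaults; actual prices should be confirmed against current Oracle pricing, and `monthly_cost` is a hypothetical helper name.

```python
HOURS_PER_MONTH = 730


def monthly_cost(ocpus: float, memory_gb: float,
                 ocpu_rate: float = 0.03, memory_rate: float = 0.0015) -> float:
    """Monthly cost in USD: (OCPUs x rate + GB x rate) x 730 hours."""
    hourly = ocpus * ocpu_rate + memory_gb * memory_rate
    return round(hourly * HOURS_PER_MONTH, 2)


# E4.Flex example from this skill: 2 OCPU, 16 GB.
print(monthly_cost(2, 16))  # 61.32
# Same size on A1.Flex, using the $0.01/OCPU/hr rate from the table.
print(monthly_cost(2, 16, ocpu_rate=0.01))
```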

Progressive Loading References

OCI Compute Shapes Reference (Official Oracle Documentation)

WHEN TO LOAD oci-compute-shapes-reference.md:
  • Need detailed specifications for specific shapes (memory limits, OCPU counts, network bandwidth)
  • Comparing flexible shapes (VM.Standard3.Flex vs E4.Flex vs E5.Flex vs E6.Flex vs A1/A2/A4.Flex)
  • Understanding extended memory VM instances
  • Researching bare metal shapes (BM.Standard3, BM.Standard.E4/E5/E6, BM.Standard.A1/A4)
  • Checking GPU shapes, Dense I/O shapes, or HPC-optimized shapes
  • Need official Oracle specifications for shape families
Do NOT load for:
  • Quick cost comparisons (use Quick Cost Reference table in this skill)
  • "Out of capacity" troubleshooting (decision tree in this skill covers it)
  • Shape selection guidance (anti-patterns and recommendations in this skill)

When to Use This Skill

  • Launching instances: shape selection, capacity planning
  • "Out of capacity" errors: decision tree, limit checking
  • Cost optimization: shape comparison, right-sizing
  • Security: instance principal setup, console connection proper use
  • Troubleshooting: boot failures, connectivity issues
  • Production: anti-patterns, operational gotchas