monitoring-operations

OCI Monitoring and Observability - Expert Knowledge

🏗️ Use OCI Landing Zone Terraform Modules

Don't reinvent the wheel. Use oracle-terraform-modules/landing-zone for the observability stack.
Landing Zone solves:
  • ❌ Bad Practice #10: No logging, monitoring, notifications (Landing Zone deploys complete observability)
  • ❌ Bad Practice #7: Limited security services (Landing Zone integrates Cloud Guard, VSS, OSMS)
This skill provides: Metrics, alarms, and troubleshooting for monitoring deployed WITHIN a Landing Zone.

⚠️ OCI CLI/API Knowledge Gap

You don't know OCI CLI commands or OCI API structure.
Your training data has limited and outdated knowledge of:
  • OCI CLI syntax and parameters (updates monthly)
  • OCI API endpoints and request/response formats
  • Monitoring service CLI operations (oci monitoring alarm, oci monitoring metric)
  • Metric namespaces and MQL (Monitoring Query Language)
  • Latest Logging and Service Connector features
When OCI operations are needed:
  1. Use exact CLI commands from this skill's references
  2. Do NOT guess metric namespace names
  3. Do NOT assume AWS CloudWatch patterns work in OCI
  4. Load reference files for detailed MQL documentation
What you DO know:
  • General observability concepts
  • Alerting and threshold design principles
  • Log aggregation patterns
This skill bridges the gap by providing current OCI-specific monitoring patterns and gotchas.

NEVER Do This

NEVER assume metrics are instant (10-15 minute lag)
  • Metrics published every 1-5 minutes
  • Processing delay: 5-10 minutes
  • Total lag: 10-15 minutes from event to visible metric
  • Don't debug "missing metrics" within first 15 minutes of resource creation
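The lag rule above can be turned into a quick guard that runs before any "missing metrics" investigation. This is a minimal sketch, assuming the 15-minute worst case quoted above; the function name and parameters are hypothetical and are not part of any OCI SDK.

```python
from datetime import datetime, timedelta, timezone

# Assumption: worst-case lag from event to visible metric, taken from the figures above.
MAX_METRIC_LAG = timedelta(minutes=15)


def too_early_to_debug(created_at: datetime, now: datetime,
                       max_lag: timedelta = MAX_METRIC_LAG) -> bool:
    """Return True while missing metrics can still be explained by normal lag."""
    return now - created_at < max_lag


# Hypothetical resource creation time, used only for illustration.
created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(too_early_to_debug(created, created + timedelta(minutes=8)))   # True
print(too_early_to_debug(created, created + timedelta(minutes=20)))  # False
```

If the guard returns True, waiting is likely cheaper than debugging.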
NEVER use = for alarm thresholds with sparse metrics

WRONG - alarm never fires if metric has gaps

MetricName[1m].mean() = 0

RIGHT - handle missing data

MetricName[1m]{dataMissing=zero}.mean() > 0

❌ NEVER forget metric dimensions (causes "no data")

WRONG - missing required dimension

CPUUtilization[1m].mean()

RIGHT - include resourceId dimension

CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()
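One way to avoid dropping the dimension is to build query strings through a helper that refuses to emit an unfiltered query. The sketch below only assembles text in the shape used by the examples above; `build_mql` is a hypothetical helper, not an OCI SDK function, and it does not validate metric or dimension names.

```python
def build_mql(metric: str, interval: str, statistic: str,
              dimensions: dict[str, str]) -> str:
    """Assemble a query of the form Metric[interval]{key="value"}.statistic()."""
    if not dimensions:
        # Mirrors the rule above: a query without dimensions is treated as a mistake.
        raise ValueError("refusing to build a query without dimensions")
    dims = ", ".join(f'{key}="{value}"' for key, value in dimensions.items())
    return f"{metric}[{interval}]{{{dims}}}.{statistic}()"


query = build_mql("CPUUtilization", "1m", "mean",
                  {"resourceId": "<instance-ocid>"})
print(query)  # CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()
```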

❌ NEVER set alarm thresholds without trigger delay (alert fatigue)

BAD - fires on every CPU spike

CPUUtilization[1m].mean() > 80

BETTER - sustained high CPU

CPUUtilization[5m].mean() > 80
Trigger delay: 5 minutes (fires only after the breach has persisted for 5 minutes)
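To see why a trigger delay suppresses one-off spikes, the behavior can be modeled locally as "fire only after N consecutive breaching evaluations". This is a simplified sketch, not the OCI alarm engine; `alarm_fires` and the sample series are made up for illustration, with one datapoint per evaluation.

```python
def alarm_fires(values, threshold, trigger_delay):
    """Return the index of the evaluation at which the alarm fires, or None.

    values: one aggregated datapoint per evaluation (for example, per minute).
    trigger_delay: number of consecutive breaching evaluations required.
    """
    streak = 0
    for index, value in enumerate(values):
        # A non-breaching datapoint resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= trigger_delay:
            return index
    return None


spike = [20, 95, 30, 25, 90, 20, 15]       # short, isolated spikes
sustained = [20, 85, 88, 91, 90, 87, 60]   # five breaching datapoints in a row

print(alarm_fires(spike, 80, 1))      # 1    (no delay: fires on the first spike)
print(alarm_fires(spike, 80, 5))      # None (spikes never last long enough)
print(alarm_fires(sustained, 80, 5))  # 5    (fires once the breach is sustained)
```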

❌ NEVER create alarms without notification channels

WRONG - alarm fires but nobody knows

oci monitoring alarm create ... --destinations '[]'

RIGHT - always link to notification topic

oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'
Cost impact: Undetected outages cost $5,000-50,000/hour in production
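The empty-destinations mistake can be caught before the CLI is ever invoked. The following sketch checks the JSON string intended for --destinations; `parse_destinations` is a hypothetical helper, and the OCID in the example is a placeholder rather than a real topic.

```python
import json


def parse_destinations(raw: str) -> list[str]:
    """Parse the JSON meant for --destinations and reject an empty list."""
    destinations = json.loads(raw)
    if not isinstance(destinations, list) or not destinations:
        raise ValueError("alarm needs at least one notification topic OCID")
    # Assumption: every destination is an OCID string, and OCIDs start with "ocid1.".
    if not all(isinstance(d, str) and d.startswith("ocid1.") for d in destinations):
        raise ValueError("every destination must be an OCID string")
    return destinations


# Placeholder OCID, for illustration only.
print(parse_destinations('["ocid1.onstopic.oc1..example"]'))

try:
    parse_destinations("[]")
except ValueError as err:
    print(f"rejected: {err}")
```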

❌ NEVER ignore Cloud Guard findings (security audit failure)
  • Cloud Guard detects misconfigurations BEFORE they become incidents
  • Integrate Cloud Guard → Notifications → Email/Slack/PagerDuty
  • Cost impact: $100,000+ per security breach vs $0 for proactive remediation

Metric Namespace Gotchas

OCI Metrics Use Service-Specific Namespaces:

| Service | Namespace | Example Metrics |
|---|---|---|
| Compute | oci_computeagent | CPUUtilization, MemoryUtilization |
| Autonomous DB | oci_autonomous_database | CpuUtilization, StorageUtilization |
| Load Balancer | oci_lbaas | HttpRequests, UnHealthyBackendServers |
| Object Storage | oci_objectstorage | ObjectCount, BytesUploaded |

Common Mistake: Using wrong namespace (oci_compute vs oci_computeagent)
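A lookup table makes "do not guess the namespace" enforceable in tooling. The sketch below contains only the four namespaces listed above and raises instead of guessing for anything else; the helper names are hypothetical.

```python
# Only the namespaces listed in this section; anything else must come from the reference.
NAMESPACES = {
    "Compute": "oci_computeagent",
    "Autonomous DB": "oci_autonomous_database",
    "Load Balancer": "oci_lbaas",
    "Object Storage": "oci_objectstorage",
}


def namespace_for(service: str) -> str:
    """Return the known namespace for a service, or raise rather than guess."""
    try:
        return NAMESPACES[service]
    except KeyError:
        raise LookupError(
            f"no namespace on record for {service!r}; check the reference"
        ) from None


def is_known_namespace(namespace: str) -> bool:
    """Return True only for namespaces listed in this section."""
    return namespace in NAMESPACES.values()


print(namespace_for("Compute"))           # oci_computeagent
print(is_known_namespace("oci_compute"))  # False (the common mistake)
```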

Alarm Missing Data Handling

| Setting | Behavior | Use When |
|---|---|---|
| treatMissingDataAsBreaching | Alarm fires if no data | Critical services (outage = breach) |
| treatMissingDataAsNotBreaching | Alarm silent if no data | Optional monitoring |
| {dataMissing=zero} | Treat missing as 0 | Counters (requests/sec) |
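The three behaviors can be compared on the same gappy series with a small local model. This sketch reuses the setting names from the table as plain strings and represents a missing datapoint as None; it illustrates the described behavior only and is not how OCI evaluates alarms internally.

```python
from typing import Optional, Sequence

POLICIES = (
    "treatMissingDataAsBreaching",
    "treatMissingDataAsNotBreaching",
    "dataMissing=zero",
)


def breaches(points: Sequence[Optional[float]], threshold: float,
             policy: str) -> list[bool]:
    """Evaluate `value > threshold` per interval under a missing-data policy."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    results = []
    for value in points:
        if value is None:
            if policy == "treatMissingDataAsBreaching":
                results.append(True)
            elif policy == "treatMissingDataAsNotBreaching":
                results.append(False)
            else:
                # dataMissing=zero: substitute 0, then apply the normal comparison.
                results.append(0.0 > threshold)
        else:
            results.append(value > threshold)
    return results


points = [12.0, None, 3.0]  # the middle interval has no data
for policy in POLICIES:
    print(policy, breaches(points, 5.0, policy))
```

With a threshold of 5, only the first policy reports a breach for the gap, which is why it suits services where silence itself signals an outage.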

Log Collection Common Gaps

Problem: Logs not showing in Log Analytics
Logs not appearing?
├─ Is logging enabled on the resource?
│  ├─ Compute: oci-compute-agent must be running
│  └─ Function: Logging enabled in function config
├─ Is Service Connector configured?
│  ├─ Source: Log Group → Target: Log Analytics
│  └─ Check: Service Connector status = ACTIVE
├─ IAM policy for Service Connector?
│  ├─ "Allow any-user to use log-content in tenancy"
│  └─ "Allow service loganalytics to READ logcontent in tenancy"
└─ 10-15 minute ingestion lag?
   └─ Wait before debugging
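The decision tree above reads naturally as an ordered checklist that stops at the first failing step. In this sketch the state fields are hypothetical booleans that a caller would fill in by hand or from their own tooling; nothing here queries OCI.

```python
from dataclasses import dataclass


@dataclass
class LogPipelineState:
    logging_enabled: bool        # logging turned on for the resource
    connector_active: bool       # Service Connector exists and status is ACTIVE
    iam_policies_present: bool   # policies for the Service Connector are in place
    minutes_since_setup: int     # time elapsed since the pipeline was configured


def first_problem(state: LogPipelineState) -> str:
    """Walk the tree top to bottom and return the first gap found."""
    if not state.logging_enabled:
        return "Enable logging on the resource"
    if not state.connector_active:
        return "Configure the Service Connector and confirm status ACTIVE"
    if not state.iam_policies_present:
        return "Add the IAM policies for the Service Connector"
    if state.minutes_since_setup < 15:
        return "Wait for the 10-15 minute ingestion lag"
    return "No known gap; investigate further"


print(first_problem(LogPipelineState(True, False, True, 30)))
# Configure the Service Connector and confirm status ACTIVE
print(first_problem(LogPipelineState(True, True, True, 5)))
# Wait for the 10-15 minute ingestion lag
```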

Metric Query Optimization

Expensive (slow):

Queries ALL instances

CPUUtilization[1m].mean()

Optimized (filter by dimension):

Query specific instance

CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()

Cost: Queries are free, but rate limited (1000 req/min)
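Scripts that poll metrics in a loop can stay under the rate limit with a client-side sliding-window limiter. This is a generic sketch, assuming the 1000 requests per minute figure quoted above; the class is hypothetical, and the injectable clock exists only to make the behavior easy to test.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests=1000, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent = deque()  # timestamps of requests still inside the window

    def try_acquire(self) -> bool:
        """Record a request and return True, or return False if over the limit."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_requests:
            return False
        self._sent.append(now)
        return True


limiter = SlidingWindowLimiter()
if limiter.try_acquire():
    print("ok to send the metric query")
```

A caller that receives False can sleep briefly and retry instead of sending the query.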

Progressive Loading References

OCI Monitoring Reference (Official Oracle Documentation)

WHEN TO LOAD oci-monitoring-reference.md:
  • Need comprehensive list of all OCI service metrics
  • Understanding MQL (Monitoring Query Language) in depth
  • Implementing complex alarm conditions and composites
  • Need official Oracle guidance on Logging and Service Connector
  • Setting up Log Analytics and APM integration
Do NOT load for:
  • Quick alarm setup (examples in this skill)
  • Common metric patterns (tables above)
  • Troubleshooting decision trees (covered above)

When to Use This Skill

  • Alarms: threshold configuration, missing data handling, trigger delay
  • Troubleshooting: metrics not showing, alarms not firing, namespace errors
  • Log collection: Service Connector, IAM policies, missing logs
  • Performance: query optimization, dimension filtering