monitoring-operations
OCI Monitoring and Observability - Expert Knowledge
🏗️ Use OCI Landing Zone Terraform Modules
Don't reinvent the wheel. Use `oracle-terraform-modules/landing-zone` for the observability stack.
Landing Zone solves:
- ❌ Bad Practice #10: No logging, monitoring, notifications (Landing Zone deploys complete observability)
- ❌ Bad Practice #7: Limited security services (Landing Zone integrates Cloud Guard, VSS, OSMS)
This skill provides: Metrics, alarms, and troubleshooting for monitoring deployed WITHIN a Landing Zone.
⚠️ OCI CLI/API Knowledge Gap
You don't know OCI CLI commands or OCI API structure.
Your training data has limited and outdated knowledge of:
- OCI CLI syntax and parameters (updates monthly)
- OCI API endpoints and request/response formats
- Monitoring service CLI operations (`oci monitoring alarm`, `oci monitoring metric`)
- Metric namespaces and MQL (Monitoring Query Language)
- Latest Logging and Service Connector features
When OCI operations are needed:
- Use exact CLI commands from this skill's references
- Do NOT guess metric namespace names
- Do NOT assume AWS CloudWatch patterns work in OCI
- Load reference files for detailed MQL documentation
What you DO know:
- General observability concepts
- Alerting and threshold design principles
- Log aggregation patterns
This skill bridges the gap by providing current OCI-specific monitoring patterns and gotchas.
NEVER Do This
❌ **NEVER assume metrics are instant (10-15 minute lag)**
- Metrics are published every 1-5 minutes
- Processing delay: 5-10 minutes
- Total lag: 10-15 minutes from event to visible metric
- Don't debug "missing metrics" within the first 15 minutes of resource creation
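The lag rule above can be turned into a small guard that runs before any "missing metrics" investigation. This is a minimal sketch: the function name and the 15-minute constant (the upper end of the lag quoted above) are illustrative assumptions, not part of any OCI SDK.

```python
from datetime import datetime, timedelta, timezone

# Assumption: use the upper bound of the 10-15 minute lag quoted above.
METRIC_LAG = timedelta(minutes=15)


def too_early_to_debug(created_at: datetime, now: datetime) -> bool:
    """Return True while the resource is younger than the expected metric lag."""
    return now - created_at < METRIC_LAG


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(too_early_to_debug(created, created + timedelta(minutes=8)))   # True
print(too_early_to_debug(created, created + timedelta(minutes=20)))  # False
```

If the guard returns True, the sensible action is to wait rather than to inspect namespaces or dimensions.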
❌ **NEVER use `=` for alarm thresholds with sparse metrics**

    # WRONG - alarm never fires if metric has gaps
    MetricName[1m].mean() = 0

    # RIGHT - handle missing data
    MetricName[1m]{dataMissing=zero}.mean() > 0
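To see why an equality check against zero stays silent on a sparse metric, a toy model helps. The sketch below is not how the Monitoring service evaluates alarms; it only assumes that an interval with no datapoint has nothing to compare, which is represented here as `None`. The function name is hypothetical.

```python
from typing import Optional, Sequence


def intervals_equal_to_zero(
    points: Sequence[Optional[float]], treat_missing_as_zero: bool
) -> list:
    """Per interval, report whether the value equals 0.

    None stands for an interval in which no datapoint was published.
    """
    hits = []
    for value in points:
        if value is None and treat_missing_as_zero:
            value = 0.0
        # A still-missing interval has nothing to compare, so it never matches.
        hits.append(value is not None and value == 0.0)
    return hits


series = [3.0, None, None, 2.0]
print(intervals_equal_to_zero(series, False))  # [False, False, False, False]
print(intervals_equal_to_zero(series, True))   # [False, True, True, False]
```

In this model the gaps only become visible once missing intervals are explicitly mapped to a value, which is the point of handling missing data in the query or alarm settings.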
❌ **NEVER forget metric dimensions (causes "no data")**

    # WRONG - missing required dimension
    CPUUtilization[1m].mean()

    # RIGHT - include resourceId dimension
    CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()
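One way to avoid forgetting the dimension is to build query strings through a helper that refuses to produce an unfiltered query. This is a sketch under assumptions: `build_query` is a hypothetical helper, and it only reproduces the query shape shown above rather than the full MQL grammar.

```python
def build_query(metric: str, interval: str, statistic: str, dimensions: dict) -> str:
    """Assemble a query of the form Metric[interval]{key="value"}.statistic()."""
    if not dimensions:
        raise ValueError("at least one dimension (for example resourceId) is required")
    filters = ", ".join(f'{key}="{value}"' for key, value in dimensions.items())
    return f"{metric}[{interval}]{{{filters}}}.{statistic}()"


print(build_query("CPUUtilization", "1m", "mean", {"resourceId": "<instance-ocid>"}))
# CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()
```

Calling the helper with an empty dimension mapping raises `ValueError` instead of silently returning the "WRONG" form.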
❌ **NEVER set alarm thresholds without trigger delay (alert fatigue)**

    # BAD - fires on every CPU spike
    CPUUtilization[1m].mean() > 80

    # BETTER - sustained high CPU
    CPUUtilization[5m].mean() > 80
    # Trigger delay: 5 minutes (fires after 5 consecutive breaches)
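The effect of a trigger delay can be illustrated by counting consecutive breaches. The sketch below is a simplified model with made-up sample values, assuming one evaluation per datapoint. It is not the service's actual evaluation logic.

```python
from typing import Optional, Sequence


def first_firing_index(
    values: Sequence[float], threshold: float, required_consecutive: int
) -> Optional[int]:
    """Return the index at which the alarm would first fire, or None if it never does."""
    streak = 0
    for index, value in enumerate(values):
        # Any non-breaching datapoint resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= required_consecutive:
            return index
    return None


cpu = [50, 95, 60, 85, 88, 91, 90, 93]  # hypothetical per-minute CPU readings
print(first_firing_index(cpu, 80, 1))  # 1 (fires on the single spike)
print(first_firing_index(cpu, 80, 5))  # 7 (fires only after a sustained breach)
```

With a requirement of one breach the isolated spike at index 1 pages someone. With five consecutive breaches only the sustained load does.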
❌ **NEVER create alarms without notification channels**

    # WRONG - alarm fires but nobody knows
    oci monitoring alarm create ... --destinations '[]'

    # RIGHT - always link to notification topic
    oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'

Cost impact: Undetected outages cost $5,000-50,000/hour in production
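A small pre-flight check can reject an empty destination list before the CLI is ever invoked. This is a sketch: `destinations_arg` is a hypothetical helper that only produces the JSON value for the `--destinations` flag shown above and makes no other assumptions about the command.

```python
import json


def destinations_arg(topic_ocids: list) -> str:
    """Return the JSON array for --destinations, refusing an empty list."""
    cleaned = [ocid.strip() for ocid in topic_ocids if ocid and ocid.strip()]
    if not cleaned:
        raise ValueError("an alarm needs at least one notification topic OCID")
    return json.dumps(cleaned)


print(destinations_arg(["<notification-topic-ocid>"]))
# ["<notification-topic-ocid>"]
```

Passing `[]` or a list of blank strings raises `ValueError`, so the "alarm fires but nobody knows" case fails early.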
❌ **NEVER ignore Cloud Guard findings (security audit failure)**
- Cloud Guard detects misconfigurations BEFORE they become incidents
- Integrate Cloud Guard → Notifications → Email/Slack/PagerDuty
- Cost impact: $100,000+ per security breach vs $0 for proactive remediation
Metric Namespace Gotchas
OCI Metrics Use Service-Specific Namespaces:
| Service | Namespace | Example Metric |
|---|---|---|
| Compute | | |
| Autonomous DB | | |
| Load Balancer | | |
| Object Storage | | |
Common Mistake: Using the wrong namespace (`oci_compute` vs `oci_computeagent`)
Alarm Missing Data Handling
| Setting | Behavior | Use When |
|---|---|---|
| | Alarm fires if no data | Critical services (outage = breach) |
| | Alarm silent if no data | Optional monitoring |
| | Treat missing as 0 | Counters (requests/sec) |
Log Collection Common Gaps
Problem: Logs not showing in Log Analytics
Logs not appearing?
├─ Is log enabled on resource?
│ └─ Compute: oci-compute-agent must be running
│ └─ Function: Logging enabled in function config
│
├─ Is Service Connector configured?
│ └─ Source: Log Group → Target: Log Analytics
│ └─ Check: Service Connector status = ACTIVE
│
├─ IAM policy for Service Connector?
│ └─ "Allow any-user to use log-content in tenancy"
│ └─ "Allow service loganalytics to READ logcontent in tenancy"
│
└─ 10-15 minute ingestion lag?
   └─ Wait before debugging
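The decision tree above can be walked in order, stopping at the first check that fails. The sketch below is illustrative only: the state keys are hypothetical and are meant to be filled in by hand or by whatever tooling is available, not by a specific OCI API.

```python
# Ordered checks mirroring the tree above; keys are hypothetical names.
CHECKS = [
    ("log_enabled", "Enable logging on the resource (compute agent running, function logging on)"),
    ("connector_active", "Configure the Service Connector and confirm status = ACTIVE"),
    ("iam_policy_present", "Add the IAM policies for the Service Connector"),
    ("waited_for_ingestion", "Wait out the 10-15 minute ingestion lag before debugging"),
]


def next_step(state: dict) -> str:
    """Return the advice for the first failing check, in tree order."""
    for key, advice in CHECKS:
        if not state.get(key, False):
            return advice
    return "All checks in the tree pass"


print(next_step({"log_enabled": True, "connector_active": False}))
# Configure the Service Connector and confirm status = ACTIVE
```

Keeping the checks in a list preserves the tree's ordering, so cheaper and more common causes are ruled out first.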
Metric Query Optimization
**Expensive** (slow):

    # Queries ALL instances
    CPUUtilization[1m].mean()

**Optimized** (filter by dimension):

    # Query specific instance
    CPUUtilization[1m]{resourceId='<instance-ocid>'}.mean()

**Cost**: Queries are free, but rate limited (1000 req/min)
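Because queries are rate limited, a client that polls many metrics may want to throttle itself. The following is a minimal sliding-window sketch, assuming the 1000 requests per minute figure quoted above. The class name and the injectable clock are illustrative choices, not part of any OCI SDK.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of `period` seconds."""

    def __init__(self, max_calls: int = 1000, period: float = 60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.calls = deque()  # timestamps of accepted calls, oldest first

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False


limiter = SlidingWindowLimiter()
if limiter.try_acquire():
    pass  # safe to issue one metric query here
```

The injectable clock keeps the limiter testable without sleeping. In real use the default `time.monotonic` is sufficient.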
Progressive Loading References
OCI Monitoring Reference (Official Oracle Documentation)
WHEN TO LOAD `oci-monitoring-reference.md`:
- Need a comprehensive list of all OCI service metrics
- Understanding MQL (Monitoring Query Language) in depth
- Implementing complex alarm conditions and composites
- Need official Oracle guidance on Logging and Service Connector
- Setting up Log Analytics and APM integration
Do NOT load for:
- Quick alarm setup (examples in this skill)
- Common metric patterns (tables above)
- Troubleshooting decision trees (covered above)
When to Use This Skill
- Alarms: threshold configuration, missing data handling, trigger delay
- Troubleshooting: metrics not showing, alarms not firing, namespace errors
- Log collection: Service Connector, IAM policies, missing logs
- Performance: query optimization, dimension filtering