ops-inspector

Standalone Install Note

If only the current skill is installed in this environment, start from the CloudBase main entry and use the published `cloudbase/references/...` paths for sibling skills.
  • CloudBase main entry: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/SKILL.md
  • Current skill raw source: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/ops-inspector/SKILL.md
Keep local `references/...` paths for files that ship with the current skill directory. When this file points to a sibling skill such as `cloud-functions` or `cloudrun-development`, use the standalone fallback URL shown next to that reference.

Activation Contract

Use this first when

  • The user wants to check the health or status of CloudBase resources (cloud functions, CloudRun, databases, storage, etc.).
  • The user reports errors, failures, or abnormal behavior and wants a quick diagnosis.
  • The user asks for an "inspection", "health check", "巡检", "诊断", or "troubleshooting" of their CloudBase environment.
  • The user wants to review recent error logs across services.

Read before writing code if

  • The inspection reveals code-level issues in cloud functions or CloudRun services — then read the relevant implementation skill before suggesting fixes.
  • The user wants to fix a problem found during inspection rather than just diagnose it.

Then also read

  • Cloud function issues -> `../cloud-functions/SKILL.md` (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloud-functions/SKILL.md)
  • CloudRun issues -> `../cloudrun-development/SKILL.md` (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudrun-development/SKILL.md)
  • Database issues -> `../relational-database-tool/SKILL.md` (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/relational-database-tool/SKILL.md) or `../no-sql-web-sdk/SKILL.md` (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/no-sql-web-sdk/SKILL.md)
  • Platform overview -> `../cloudbase-platform/SKILL.md` (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudbase-platform/SKILL.md)

Do NOT use for

  • Deploying new resources or writing application code. This skill is read-only and diagnostic.
  • Replacing proper monitoring/alerting infrastructure. It provides point-in-time inspection, not continuous monitoring.
  • Directly fixing problems — it diagnoses and recommends; actual fixes should use the appropriate implementation skill.

Common mistakes / gotchas

  • Running a full inspection without first confirming the environment is bound (the `auth` tool must show a logged-in, env-bound state).
  • Ignoring CLS log service status: if CLS is not enabled, `queryLogs` will fail; always check first with `queryLogs(action="checkLogService")`.
  • Searching logs without a time range — this can return excessive or irrelevant results. Always scope searches to a relevant time window.
  • Treating a single error log as the root cause without correlating across resources. A function error may stem from a database or config issue.

Minimal checklist

  • Environment is bound and accessible (`envQuery(action="info")`)
  • CLS log service is enabled (`queryLogs(action="checkLogService")`)
  • All target resources are listed before diving into details
  • Time range is specified for any log searches
  • Findings are summarized with severity levels and actionable recommendations
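
The first two checklist items can be run as a gate before any deeper inspection. The sketch below is a minimal illustration, not the skill's own implementation: `call_tool` is a hypothetical adapter around whatever MCP client is in use, and the `enabled` field on the `checkLogService` response is an assumed name, since the response shape is not specified here.

```python
def preflight(call_tool):
    """Run the environment and CLS checks before a deeper inspection.

    call_tool(name, **params) -> dict is a hypothetical adapter around the
    MCP client; it is assumed to raise when a tool call fails.
    Returns (can_continue, warnings).
    """
    warnings = []
    try:
        call_tool("envQuery", action="info")
    except Exception as exc:
        # Without a bound, reachable environment nothing else can run.
        return False, [f"Environment is not accessible: {exc}"]

    status = call_tool("queryLogs", action="checkLogService")
    # "enabled" is an assumed field name, used for illustration only.
    if not status.get("enabled", False):
        warnings.append("CLS is not enabled; log-based diagnosis will be unavailable")
    return True, warnings
```

A disabled log service is treated as a warning rather than a hard stop, so resource listings can still be inspected without logs.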

How to use this skill (for a coding agent)

Inspection Modes

The skill supports two modes based on user intent:

| Mode | When to use | Scope |
| --- | --- | --- |
| Full inspection | User asks for a general health check / 巡检 / 全面检查 | All resource types in the environment |
| Targeted inspection | User reports a specific error or asks about a specific resource | One resource type or a specific resource |

Full Inspection Workflow

Follow these steps in order for a comprehensive environment health check:
Step 1 — Environment Check
envQuery(action="info")
Confirm the environment is accessible. Record the `envId` for console link generation.
Step 2 — Log Service Status
queryLogs(action="checkLogService")
If CLS is not enabled, note this as a warning — log-based diagnosis will be unavailable. Recommend enabling CLS in the console:
https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log
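
The console link above is a template with `${envId}` filled in from Step 1. As a small sketch, assuming the environment ID is available as a plain string, the substitution could look like this (the function name is illustrative):

```python
from urllib.parse import quote

# URL template taken from the step above, with ${envId} as a placeholder.
CLS_CONSOLE_TEMPLATE = "https://tcb.cloud.tencent.com/dev?envId={env_id}#/devops/log"


def cls_console_url(env_id: str) -> str:
    """Return the console URL where CLS can be enabled for an environment."""
    if not env_id:
        raise ValueError("env_id must be a non-empty string")
    # Percent-encode the ID so unexpected characters cannot break the query string.
    return CLS_CONSOLE_TEMPLATE.format(env_id=quote(env_id, safe=""))


print(cls_console_url("my-env-123"))
```
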
Step 3 — Cloud Functions Inspection
queryFunctions(action="listFunctions")
For each function, check:
  • Status: Is the function in an active/deployed state?
  • Recent errors:
    queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<recent>")
  • Common issues:
    • Timeout errors (execution exceeded limit)
    • Memory limit exceeded
    • Runtime errors (unhandled exceptions)
    • Cold start frequency
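
Once function logs are retrieved, each message can be sorted into the common issue types listed above. The following is a rough keyword-based sketch; the keyword lists are illustrative assumptions rather than an official error taxonomy, and cold start frequency is left out because it needs per-invocation metadata rather than message text.

```python
import re

# Keyword rules, checked in order; the first matching category wins.
# The keywords are illustrative and should be tuned to the real log format.
ISSUE_RULES = [
    ("timeout", re.compile(r"timeout|timed out|超时", re.IGNORECASE)),
    ("memory", re.compile(r"\bOOM\b|out of memory|内存超限", re.IGNORECASE)),
    ("runtime_error", re.compile(r"unhandled|exception|error", re.IGNORECASE)),
]


def classify_log_line(message: str) -> str:
    """Map a log message to a coarse issue category, or "other" if none match."""
    for category, pattern in ISSUE_RULES:
        if pattern.search(message):
            return category
    return "other"
```

Timeout and memory rules are checked before the generic error rule so that a message such as "timeout error" is not swallowed by the catch-all category.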
Step 4 — CloudRun Services Inspection
queryCloudRun(action="list")
For each service, check:
  • Status: Is the service running?
  • Detail:
    queryCloudRun(action="detail", detailServerName="<name>")
  • Common issues:
    • Service not running (scaled to zero or crashed)
    • Image pull failures
    • OOMKilled events
    • Health check failures
Step 5 — Error Log Aggregation (if CLS is enabled)
queryLogs(action="searchLogs", queryString="ERROR", service="tcb", startTime="<24h-ago>", limit=50)
queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", startTime="<24h-ago>", limit=50)
Look for patterns:
  • Repeated error messages (same error many times)
  • Cascading failures (errors in multiple services around the same time)
  • Timeout patterns
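
To spot repeated error messages, it helps to collapse variable parts (request IDs, durations, addresses) before counting. This is a minimal sketch that assumes the log messages have already been extracted as plain strings; the normalization rules are illustrative.

```python
import re
from collections import Counter


def normalize(message: str) -> str:
    """Collapse variable parts so that similar errors group together."""
    # Long hex-like tokens are treated as IDs.
    message = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", message, flags=re.IGNORECASE)
    # Remaining digit runs (durations, ports, IP octets) become <n>.
    message = re.sub(r"\d+", "<n>", message)
    return message.strip()


def top_error_patterns(messages, limit=5):
    """Return the most frequent normalized messages as (pattern, count) pairs."""
    counts = Counter(normalize(m) for m in messages)
    return counts.most_common(limit)
```

For example, "Timeout after 3000 ms" and "Timeout after 5000 ms" both normalize to "Timeout after <n> ms" and are counted as one pattern.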
Step 6 — Summary Report
Generate a structured report:

```markdown
# CloudBase Resource Inspection Report

Environment: ${envId}
Inspection Time: ${timestamp}

## Overall Health: ✅ Healthy / ⚠️ Warnings Found / ❌ Issues Found

### Cloud Functions

| Function | Status | Recent Errors | Severity |
| --- | --- | --- | --- |
| ... | ... | ... | ... |

### CloudRun Services

| Service | Status | Issues | Severity |
| --- | --- | --- | --- |
| ... | ... | ... | ... |

### Error Log Summary

- Total errors in last 24h: N
- Top error patterns: ...

### Recommendations

1. ...
2. ...

### Console Links
```
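
The report sections can be assembled from collected findings with plain string formatting. The sketch below is one possible way to do it; the `Row` type, the function names, and the heading levels are assumptions, and the error log summary and console links are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One table row: a function or service with its status and findings."""
    name: str
    status: str
    detail: str
    severity: str


def table(headers, rows):
    """Render a four-column Markdown table from Row objects."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for r in rows:
        lines.append(f"| {r.name} | {r.status} | {r.detail} | {r.severity} |")
    return "\n".join(lines)


def render_report(env_id, timestamp, overall, functions, services, recommendations):
    """Assemble the inspection report as a Markdown string."""
    parts = [
        "# CloudBase Resource Inspection Report",
        f"Environment: {env_id}",
        f"Inspection Time: {timestamp}",
        f"## Overall Health: {overall}",
        "### Cloud Functions",
        table(["Function", "Status", "Recent Errors", "Severity"], functions),
        "### CloudRun Services",
        table(["Service", "Status", "Issues", "Severity"], services),
        "### Recommendations",
        "\n".join(f"{i}. {text}" for i, text in enumerate(recommendations, start=1)),
    ]
    return "\n\n".join(parts)
```
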

Targeted Inspection Workflow

When the user specifies a resource type or a specific resource:
  1. Cloud function errors: `queryFunctions(action="listFunctionLogs", functionName="<name>")`, then `queryLogs(action="searchLogs", queryString="* AND functionName:<name> AND level:ERROR", ...)`
  2. CloudRun errors: `queryCloudRun(action="detail", detailServerName="<name>")`, then `queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", ...)`
  3. Database issues: check `querySqlDatabase` or `readNoSqlDatabaseStructure`, depending on the database type
  4. General error search: `queryLogs(action="searchLogs", queryString="<error-keyword>", ...)`
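
The targeted workflow amounts to a lookup from resource type to an ordered list of tool calls. A minimal sketch of that dispatch follows; the resource-type keys (`"function"`, `"cloudrun"`, `"sql"`, `"nosql"`) are hypothetical labels, the database actions are taken from the tool map in this document, and time-range parameters are left for the caller to add.

```python
def targeted_plan(resource_type: str, name: str = "") -> list[tuple[str, dict]]:
    """Return the ordered (tool, params) calls for a targeted inspection.

    The resource_type values are hypothetical labels used only in this sketch.
    Callers are expected to add startTime/endTime to the log searches.
    """
    if resource_type == "function":
        return [
            ("queryFunctions", {"action": "listFunctionLogs", "functionName": name}),
            ("queryLogs", {
                "action": "searchLogs",
                "queryString": f"* AND functionName:{name} AND level:ERROR",
            }),
        ]
    if resource_type == "cloudrun":
        return [
            ("queryCloudRun", {"action": "detail", "detailServerName": name}),
            ("queryLogs", {"action": "searchLogs", "queryString": "ERROR", "service": "tcbr"}),
        ]
    if resource_type == "sql":
        return [("querySqlDatabase", {"action": "getContext"})]
    if resource_type == "nosql":
        return [("readNoSqlDatabaseStructure", {"action": "listCollections"})]
    raise ValueError(f"unknown resource type: {resource_type!r}")
```
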

AIOps Methodology

This skill follows AIOps principles for intelligent inspection:
  1. Data Collection: Gather logs and resource states via MCP tools
  2. Pattern Recognition: Identify recurring errors, anomaly patterns, and correlations across services
  3. Root Cause Hypothesis: Based on error patterns, suggest likely root causes (e.g., a function timeout may be caused by a database query bottleneck)
  4. Actionable Recommendations: Provide specific, prioritized remediation steps with links to relevant skills and console pages
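
One simple way to surface correlations across services is to bucket error events into fixed time windows and flag windows in which more than one service reported errors. The sketch below assumes each event has already been reduced to a timezone-aware timestamp and a service label; the 5-minute window is an arbitrary choice, and fixed buckets can miss bursts that straddle a window boundary.

```python
from collections import defaultdict
from datetime import datetime, timezone


def cascading_windows(events, window_seconds=300):
    """Find time windows in which two or more services logged errors.

    events: iterable of (timestamp, service) pairs with timezone-aware datetimes.
    Returns a list of (window_start, sorted_service_names) in time order.
    """
    buckets = defaultdict(set)
    for ts, service in events:
        # Integer division assigns each event to a fixed-size window.
        buckets[int(ts.timestamp()) // window_seconds].add(service)
    return [
        (datetime.fromtimestamp(bucket * window_seconds, tz=timezone.utc), sorted(services))
        for bucket, services in sorted(buckets.items())
        if len(services) >= 2
    ]
```

A flagged window is only a hypothesis about a shared cause; it still needs to be checked against the individual log entries.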

Severity Levels

| Level | Icon | Meaning |
| --- | --- | --- |
| Critical | | Service is down or data is at risk; requires immediate action |
| Warning | ⚠️ | Errors detected but service is still partially functional; investigate soon |
| Info | ℹ️ | No errors found; informational status only |
| Healthy | | Resource is operating normally |
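
Per-resource severities can be rolled up into the overall health line of the report by taking the worst level found. The mapping below (Critical to "Issues Found", Warning to "Warnings Found", everything else to "Healthy") is an assumption made for this sketch, as is the ordering of Info above Healthy.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels, ordered so that a larger value is worse."""
    HEALTHY = 0
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def overall_health(severities) -> str:
    """Summarize a collection of per-resource severities as one status label."""
    worst = max(severities, default=Severity.HEALTHY)
    if worst >= Severity.CRITICAL:
        return "Issues Found"
    if worst >= Severity.WARNING:
        return "Warnings Found"
    return "Healthy"
```
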

Preferred Tool Map

| Operation | MCP Tool Call |
| --- | --- |
| Check environment | `envQuery(action="info")` |
| Check CLS status | `queryLogs(action="checkLogService")` |
| List cloud functions | `queryFunctions(action="listFunctions")` |
| Get function detail | `queryFunctions(action="getFunctionDetail", functionName="<name>")` |
| Get function logs | `queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<time>", endTime="<time>")` |
| Get function log detail | `queryFunctions(action="getFunctionLogDetail", requestId="<id>")` |
| List CloudRun services | `queryCloudRun(action="list")` |
| Get CloudRun detail | `queryCloudRun(action="detail", detailServerName="<name>")` |
| Search CLS logs | `queryLogs(action="searchLogs", queryString="<query>", service="tcb\|tcbr", startTime="<time>", endTime="<time>")` |
| Check NoSQL structure | `readNoSqlDatabaseStructure(action="listCollections")` |
| Check MySQL status | `querySqlDatabase(action="getContext")` |

Common CLS Query Patterns

| Scenario | queryString |
| --- | --- |
| All errors | `ERROR` |
| Function timeout | `timeout OR 超时` |
| Function OOM | `OOM OR out of memory OR 内存超限` |
| CloudRun crash | `crash OR OOMKilled OR Error` |
| Specific function errors | `functionName:<name> AND level:ERROR` |
| 5xx HTTP errors | `statusCode:>499` |
| Cold start issues | `coldStart OR 冷启动` |
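
These query strings are simple enough to build with small helpers, which keeps the `OR`/`AND` joining consistent. The sketch below only reproduces the patterns listed above; the helper names and dictionary keys are illustrative.

```python
def any_of(*terms: str) -> str:
    """Join search terms with OR, as in the patterns above."""
    return " OR ".join(terms)


def function_errors(function_name: str) -> str:
    """Build the query for ERROR-level logs of one function."""
    return f"functionName:{function_name} AND level:ERROR"


# Static patterns, keyed by illustrative scenario names.
QUERY_PATTERNS = {
    "all_errors": "ERROR",
    "function_timeout": any_of("timeout", "超时"),
    "function_oom": any_of("OOM", "out of memory", "内存超限"),
    "cloudrun_crash": any_of("crash", "OOMKilled", "Error"),
    "http_5xx": "statusCode:>499",
    "cold_start": any_of("coldStart", "冷启动"),
}
```
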

Time Range Guidance

  • Quick check: Last 1 hour (`startTime` = 1 hour ago)
  • Standard inspection: Last 24 hours
  • Trend analysis: Last 7 days
  • Specific incident: Narrow to the reported time window
Always pass `startTime` / `endTime` as ISO 8601-style timestamps in the form YYYY-MM-DD HH:mm:ss, e.g., "2025-01-15 00:00:00".

Related Skills

  • `cloud-functions`: Cloud function development, deployment, and debugging
  • `cloudrun-development`: CloudRun backend deployment and management
  • `cloudbase-platform`: General platform knowledge and console navigation
  • `relational-database-tool`: MySQL database management and diagnostics