holmesgpt-skill

HolmesGPT Skill

AI-powered troubleshooting for Kubernetes and cloud-native environments.

Overview

HolmesGPT is a CNCF Sandbox project that connects AI models with live observability data to investigate infrastructure problems, find root causes, and suggest remediations. It operates with read-only access and respects RBAC permissions, making it safe for production environments.

Quick Reference

| Topic | Reference |
|---|---|
| Installation | references/installation.md |
| Configuration | references/configuration.md |
| Data Sources | references/data-sources.md |
| Commands | references/commands.md |
| Troubleshooting | references/troubleshooting.md |
| HTTP API | references/http-api.md |
| Integrations | references/integrations.md |

Key Features

  • Root Cause Analysis: Investigates alerts and cluster issues
  • Multi-Source Integration: 30+ toolsets (K8s, Prometheus, Grafana)
  • Alert Integration: AlertManager, PagerDuty, OpsGenie, Jira, Slack
  • Interactive Mode: Troubleshooting with /run, /show, /clear
  • Custom Toolsets: Extend with proprietary tools via YAML configuration
  • CI/CD Integration: Automated deployment failure investigation

Installation Quick Start

CLI (Homebrew)

bash
brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt
export ANTHROPIC_API_KEY="your-key"  # or OPENAI_API_KEY
holmes ask "what pods are unhealthy?"

Kubernetes (Helm)

bash
helm repo add robusta https://robusta-charts.storage.googleapis.com
helm repo update
helm install holmesgpt robusta/holmes -f values.yaml

Docker

bash
docker run -it --net=host \
  -e OPENAI_API_KEY="your-key" \
  -v ~/.kube/config:/root/.kube/config \
  us-central1-docker.pkg.dev/genuine-flight-317411/devel/holmes \
  ask "what pods are crashing?"

Essential Commands

bash
# Basic investigation
holmes ask "what pods are unhealthy and why?"
holmes ask "why is my deployment failing?"

# Interactive mode
holmes ask "investigate issue" --interactive

# Alert investigation
holmes investigate alertmanager --alertmanager-url http://localhost:9093
holmes investigate pagerduty --pagerduty-api-key <KEY> --update

# With file context
holmes ask "summarize the key points" -f ./logs.txt

# CI/CD integration
holmes ask "why did deployment fail?" --destination slack --slack-token <TOKEN>
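
The CI/CD command above can be wrapped so that it only runs when a rollout actually fails. The following is a minimal sketch, not an official pipeline step: the function name `investigate_failed_rollout`, the 120-second timeout, and the `SLACK_TOKEN` variable are assumptions made for illustration, and only the `holmes` flags shown above are used.

```bash
# Hypothetical CI helper: wait for a rollout, and ask HolmesGPT only if it fails.
# Assumes kubectl and holmes are on PATH and SLACK_TOKEN is set by the CI system.
investigate_failed_rollout() {
  local deployment="$1" namespace="$2"

  # kubectl rollout status exits non-zero if the rollout does not complete in time.
  if kubectl rollout status "deployment/${deployment}" \
       --namespace "$namespace" --timeout=120s; then
    return 0
  fi

  # Send the investigation result to Slack, then fail the job.
  holmes ask "why did deployment ${deployment} in namespace ${namespace} fail?" \
    --destination slack --slack-token "$SLACK_TOKEN"
  return 1
}
```

A pipeline step could call `investigate_failed_rollout checkout-api payments`, where both names are placeholders. Returning 1 after the investigation keeps the job red, so the analysis supplements the failure signal rather than hiding it.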

Supported AI Providers

| Provider | Environment Variable | Models |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | Sonnet 4, Opus 4.5 |
| OpenAI | OPENAI_API_KEY | GPT-4.1, GPT-4o |
| Azure OpenAI | AZURE_API_KEY | GPT-4.1 |
| AWS Bedrock | AWS credentials | Claude 3.5 Sonnet |
| Google Gemini | GEMINI_API_KEY | Gemini 1.5 Pro |
| Vertex AI | VERTEXAI_PROJECT | Gemini 1.5 Pro |
| Ollama | Local install | Llama 3.1, Mistral |

Basic Helm Values Structure

yaml
# values.yaml for Kubernetes deployment
image:
  repository: robustadev/holmes
  tag: latest

env:
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: holmesgpt-secrets
        key: anthropic-api-key

# Model configuration
modelList:
  sonnet:
    api_key: "{{ env.ANTHROPIC_API_KEY }}"
    model: anthropic/claude-sonnet-4-20250514
    temperature: 0

# Toolsets to enable
toolsets:
  kubernetes/core:
    enabled: true
  kubernetes/logs:
    enabled: true
  prometheus/metrics:
    enabled: true

# Resources
resources:
  requests:
    memory: "1024Mi"
    cpu: "100m"
  limits:
    memory: "1024Mi"

# RBAC (read-only by default)
createServiceAccount: true
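
The values above read the API key from a Secret named `holmesgpt-secrets` under the key `anthropic-api-key`, so that Secret has to exist in the release namespace before the pod can start. One way to create it is sketched below; the function name `create_holmes_secret` and the namespace used in the usage note are assumptions, not part of the chart.

```bash
# Hypothetical helper: create the Secret that values.yaml references.
# The Secret name and key match the secretKeyRef in the values file above.
create_holmes_secret() {
  local namespace="$1" api_key="$2"
  kubectl create secret generic holmesgpt-secrets \
    --namespace "$namespace" \
    --from-literal=anthropic-api-key="$api_key"
}
```

For example, `create_holmes_secret holmes "$ANTHROPIC_API_KEY"` would create the Secret in a namespace called `holmes`, assuming that is where the chart is installed.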

Interactive Mode Commands

| Command | Description |
|---|---|
| /clear | Reset context when changing topics |
| /run | Execute custom commands and share output with AI |
| /show | Display complete tool outputs |
| /context | Review accumulated investigation information |

Custom Toolset Example

yaml
# custom-toolset.yaml
toolsets:
  my-custom-tool:
    description: "Custom diagnostic tool"
    tools:
      - name: check_service_health
        description: "Check health of a specific service"
        command: |
          curl -s http://{{ service_name }}.{{ namespace }}.svc.cluster.local/health
        parameters:
          - name: service_name
            description: "Name of the service"
          - name: namespace
            description: "Kubernetes namespace"

Use with: `holmes ask "check health" -t custom-toolset.yaml`

Kubernetes Annotations for Integration

yaml
# Add to Services/Deployments for HolmesGPT context
metadata:
  annotations:
    holmesgpt.dev/runbook: |
      This service handles payment processing.
      Common issues: database connectivity, API rate limits.
      Check: kubectl logs -l app=payment-service
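
The same runbook annotation can also be attached to a resource that is already running, without editing its manifest. This is a sketch using the standard `kubectl annotate` command; the function name `annotate_runbook` and the resource names in the usage note are hypothetical.

```bash
# Hypothetical helper: set the holmesgpt.dev/runbook annotation on an existing resource.
# --overwrite replaces a previous runbook value instead of failing.
annotate_runbook() {
  local kind="$1" name="$2" namespace="$3" runbook="$4"
  kubectl annotate "$kind" "$name" \
    --namespace "$namespace" --overwrite \
    "holmesgpt.dev/runbook=${runbook}"
}
```

For example: `annotate_runbook deployment payment-service payments "Common issues: database connectivity, API rate limits."` Annotations applied this way are lost if the manifest is later re-applied without them, so the YAML form is the better long-term home.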

Environment Variables Reference

| Variable | Description | Default |
|---|---|---|
| HOLMES_CONFIG_PATH | Config file path | ~/.holmes/config.yaml |
| HOLMES_LOG_LEVEL | Log verbosity | INFO |
| PROMETHEUS_URL | Prometheus server URL | - |
| GITHUB_TOKEN | GitHub API token | - |
| DATADOG_API_KEY | DataDog API key | - |
| CONFLUENCE_BASE_URL | Confluence URL | - |
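
As a sketch of how these variables might be set in a shell session before running the CLI: the values below are illustrative assumptions, in particular the `DEBUG` level (assumed to follow the usual level names, given the `INFO` default) and the local Prometheus address.

```bash
# Illustrative values only; adjust to your environment.
export HOLMES_CONFIG_PATH="$HOME/.holmes/config.yaml"  # same location as the documented default
export HOLMES_LOG_LEVEL="DEBUG"                        # assumption: standard level names are accepted
export PROMETHEUS_URL="http://localhost:9090"          # hypothetical local Prometheus
```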

Best Practices

  1. Use Specific Queries: Include namespace, deployment name, symptoms
  2. Start with Claude Sonnet 4.0/4.5: Best accuracy for complex investigations
  3. Enable Relevant Toolsets: Only enable what you need to reduce noise
  4. Use Interactive Mode: For complex multi-step investigations
  5. Set Up Runbooks: Provide context for known alert types
  6. CI/CD Integration: Automate deployment failure analysis
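
To illustrate the first practice, a query that names the namespace, the deployment, and the symptom can be assembled from variables. This is only a sketch: the function name `specific_query` and the example values are hypothetical, and the wording of the question is one possibility among many.

```bash
# Hypothetical helper: build a query that includes namespace, deployment, and symptom.
specific_query() {
  local namespace="$1" deployment="$2" symptom="$3"
  printf 'Why is deployment %s in namespace %s failing? Symptom: %s' \
    "$deployment" "$namespace" "$symptom"
}
```

It could then be passed to the CLI as `holmes ask "$(specific_query payments checkout-api "pods restart with CrashLoopBackOff")"`, which gives the model far more to work with than a bare "why is my deployment failing?".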

Security Considerations

  • HolmesGPT uses read-only access (get, list, watch only)
  • Respects existing RBAC permissions
  • Never modifies, creates, or deletes resources
  • API keys stored in Kubernetes Secrets
  • Data not used for model training
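
The read-only claim can be checked against a live cluster with `kubectl auth can-i`, impersonating the service account HolmesGPT runs as. The sketch below assumes you know that account's namespace and name (both are placeholders in the usage note) and only probes pods, so it is a spot check rather than a full audit.

```bash
# Hypothetical helper: print whether a service account may perform each verb on pods.
# Expect "yes" for get/list/watch and "no" for the mutating verbs.
check_read_only() {
  local subject="system:serviceaccount:$1:$2"
  local verb
  for verb in get list watch create update patch delete; do
    printf '%s pods: ' "$verb"
    # can-i exits non-zero on "no"; keep looping so every verb is reported.
    kubectl auth can-i "$verb" pods --as "$subject" --all-namespaces || true
  done
}
```

For example, `check_read_only holmes holmes-service-account` would report the seven verbs for a service account with that hypothetical name. Running it requires permission to impersonate service accounts.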

Official Resources
