holmesgpt-skill
HolmesGPT Skill
AI-powered troubleshooting for Kubernetes and cloud-native environments.
Overview
HolmesGPT is a CNCF Sandbox project that connects AI models with live
observability data to investigate infrastructure problems, find root
causes, and suggest remediations. It operates with read-only access
and respects RBAC permissions, making it safe for production environments.
Quick Reference
| Topic | Reference |
|---|---|
| Installation | |
| Configuration | |
| Data Sources | |
| Commands | |
| Troubleshooting | |
| HTTP API | |
| Integrations | |
Key Features
- Root Cause Analysis: Investigates alerts and cluster issues
- Multi-Source Integration: 30+ toolsets (K8s, Prometheus, Grafana)
- Alert Integration: AlertManager, PagerDuty, OpsGenie, Jira, Slack
- Interactive Mode: Troubleshooting with `/run`, `/show`, `/clear`
- Custom Toolsets: Extend with proprietary tools via YAML configuration
- CI/CD Integration: Automated deployment failure investigation
Installation Quick Start
CLI (Homebrew)
```bash
brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt
export ANTHROPIC_API_KEY="your-key"  # or OPENAI_API_KEY
holmes ask "what pods are unhealthy?"
```
Kubernetes (Helm)
```bash
helm repo add robusta https://robusta-charts.storage.googleapis.com
helm repo update
helm install holmesgpt robusta/holmes -f values.yaml
```
Docker
```bash
docker run -it --net=host \
  -e OPENAI_API_KEY="your-key" \
  -v ~/.kube/config:/root/.kube/config \
  us-central1-docker.pkg.dev/genuine-flight-317411/devel/holmes \
  ask "what pods are crashing?"
```
Essential Commands
```bash
# Basic investigation
holmes ask "what pods are unhealthy and why?"
holmes ask "why is my deployment failing?"

# Interactive mode
holmes ask "investigate issue" --interactive

# Alert investigation
holmes investigate alertmanager --alertmanager-url http://localhost:9093
holmes investigate pagerduty --pagerduty-api-key <KEY> --update

# With file context
holmes ask "summarize the key points" -f ./logs.txt

# CI/CD integration
holmes ask "why did deployment fail?" --destination slack --slack-token <TOKEN>
```
Supported AI Providers
| Provider | Environment Variable | Models |
|---|---|---|
| Anthropic | | Sonnet 4, Opus 4.5 |
| OpenAI | | GPT-4.1, GPT-4o |
| Azure OpenAI | | GPT-4.1 |
| AWS Bedrock | AWS credentials | Claude 3.5 Sonnet |
| Google Gemini | | Gemini 1.5 Pro |
| Vertex AI | | Gemini 1.5 Pro |
| Ollama | Local install | Llama 3.1, Mistral |
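Because a provider is selected through credentials in the environment, a small pre-flight check can catch a missing key before `holmes` is invoked. The sketch below checks only the two variable names that appear in the Quick Start (`ANTHROPIC_API_KEY` and `OPENAI_API_KEY`). The function name `detect_provider` is hypothetical, and other providers would need their own variables added.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which provider key is present in the environment.
# Only the two variable names shown in the Quick Start are checked.
detect_provider() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "anthropic"
  elif [ -n "${OPENAI_API_KEY:-}" ]; then
    echo "openai"
  else
    echo "no provider API key found in environment" >&2
    return 1
  fi
}

# Example (commented out so the snippet has no side effects when sourced):
# detect_provider >/dev/null && holmes ask "what pods are unhealthy?"
```

Failing early this way gives a clearer message than a provider error surfacing mid-investigation.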
Basic Helm Values Structure
```yaml
# values.yaml for Kubernetes deployment
image:
  repository: robustadev/holmes
  tag: latest

env:
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: holmesgpt-secrets
        key: anthropic-api-key

# Model configuration
modelList:
  sonnet:
    api_key: "{{ env.ANTHROPIC_API_KEY }}"
    model: anthropic/claude-sonnet-4-20250514
    temperature: 0

# Toolsets to enable
toolsets:
  kubernetes/core:
    enabled: true
  kubernetes/logs:
    enabled: true
  prometheus/metrics:
    enabled: true

# Resources
resources:
  requests:
    memory: "1024Mi"
    cpu: "100m"
  limits:
    memory: "1024Mi"

# RBAC (read-only by default)
createServiceAccount: true
```
Interactive Mode Commands
| Command | Description |
|---|---|
| `/clear` | Reset context when changing topics |
| `/run` | Execute custom commands and share output with AI |
| `/show` | Display complete tool outputs |
| | Review accumulated investigation information |
Custom Toolset Example
```yaml
# custom-toolset.yaml
toolsets:
  my-custom-tool:
    description: "Custom diagnostic tool"
    tools:
      - name: check_service_health
        description: "Check health of a specific service"
        command: |
          curl -s http://{{ service_name }}.{{ namespace }}.svc.cluster.local/health
        parameters:
          - name: service_name
            description: "Name of the service"
          - name: namespace
            description: "Kubernetes namespace"
```
Use with: `holmes ask "check health" -t custom-toolset.yaml`
Kubernetes Annotations for Integration
```yaml
# Add to Services/Deployments for HolmesGPT context
metadata:
  annotations:
    holmesgpt.dev/runbook: |
      This service handles payment processing.
      Common issues: database connectivity, API rate limits.
      Check: kubectl logs -l app=payment-service
```
Environment Variables Reference
| Variable | Description | Default |
|---|---|---|
| | Config file path | |
| | Log verbosity | |
| | Prometheus server URL | - |
| | GitHub API token | - |
| | DataDog API key | - |
| | Confluence URL | - |
Best Practices
- Use Specific Queries: Include namespace, deployment name, symptoms
- Start with Claude Sonnet 4.0/4.5: Best accuracy for complex investigations
- Enable Relevant Toolsets: Only enable what you need to reduce noise
- Use Interactive Mode: For complex multi-step investigations
- Set Up Runbooks: Provide context for known alert types
- CI/CD Integration: Automate deployment failure analysis
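The first and last practices above can be combined in a pipeline step: build a specific question from the namespace, deployment name, and symptom, and ask it only when a deploy command fails. This is a minimal sketch, not an official integration. The function names, the `HOLMES_BIN` override, and the question wording are all hypothetical; only `holmes ask "<question>"` comes from the commands shown earlier.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; names and argument order are illustrative.
HOLMES_BIN="${HOLMES_BIN:-holmes}"   # overridable so the wrapper can be dry-run

# Build a specific question from namespace, deployment name, and symptom.
build_query() {
  printf 'why is deployment %s in namespace %s %s?' "$2" "$1" "$3"
}

# Run a deploy command; if it fails, ask HolmesGPT about the deployment and
# return the original exit status so the pipeline still fails.
deploy_or_investigate() {
  local namespace="$1" deployment="$2" status=0
  shift 2
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    "$HOLMES_BIN" ask "$(build_query "$namespace" "$deployment" "failing after rollout")" || true
  fi
  return "$status"
}

# Example (commented out; needs kubectl, holmes, and cluster access):
# deploy_or_investigate payments checkout-api \
#   kubectl rollout status deployment/checkout-api -n payments
```

Preserving the deploy command's exit status keeps the investigation advisory: the job still fails, and the HolmesGPT output appears in the job log next to the failure.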
Security Considerations
- HolmesGPT uses read-only access (`get`, `list`, `watch` only)
- Respects existing RBAC permissions
- Never modifies, creates, or deletes resources
- API keys stored in Kubernetes Secrets
- Data not used for model training
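To illustrate the point about keeping API keys in Kubernetes Secrets, the sketch below prints a Secret manifest whose name and key match the `secretKeyRef` in the Basic Helm Values Structure section (`holmesgpt-secrets` / `anthropic-api-key`). The function name and the namespace argument are hypothetical, and the key is assumed to contain no double quotes.

```bash
#!/usr/bin/env bash
# Print a Secret manifest matching the secretKeyRef used in the Helm values
# (name: holmesgpt-secrets, key: anthropic-api-key).
# The function name and namespace argument are hypothetical.
render_api_key_secret() {
  local namespace="$1" api_key="$2"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: holmesgpt-secrets
  namespace: ${namespace}
type: Opaque
stringData:
  anthropic-api-key: "${api_key}"
EOF
}

# Example (commented out; requires kubectl and cluster access):
# render_api_key_secret holmes "$ANTHROPIC_API_KEY" | kubectl apply -f -
```

Piping the manifest straight into `kubectl apply` avoids writing the key to a file on disk.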
Official Resources
- Documentation: https://holmesgpt.dev/
- GitHub: https://github.com/robusta-dev/holmesgpt
- Helm Chart: https://github.com/robusta-dev/holmesgpt/tree/master/helm/holmes
- Slack Community: Cloud Native Slack