# Grafana Mimir Skill

Comprehensive guide for Grafana Mimir - the horizontally scalable, highly available, multi-tenant time series database for long-term Prometheus metrics storage.
## What is Mimir?
Mimir is an open-source, horizontally scalable, highly available, multi-tenant long-term storage solution for Prometheus and OpenTelemetry metrics that:

- Overcomes Prometheus limitations - Scalability and long-term retention
- Multi-tenant by default - Built-in tenant isolation via the `X-Scope-OrgID` header
- Stores data in object storage - S3, GCS, Azure Blob Storage, or Swift
- 100% Prometheus compatible - PromQL queries, remote write protocol
- Part of the LGTM+ Stack - Logs, Grafana, Traces, Metrics unified observability
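Because Mimir speaks the Prometheus remote-write protocol and keys tenancy off a header, pointing an existing Prometheus at it is a small config change. A minimal sketch (the gateway URL and tenant name are illustrative placeholders, not fixed values):

```yaml
# prometheus.yml fragment - ship metrics to Mimir
# (URL and tenant below are example values)
remote_write:
  - url: http://mimir-gateway:8080/api/v1/push
    headers:
      X-Scope-OrgID: my-tenant
```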
## Architecture Overview

### Core Components
| Component | Purpose |
|---|---|
| Distributor | Validates requests, routes incoming metrics to ingesters via hash ring |
| Ingester | Stores time-series data in memory, flushes to object storage |
| Querier | Executes PromQL queries from ingesters and store-gateways |
| Query Frontend | Caches query results, optimizes and splits queries |
| Query Scheduler | Manages per-tenant query queues for fairness |
| Store-Gateway | Provides access to historical metric blocks in object storage |
| Compactor | Consolidates and optimizes stored metric data blocks |
| Ruler | Evaluates recording and alerting rules (optional) |
| Alertmanager | Handles alert routing and deduplication (optional) |
### Data Flow
**Write Path:**

```
Prometheus/OTel → Distributor → Ingester → Object Storage
                       ↓
                  Hash Ring
             (routes by series)
```

**Read Path:**

```
Query → Query Frontend → Query Scheduler → Querier
                                              ↓
                                     Ingesters (recent)
                                              ↓
                                  Store-Gateway (historical)
```

## Deployment Modes
### 1. Monolithic Mode (`-target=all`)

- All components in single process
- Best for: Development, testing, small-scale (~1M series)
- Horizontally scalable by deploying multiple instances
- Not recommended for large-scale (all components scale together)
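For local experiments, monolithic mode can run from a single config file. A minimal sketch using the local filesystem as the object-storage backend (the file name, directories, and port are illustrative assumptions, suitable only for testing):

```yaml
# demo.yaml - single-process Mimir with filesystem "object storage"
multitenancy_enabled: false

blocks_storage:
  backend: filesystem
  filesystem:
    dir: /tmp/mimir/data/tsdb

ruler_storage:
  backend: filesystem
  filesystem:
    dir: /tmp/mimir/data/rules
```

Start it with `mimir -config.file=demo.yaml -target=all`, then remote-write to it on the default HTTP port.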
### 2. Microservices Mode (Distributed) - Recommended for Production
```yaml
# Using mimir-distributed Helm chart
distributor:
  replicas: 3
ingester:
  replicas: 3
  zoneAwareReplication:
    enabled: true
querier:
  replicas: 3
queryFrontend:
  replicas: 2
queryScheduler:
  replicas: 2
storeGateway:
  replicas: 3
compactor:
  replicas: 1
```

## Helm Deployment
### Add Repository
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
```

### Install Distributed Mimir
```bash
helm install mimir grafana/mimir-distributed \
  --namespace monitoring \
  --values values.yaml
```

### Pre-Built Values Files
| File | Purpose |
|---|---|
| | Non-production testing with MinIO |
| | ~1 million series (single replicas, not HA) |
| | Production (~10 million series) |
## Production Values Example
```yaml
# Deployment mode
mimir:
  structuredConfig:
    multitenancy_enabled: true

    # Storage configuration
    common:
      storage:
        backend: azure  # or s3, gcs
        azure:
          account_name: ${AZURE_STORAGE_ACCOUNT}
          account_key: ${AZURE_STORAGE_KEY}
          endpoint_suffix: blob.core.windows.net
    blocks_storage:
      azure:
        container_name: mimir-blocks
    alertmanager_storage:
      azure:
        container_name: mimir-alertmanager
    ruler_storage:
      azure:
        container_name: mimir-ruler

# Distributor
distributor:
  replicas: 3
  resources:
    requests:
      cpu: 1
      memory: 2Gi
    limits:
      memory: 4Gi

# Ingester
ingester:
  replicas: 3
  zoneAwareReplication:
    enabled: true
  persistentVolume:
    enabled: true
    size: 50Gi
  resources:
    requests:
      cpu: 2
      memory: 8Gi
    limits:
      memory: 16Gi

# Querier
querier:
  replicas: 3
  resources:
    requests:
      cpu: 1
      memory: 2Gi
    limits:
      memory: 8Gi

# Query Frontend
query_frontend:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi

# Query Scheduler
query_scheduler:
  replicas: 2

# Store Gateway
store_gateway:
  replicas: 3
  persistentVolume:
    enabled: true
    size: 20Gi
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      memory: 8Gi

# Compactor
compactor:
  replicas: 1
  persistentVolume:
    enabled: true
    size: 50Gi
  resources:
    requests:
      cpu: 1
      memory: 4Gi
    limits:
      memory: 8Gi

# Gateway for external access
gateway:
  enabledNonEnterprise: true
  replicas: 2

# Monitoring
metaMonitoring:
  serviceMonitor:
    enabled: true
```

## Storage Configuration
### Critical Requirements
- Must create buckets manually - Mimir doesn't create them
- Separate buckets required - `blocks_storage`, `alertmanager_storage`, `ruler_storage` cannot share the same bucket+prefix
- Azure: Hierarchical namespace must be disabled
### Azure Blob Storage
```yaml
mimir:
  structuredConfig:
    common:
      storage:
        backend: azure
        azure:
          account_name: <storage-account-name>
          # Option 1: Account Key (via environment variable)
          account_key: ${AZURE_STORAGE_KEY}
          # Option 2: User-Assigned Managed Identity
          # user_assigned_id: <identity-client-id>
          endpoint_suffix: blob.core.windows.net
    blocks_storage:
      azure:
        container_name: mimir-blocks
    alertmanager_storage:
      azure:
        container_name: mimir-alertmanager
    ruler_storage:
      azure:
        container_name: mimir-ruler
```

### AWS S3
```yaml
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.us-east-1.amazonaws.com
          region: us-east-1
          access_key_id: ${AWS_ACCESS_KEY_ID}
          secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    blocks_storage:
      s3:
        bucket_name: mimir-blocks
    alertmanager_storage:
      s3:
        bucket_name: mimir-alertmanager
    ruler_storage:
      s3:
        bucket_name: mimir-ruler
```

### Google Cloud Storage
```yaml
mimir:
  structuredConfig:
    common:
      storage:
        backend: gcs
        gcs:
          service_account: ${GCS_SERVICE_ACCOUNT_JSON}
    blocks_storage:
      gcs:
        bucket_name: mimir-blocks
    alertmanager_storage:
      gcs:
        bucket_name: mimir-alertmanager
    ruler_storage:
      gcs:
        bucket_name: mimir-ruler
```

## Limits Configuration
```yaml
mimir:
  structuredConfig:
    limits:
      # Ingestion limits
      ingestion_rate: 25000        # Samples/sec per tenant
      ingestion_burst_size: 50000  # Burst size
      max_series_per_metric: 10000
      max_series_per_user: 1000000
      max_global_series_per_user: 1000000
      max_label_names_per_series: 30
      max_label_name_length: 1024
      max_label_value_length: 2048
      # Query limits
      max_fetched_series_per_query: 100000
      max_fetched_chunks_per_query: 2000000
      max_query_lookback: 0  # No limit
      max_query_parallelism: 32
      # Retention
      compactor_blocks_retention_period: 365d  # 1 year
      # Out-of-order samples
      out_of_order_time_window: 5m
```

### Per-Tenant Overrides (Runtime Configuration)
```yaml
# runtime-config.yaml
overrides:
  tenant1:
    ingestion_rate: 50000
    max_series_per_user: 2000000
    compactor_blocks_retention_period: 730d  # 2 years
  tenant2:
    ingestion_rate: 75000
    max_global_series_per_user: 5000000
```

Enable runtime configuration:

```yaml
mimir:
  structuredConfig:
    runtime_config:
      file: /etc/mimir/runtime-config.yaml
      period: 10s
```

## High Availability Configuration
### HA Tracker for Prometheus Deduplication
```yaml
mimir:
  structuredConfig:
    distributor:
      ha_tracker:
        enable_ha_tracker: true
        kvstore:
          store: memberlist
        cluster_label: cluster
        replica_label: __replica__
    memberlist:
      join_members:
        - mimir-gossip-ring.monitoring.svc.cluster.local:7946
```

**Prometheus Configuration:**

```yaml
global:
  external_labels:
    cluster: prom-team1
    __replica__: replica1

remote_write:
  - url: http://mimir-gateway:8080/api/v1/push
    headers:
      X-Scope-OrgID: my-tenant
```

### Zone-Aware Replication
```yaml
ingester:
  zoneAwareReplication:
    enabled: true
    zones:
      - name: zone-a
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1a
      - name: zone-b
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1b
      - name: zone-c
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1c

store_gateway:
  zoneAwareReplication:
    enabled: true
```

### Shuffle Sharding
Limits tenant data to a subset of instances for fault isolation:

```yaml
mimir:
  structuredConfig:
    limits:
      # Write path
      ingestion_tenant_shard_size: 3
      # Read path
      max_queriers_per_tenant: 5
      store_gateway_tenant_shard_size: 3
```

## OpenTelemetry Integration
### OTLP Metrics Ingestion
**OpenTelemetry Collector Config:**

```yaml
exporters:
  otlphttp:
    endpoint: http://mimir-gateway:8080/otlp
    headers:
      X-Scope-OrgID: "my-tenant"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
```

### Exponential Histograms (Experimental)
```go
// Go SDK configuration
Aggregation: metric.AggregationBase2ExponentialHistogram{
    MaxSize:  160, // Maximum buckets
    MaxScale: 20,  // Scale factor
},
```

**Key Benefits:**

- Explicit min/max values (no estimation needed)
- Better accuracy for extreme percentiles
- Native OTLP format preservation
## Multi-Tenancy
```yaml
mimir:
  structuredConfig:
    multitenancy_enabled: true
    no_auth_tenant: anonymous  # Used when multitenancy disabled
```

Query with tenant header:

```bash
curl -H "X-Scope-OrgID: tenant-a" \
  "http://mimir:8080/prometheus/api/v1/query?query=up"
```

**Tenant ID Constraints:**

- Max 150 characters
- Allowed: alphanumeric, `!-_.*'()`
- Prohibited: `.` or `..` alone, slashes, and the reserved name `__mimir_cluster`
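The constraints above can be encoded as a quick pre-flight check before provisioning a tenant. A minimal sketch (the `valid_tenant_id` helper is ours, not part of Mimir):

```shell
# Hypothetical helper: reject tenant IDs that violate the documented rules.
valid_tenant_id() {
  local id="$1"
  [ "${#id}" -le 150 ] || return 1                   # max length
  [ "$id" != "." ] && [ "$id" != ".." ] || return 1  # "." and ".." are reserved
  [ "$id" != "__mimir_cluster" ] || return 1         # reserved internal name
  # only alphanumerics and ! - _ . * ' ( ) are allowed (slashes are rejected)
  [[ "$id" =~ ^[A-Za-z0-9_.!*\'()-]+$ ]]
}

valid_tenant_id "tenant-a" && echo "tenant-a ok"       # accepted
valid_tenant_id "team/42" || echo "team/42 rejected"   # slash not allowed
```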
## API Reference

### Ingestion Endpoints
```bash
# Prometheus remote write
POST /api/v1/push

# OTLP metrics
POST /otlp/v1/metrics

# InfluxDB line protocol
POST /api/v1/push/influx/write
```

### Query Endpoints
```bash
# Instant query
GET,POST /prometheus/api/v1/query?query=<promql>&time=<timestamp>

# Range query
GET,POST /prometheus/api/v1/query_range?query=<promql>&start=<start>&end=<end>&step=<step>

# Labels
GET,POST /prometheus/api/v1/labels
GET /prometheus/api/v1/label/{name}/values

# Series
GET,POST /prometheus/api/v1/series

# Exemplars
GET,POST /prometheus/api/v1/query_exemplars

# Cardinality
GET,POST /prometheus/api/v1/cardinality/label_names
GET,POST /prometheus/api/v1/cardinality/active_series
```

### Administrative Endpoints
```bash
# Flush ingester data
GET,POST /ingester/flush

# Prepare shutdown
GET,POST,DELETE /ingester/prepare-shutdown

# Ring status
GET /ingester/ring
GET /distributor/ring
GET /store-gateway/ring
GET /compactor/ring

# Tenant stats
GET /distributor/all_user_stats
GET /api/v1/user_stats
GET /api/v1/user_limits
```

### Health & Config
```bash
GET /ready
GET /metrics
GET /config
GET /config?mode=diff
GET /runtime_config
```

## Azure Identity Configuration
### User-Assigned Managed Identity
**1. Create Identity:**

```bash
az identity create \
  --name mimir-identity \
  --resource-group <rg>

IDENTITY_CLIENT_ID=$(az identity show --name mimir-identity --resource-group <rg> --query clientId -o tsv)
IDENTITY_PRINCIPAL_ID=$(az identity show --name mimir-identity --resource-group <rg> --query principalId -o tsv)
```

**2. Assign to Node Pool:**

```bash
az vmss identity assign \
  --resource-group <aks-node-rg> \
  --name <vmss-name> \
  --identities /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/mimir-identity
```

**3. Grant Storage Permission:**

```bash
az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee-object-id $IDENTITY_PRINCIPAL_ID \
  --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage>
```

**4. Configure Mimir:**

```yaml
mimir:
  structuredConfig:
    common:
      storage:
        azure:
          user_assigned_id: <IDENTITY_CLIENT_ID>
```

### Workload Identity Federation
**1. Create Federated Credential:**

```bash
az identity federated-credential create \
  --name mimir-federated \
  --identity-name mimir-identity \
  --resource-group <rg> \
  --issuer <aks-oidc-issuer-url> \
  --subject system:serviceaccount:monitoring:mimir \
  --audiences api://AzureADTokenExchange
```

**2. Configure Helm Values:**

```yaml
serviceAccount:
  annotations:
    azure.workload.identity/client-id: <IDENTITY_CLIENT_ID>
podLabels:
  azure.workload.identity/use: "true"
```

## Troubleshooting
### Common Issues
**1. Container Not Found (Azure)**

```bash
# Create required containers
az storage container create --name mimir-blocks --account-name <storage>
az storage container create --name mimir-alertmanager --account-name <storage>
az storage container create --name mimir-ruler --account-name <storage>
```

**2. Authorization Failure (Azure)**

```bash
# Verify RBAC assignment
az role assignment list --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage>

# Assign if missing
az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee-object-id <principal-id> \
  --scope <storage-scope>

# Restart pod to refresh token
kubectl delete pod -n monitoring <ingester-pod>
```

**3. Ingester OOM**

```yaml
ingester:
  resources:
    limits:
      memory: 16Gi  # Increase memory
```

**4. Query Timeout**

```yaml
mimir:
  structuredConfig:
    querier:
      timeout: 5m
      max_concurrent: 20
```

**5. High Cardinality**

```yaml
mimir:
  structuredConfig:
    limits:
      max_series_per_user: 5000000
      max_series_per_metric: 50000
```

### Diagnostic Commands
```bash
# Check pod status
kubectl get pods -n monitoring -l app.kubernetes.io/name=mimir

# Check ingester logs
kubectl logs -n monitoring -l app.kubernetes.io/component=ingester --tail=100

# Check distributor logs
kubectl logs -n monitoring -l app.kubernetes.io/component=distributor --tail=100

# Verify readiness
kubectl exec -it <mimir-pod> -n monitoring -- wget -qO- http://localhost:8080/ready

# Check ring status
kubectl port-forward svc/mimir-distributor 8080:8080 -n monitoring
curl http://localhost:8080/distributor/ring

# Check configuration
kubectl exec -it <mimir-pod> -n monitoring -- cat /etc/mimir/mimir.yaml

# Validate configuration before deployment
mimir -modules -config.file <path-to-config-file>
```

## Key Metrics to Monitor
```promql
# Ingestion rate per tenant
sum by (user) (rate(cortex_distributor_received_samples_total[5m]))

# Series count per tenant
sum by (user) (cortex_ingester_memory_series)

# Query latency
histogram_quantile(0.99, sum by (le) (rate(cortex_request_duration_seconds_bucket{route=~"/api/prom/api/v1/query.*"}[5m])))

# Compactor status
cortex_compactor_runs_completed_total
cortex_compactor_runs_failed_total

# Store-gateway block sync
cortex_bucket_store_blocks_loaded
```
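These metrics plug directly into alerting. A sketch of a Prometheus-style alerting rule built on the compactor counters above (the group name, threshold, and durations are illustrative choices, not recommended values):

```yaml
# illustrative rule group; load via your Prometheus/Ruler rule storage
groups:
  - name: mimir-health
    rules:
      - alert: MimirCompactorRunsFailing
        expr: increase(cortex_compactor_runs_failed_total[2h]) > 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Mimir compactor runs are failing"
```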
## Circuit Breakers (Ingester)
```yaml
mimir:
  structuredConfig:
    ingester:
      push_circuit_breaker:
        enabled: true
        request_timeout: 2s
        failure_threshold_percentage: 10
        cooldown_period: 10s
      read_circuit_breaker:
        enabled: true
        request_timeout: 30s
```

**States:**

- Closed - Normal operation
- Open - Stops forwarding to failing instances
- Half-open - Limited trial requests after cooldown