
Grafana Mimir Skill

Comprehensive guide to Grafana Mimir - the horizontally scalable, highly available, multi-tenant time series database for long-term Prometheus metrics storage.

What is Mimir?

Mimir is an open-source, horizontally scalable, highly available, multi-tenant long-term storage solution for Prometheus and OpenTelemetry metrics that:
  • Overcomes Prometheus limitations - Scalability and long-term retention
  • Multi-tenant by default - Built-in tenant isolation via the `X-Scope-OrgID` header
  • Stores data in object storage - S3, GCS, Azure Blob Storage, or Swift
  • 100% Prometheus compatible - PromQL queries, remote write protocol
  • Part of the LGTM+ Stack - Logs, Grafana, Traces, Metrics unified observability

Architecture Overview

Core Components

| Component | Purpose |
|---|---|
| Distributor | Validates requests, routes incoming metrics to ingesters via hash ring |
| Ingester | Stores time-series data in memory, flushes to object storage |
| Querier | Executes PromQL queries against ingesters and store-gateways |
| Query Frontend | Caches query results, optimizes and splits queries |
| Query Scheduler | Manages per-tenant query queues for fairness |
| Store-Gateway | Provides access to historical metric blocks in object storage |
| Compactor | Consolidates and optimizes stored metric data blocks |
| Ruler | Evaluates recording and alerting rules (optional) |
| Alertmanager | Handles alert routing and deduplication (optional) |

Data Flow

Write Path:

```
Prometheus/OTel → Distributor → Ingester → Object Storage
                  Hash Ring
                  (routes by series)
```

Read Path:

```
Query → Query Frontend → Query Scheduler → Querier
                                    Ingesters (recent)
                                    Store-Gateway (historical)
```
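The write path's hash-ring routing can be sketched in miniature. This is an illustrative toy, not Mimir's implementation (the real ring assigns many tokens per ingester and is zone-aware); the names `pick_ingesters` and `_token` are invented for the sketch:

```python
import hashlib

def _token(s: str) -> int:
    # Position on the ring, derived from a stable hash.
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

def pick_ingesters(series_labels: str, ingesters: list[str], replication_factor: int = 3) -> list[str]:
    """Walk clockwise from the series' ring position, collecting
    `replication_factor` distinct ingesters."""
    ring = sorted(ingesters, key=_token)
    tokens = [_token(i) for i in ring]
    pos = _token(series_labels)
    # First ingester whose token is >= the series position (wrap to 0).
    start = next((idx for idx, t in enumerate(tokens) if t >= pos), 0)
    return [ring[(start + k) % len(ring)] for k in range(replication_factor)]

owners = pick_ingesters('{__name__="up",job="node"}', [f"ingester-{n}" for n in range(4)])
print(owners)
```

Because routing depends only on the hash of the label set, every distributor replica makes the same placement decision without coordination.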

Deployment Modes

1. Monolithic Mode (`-target=all`)

  • All components run in a single process
  • Best for: development, testing, small scale (~1M series)
  • Horizontally scalable by deploying multiple instances
  • Not recommended for large scale (all components scale together)

2. Microservices Mode (Distributed) - Recommended for Production

```yaml
# Using the mimir-distributed Helm chart
distributor:
  replicas: 3
ingester:
  replicas: 3
  zoneAwareReplication:
    enabled: true
querier:
  replicas: 3
queryFrontend:
  replicas: 2
queryScheduler:
  replicas: 2
storeGateway:
  replicas: 3
compactor:
  replicas: 1
```

Helm Deployment

Add Repository

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
```

Install Distributed Mimir

```bash
helm install mimir grafana/mimir-distributed \
  --namespace monitoring \
  --values values.yaml
```

Pre-Built Values Files

| File | Purpose |
|---|---|
| values.yaml | Non-production testing with MinIO |
| small.yaml | ~1 million series (single replicas, not HA) |
| large.yaml | Production (~10 million series) |

Production Values Example

```yaml
# Deployment mode
mimir:
  structuredConfig:
    multitenancy_enabled: true

    # Storage configuration
    common:
      storage:
        backend: azure  # or s3, gcs
        azure:
          account_name: ${AZURE_STORAGE_ACCOUNT}
          account_key: ${AZURE_STORAGE_KEY}
          endpoint_suffix: blob.core.windows.net

    blocks_storage:
      azure:
        container_name: mimir-blocks

    alertmanager_storage:
      azure:
        container_name: mimir-alertmanager

    ruler_storage:
      azure:
        container_name: mimir-ruler

# Distributor
distributor:
  replicas: 3
  resources:
    requests:
      cpu: 1
      memory: 2Gi
    limits:
      memory: 4Gi

# Ingester
ingester:
  replicas: 3
  zoneAwareReplication:
    enabled: true
  persistentVolume:
    enabled: true
    size: 50Gi
  resources:
    requests:
      cpu: 2
      memory: 8Gi
    limits:
      memory: 16Gi

# Querier
querier:
  replicas: 3
  resources:
    requests:
      cpu: 1
      memory: 2Gi
    limits:
      memory: 8Gi

# Query Frontend
query_frontend:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi

# Query Scheduler
query_scheduler:
  replicas: 2

# Store Gateway
store_gateway:
  replicas: 3
  persistentVolume:
    enabled: true
    size: 20Gi
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      memory: 8Gi

# Compactor
compactor:
  replicas: 1
  persistentVolume:
    enabled: true
    size: 50Gi
  resources:
    requests:
      cpu: 1
      memory: 4Gi
    limits:
      memory: 8Gi

# Gateway for external access
gateway:
  enabledNonEnterprise: true
  replicas: 2

# Monitoring
metaMonitoring:
  serviceMonitor:
    enabled: true
```

Storage Configuration

Critical Requirements

  • Must create buckets manually - Mimir doesn't create them
  • Separate buckets required - blocks_storage, alertmanager_storage, and ruler_storage cannot share the same bucket+prefix
  • Azure: hierarchical namespace must be disabled

Azure Blob Storage

```yaml
mimir:
  structuredConfig:
    common:
      storage:
        backend: azure
        azure:
          account_name: <storage-account-name>
          # Option 1: Account Key (via environment variable)
          account_key: ${AZURE_STORAGE_KEY}
          # Option 2: User-Assigned Managed Identity
          # user_assigned_id: <identity-client-id>
          endpoint_suffix: blob.core.windows.net

    blocks_storage:
      azure:
        container_name: mimir-blocks

    alertmanager_storage:
      azure:
        container_name: mimir-alertmanager

    ruler_storage:
      azure:
        container_name: mimir-ruler
```

AWS S3

```yaml
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.us-east-1.amazonaws.com
          region: us-east-1
          access_key_id: ${AWS_ACCESS_KEY_ID}
          secret_access_key: ${AWS_SECRET_ACCESS_KEY}

    blocks_storage:
      s3:
        bucket_name: mimir-blocks

    alertmanager_storage:
      s3:
        bucket_name: mimir-alertmanager

    ruler_storage:
      s3:
        bucket_name: mimir-ruler
```

Google Cloud Storage

```yaml
mimir:
  structuredConfig:
    common:
      storage:
        backend: gcs
        gcs:
          service_account: ${GCS_SERVICE_ACCOUNT_JSON}

    blocks_storage:
      gcs:
        bucket_name: mimir-blocks

    alertmanager_storage:
      gcs:
        bucket_name: mimir-alertmanager

    ruler_storage:
      gcs:
        bucket_name: mimir-ruler
```

Limits Configuration

```yaml
mimir:
  structuredConfig:
    limits:
      # Ingestion limits
      ingestion_rate: 25000                    # Samples/sec per tenant
      ingestion_burst_size: 50000              # Burst size
      max_series_per_metric: 10000
      max_series_per_user: 1000000
      max_global_series_per_user: 1000000
      max_label_names_per_series: 30
      max_label_name_length: 1024
      max_label_value_length: 2048

      # Query limits
      max_fetched_series_per_query: 100000
      max_fetched_chunks_per_query: 2000000
      max_query_lookback: 0                    # No limit
      max_query_parallelism: 32

      # Retention
      compactor_blocks_retention_period: 365d  # 1 year

      # Out-of-order samples
      out_of_order_time_window: 5m
```
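To translate these limits into capacity numbers: each active series produces one sample per scrape interval, so the per-tenant `ingestion_rate` can be estimated as below. This is a sizing heuristic with an assumed 1.5x headroom factor, not an official formula:

```python
def required_ingestion_rate(active_series: int, scrape_interval_s: float, headroom: float = 1.5) -> int:
    """One sample per active series per scrape interval; headroom covers
    bursts and recording-rule output."""
    return int(active_series * headroom / scrape_interval_s)

# 1,000,000 active series scraped every 15s, with 1.5x headroom:
print(required_ingestion_rate(1_000_000, 15))  # 100000
```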

Per-Tenant Overrides (Runtime Configuration)

runtime-config.yaml:

```yaml
overrides:
  tenant1:
    ingestion_rate: 50000
    max_series_per_user: 2000000
    compactor_blocks_retention_period: 730d  # 2 years
  tenant2:
    ingestion_rate: 75000
    max_global_series_per_user: 5000000
```

Enable runtime configuration:

```yaml
mimir:
  structuredConfig:
    runtime_config:
      file: /etc/mimir/runtime-config.yaml
      period: 10s
```

High Availability Configuration

HA Tracker for Prometheus Deduplication

```yaml
mimir:
  structuredConfig:
    distributor:
      ha_tracker:
        enable_ha_tracker: true
        kvstore:
          store: memberlist
        cluster_label: cluster
        replica_label: __replica__

    memberlist:
      join_members:
        - mimir-gossip-ring.monitoring.svc.cluster.local:7946
```

Prometheus Configuration:

```yaml
global:
  external_labels:
    cluster: prom-team1
    __replica__: replica1

remote_write:
  - url: http://mimir-gateway:8080/api/v1/push
    headers:
      X-Scope-OrgID: my-tenant
```
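The deduplication the HA tracker provides reduces to a simple rule: per `cluster` label one replica is elected, and samples from the other replica are dropped. A toy sketch of that rule (illustrative only; the real tracker also fails over to another replica when the elected one stops sending, which is omitted here):

```python
# Elected replica per HA cluster, keyed by the `cluster` external label.
elected: dict[str, str] = {}

def accept_sample(cluster: str, replica: str) -> bool:
    """Accept samples only from the currently elected replica;
    the first replica seen for a cluster wins the election."""
    leader = elected.setdefault(cluster, replica)
    return leader == replica

print(accept_sample("prom-team1", "replica1"))  # True  (elected)
print(accept_sample("prom-team1", "replica2"))  # False (deduplicated HA pair)
```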

Zone-Aware Replication

```yaml
ingester:
  zoneAwareReplication:
    enabled: true
    zones:
      - name: zone-a
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1a
      - name: zone-b
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1b
      - name: zone-c
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1c

store_gateway:
  zoneAwareReplication:
    enabled: true
```

Shuffle Sharding

Limits tenant data to a subset of instances for fault isolation:

```yaml
mimir:
  structuredConfig:
    limits:
      # Write path
      ingestion_tenant_shard_size: 3

      # Read path
      max_queriers_per_tenant: 5
      store_gateway_tenant_shard_size: 3
```
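A common way to implement shuffle sharding is to rank all instances by a hash seeded with the tenant ID and take the first `shard_size`; each tenant then gets a stable, pseudo-random subset, and two tenants rarely share their full subset. This sketch shows the idea (illustrative, not Mimir's exact algorithm):

```python
import hashlib

def tenant_shard(tenant_id: str, instances: list[str], shard_size: int) -> list[str]:
    """Deterministically pick `shard_size` instances for a tenant by
    ranking every instance on a hash of (tenant, instance)."""
    ranked = sorted(instances, key=lambda i: hashlib.sha256(f"{tenant_id}/{i}".encode()).hexdigest())
    return ranked[:shard_size]

pool = [f"ingester-{n}" for n in range(10)]
print(tenant_shard("tenant-a", pool, 3))
```

An outage of one tenant's shard therefore leaves most other tenants untouched, which is the fault-isolation property the config above buys.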

OpenTelemetry Integration

OTLP Metrics Ingestion

OpenTelemetry Collector Config:

```yaml
exporters:
  otlphttp:
    endpoint: http://mimir-gateway:8080/otlp
    headers:
      X-Scope-OrgID: "my-tenant"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
```

Exponential Histograms (Experimental)

```go
// Go SDK configuration
Aggregation: metric.AggregationBase2ExponentialHistogram{
    MaxSize:  160,      // Maximum buckets
    MaxScale: 20,       // Scale factor
}
```

Key Benefits:
  • Explicit min/max values (no estimation needed)
  • Better accuracy for extreme percentiles
  • Native OTLP format preservation
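The bucket layout behind base-2 exponential histograms is easy to compute: at `scale = s` the bucket base is `2^(2^-s)`, and a value `v` lands in the bucket `(base^index, base^(index+1)]`. A small sketch of the index formula from the OpenTelemetry data model (`bucket_index` is our name for it):

```python
import math

def bucket_index(value: float, scale: int) -> int:
    """Bucket index for a positive value in a base-2 exponential
    histogram: boundaries are powers of base = 2**(2**-scale)."""
    return math.ceil(math.log2(value) * (1 << scale)) - 1

# At scale 3 the base is 2**(1/8) ≈ 1.09, i.e. roughly 9% relative
# precision per bucket; 10.0 falls in bucket 26: (2**3.25, 2**3.375].
print(bucket_index(10.0, 3))  # 26
```

Higher `MaxScale` means finer buckets; the SDK lowers the scale automatically when `MaxSize` buckets can't cover the observed range.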

Multi-Tenancy

```yaml
mimir:
  structuredConfig:
    multitenancy_enabled: true
    no_auth_tenant: anonymous    # Used when multitenancy is disabled
```

Query with tenant header:

```bash
curl -H "X-Scope-OrgID: tenant-a" \
  "http://mimir:8080/prometheus/api/v1/query?query=up"
```

Tenant ID Constraints:
  • Max 150 characters
  • Allowed: alphanumeric characters plus `!` `-` `_` `.` `*` `'` `(` `)`
  • Prohibited: `.` or `..` alone, `__mimir_cluster`, slashes
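The constraints above can be checked mechanically before provisioning a tenant; a small validator following the documented rules (a sketch, not Mimir's own validation code):

```python
import re

# Alphanumerics plus the documented special characters.
ALLOWED = re.compile(r"[A-Za-z0-9!\-_.*'()]+")

def valid_tenant_id(tenant: str) -> bool:
    """Apply the documented tenant ID constraints: <=150 chars,
    restricted alphabet, no '.', '..', '__mimir_cluster', no slashes."""
    if not tenant or len(tenant) > 150:
        return False
    if tenant in (".", "..", "__mimir_cluster"):
        return False
    if "/" in tenant or "\\" in tenant:
        return False
    return ALLOWED.fullmatch(tenant) is not None

print(valid_tenant_id("tenant-a"))  # True
print(valid_tenant_id(".."))        # False
```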

API Reference

Ingestion Endpoints

```bash
# Prometheus remote write
POST /api/v1/push

# OTLP metrics
POST /otlp/v1/metrics

# InfluxDB line protocol
POST /api/v1/push/influx/write
```

Query Endpoints

```bash
# Instant query
GET,POST /prometheus/api/v1/query?query=<promql>&time=<timestamp>

# Range query
GET,POST /prometheus/api/v1/query_range?query=<promql>&start=<start>&end=<end>&step=<step>

# Labels
GET,POST /prometheus/api/v1/labels
GET      /prometheus/api/v1/label/{name}/values

# Series
GET,POST /prometheus/api/v1/series

# Exemplars
GET,POST /prometheus/api/v1/query_exemplars

# Cardinality
GET,POST /prometheus/api/v1/cardinality/label_names
GET,POST /prometheus/api/v1/cardinality/active_series
```
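For range queries, the `start`/`end`/`step` parameters determine how many times the PromQL expression is evaluated, which is the main driver of query cost:

```python
def range_query_points(start: int, end: int, step: int) -> int:
    """Number of evaluation timestamps a range query produces:
    start, start+step, ..., up to and including end."""
    return (end - start) // step + 1

# A 1-hour window at a 15s step evaluates the expression 241 times.
print(range_query_points(0, 3600, 15))  # 241
```

Wide windows at small steps multiply quickly, which is why the query frontend splits and caches range queries and why `max_query_parallelism` matters.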
undefined

Administrative Endpoints

管理端点

bash
undefined
bash
undefined

Flush ingester data

刷新Ingester数据

GET,POST /ingester/flush
GET,POST /ingester/flush

Prepare shutdown

准备关机

GET,POST,DELETE /ingester/prepare-shutdown
GET,POST,DELETE /ingester/prepare-shutdown

Ring status

哈希环状态

GET /ingester/ring GET /distributor/ring GET /store-gateway/ring GET /compactor/ring
GET /ingester/ring GET /distributor/ring GET /store-gateway/ring GET /compactor/ring

Tenant stats

租户统计

GET /distributor/all_user_stats GET /api/v1/user_stats GET /api/v1/user_limits
undefined
GET /distributor/all_user_stats GET /api/v1/user_stats GET /api/v1/user_limits
undefined

Health & Config

```bash
GET /ready
GET /metrics
GET /config
GET /config?mode=diff
GET /runtime_config
```

Azure Identity Configuration

User-Assigned Managed Identity

1. Create Identity:

```bash
az identity create \
  --name mimir-identity \
  --resource-group <rg>

IDENTITY_CLIENT_ID=$(az identity show --name mimir-identity --resource-group <rg> --query clientId -o tsv)
IDENTITY_PRINCIPAL_ID=$(az identity show --name mimir-identity --resource-group <rg> --query principalId -o tsv)
```

2. Assign to Node Pool:

```bash
az vmss identity assign \
  --resource-group <aks-node-rg> \
  --name <vmss-name> \
  --identities /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/mimir-identity
```

3. Grant Storage Permission:

```bash
az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee-object-id $IDENTITY_PRINCIPAL_ID \
  --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage>
```

4. Configure Mimir:

```yaml
mimir:
  structuredConfig:
    common:
      storage:
        azure:
          user_assigned_id: <IDENTITY_CLIENT_ID>
```

Workload Identity Federation

1. Create Federated Credential:

```bash
az identity federated-credential create \
  --name mimir-federated \
  --identity-name mimir-identity \
  --resource-group <rg> \
  --issuer <aks-oidc-issuer-url> \
  --subject system:serviceaccount:monitoring:mimir \
  --audiences api://AzureADTokenExchange
```

2. Configure Helm Values:

```yaml
serviceAccount:
  annotations:
    azure.workload.identity/client-id: <IDENTITY_CLIENT_ID>

podLabels:
  azure.workload.identity/use: "true"
```

Troubleshooting

Common Issues

1. Container Not Found (Azure)

```bash
# Create required containers
az storage container create --name mimir-blocks --account-name <storage>
az storage container create --name mimir-alertmanager --account-name <storage>
az storage container create --name mimir-ruler --account-name <storage>
```

2. Authorization Failure (Azure)

```bash
# Verify RBAC assignment
az role assignment list --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage>

# Assign if missing
az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee-object-id <principal-id> \
  --scope <storage-scope>

# Restart pod to refresh token
kubectl delete pod -n monitoring <ingester-pod>
```

3. Ingester OOM

```yaml
ingester:
  resources:
    limits:
      memory: 16Gi  # Increase memory
```

4. Query Timeout

```yaml
mimir:
  structuredConfig:
    querier:
      timeout: 5m
      max_concurrent: 20
```

5. High Cardinality

```yaml
mimir:
  structuredConfig:
    limits:
      max_series_per_user: 5000000
      max_series_per_metric: 50000
```

Diagnostic Commands

```bash
# Check pod status
kubectl get pods -n monitoring -l app.kubernetes.io/name=mimir

# Check ingester logs
kubectl logs -n monitoring -l app.kubernetes.io/component=ingester --tail=100

# Check distributor logs
kubectl logs -n monitoring -l app.kubernetes.io/component=distributor --tail=100

# Verify readiness
kubectl exec -it <mimir-pod> -n monitoring -- wget -qO- http://localhost:8080/ready

# Check ring status
kubectl port-forward svc/mimir-distributor 8080:8080 -n monitoring
curl http://localhost:8080/distributor/ring

# Check configuration
kubectl exec -it <mimir-pod> -n monitoring -- cat /etc/mimir/mimir.yaml

# Validate configuration before deployment
mimir -modules -config.file <path-to-config-file>
```

Key Metrics to Monitor

```promql
# Ingestion rate per tenant
sum by (user) (rate(cortex_distributor_received_samples_total[5m]))

# Series count per tenant
sum by (user) (cortex_ingester_memory_series)

# Query latency
histogram_quantile(0.99, sum by (le) (rate(cortex_request_duration_seconds_bucket{route=~"/api/prom/api/v1/query.*"}[5m])))

# Compactor status
cortex_compactor_runs_completed_total
cortex_compactor_runs_failed_total

# Store-gateway block sync
cortex_bucket_store_blocks_loaded
```
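As a reminder of what `histogram_quantile` computes in the query-latency expression: it finds the cumulative `le` bucket containing the requested rank and interpolates linearly within it. A minimal sketch of that calculation (simplified; PromQL also handles `+Inf` buckets and edge cases not shown here):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-th quantile from cumulative (le, count) buckets
    by linear interpolation inside the bucket holding rank q*total."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 2.5s.
print(histogram_quantile(0.99, [(0.1, 50), (0.5, 90), (2.5, 100)]))
```

This is why bucket boundaries matter: the p99 estimate is only as precise as the bucket it falls into.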

Circuit Breakers (Ingester)

```yaml
mimir:
  structuredConfig:
    ingester:
      push_circuit_breaker:
        enabled: true
        request_timeout: 2s
        failure_threshold_percentage: 10
        cooldown_period: 10s
      read_circuit_breaker:
        enabled: true
        request_timeout: 30s
```

States:
  1. Closed - Normal operation
  2. Open - Stops forwarding to failing instances
  3. Half-open - Limited trial requests after cooldown
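The three states map onto a small state machine. This toy version (invented names, a simple failure count instead of the percentage-over-window logic the real breaker uses) shows the transitions:

```python
import time

class CircuitBreaker:
    """Toy closed/open/half-open state machine, illustrative only."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 10.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow(self) -> bool:
        # After the cooldown, let a trial request through (half-open).
        if self.state == "open" and time.monotonic() - self.opened_at >= self.cooldown_s:
            self.state = "half-open"
        return self.state != "open"

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.state = 0, "closed"
            return
        self.failures += 1
        # A failed trial, or too many failures, (re)opens the breaker.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "open", time.monotonic()

cb = CircuitBreaker(failure_threshold=2, cooldown_s=0.01)
cb.record(False)
cb.record(False)
print(cb.state)        # open: requests are rejected
time.sleep(0.02)
print(cb.allow())      # True: half-open trial allowed after cooldown
```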

External Resources
