app-observability

Grafana Cloud Application Observability Skill

Overview

Grafana Cloud provides three tightly related application monitoring products:
  1. Application Observability (APM) - RED metrics from OTel traces, service inventory, service maps
  2. Frontend Observability - RUM/Faro SDK for browser apps, session replay, web vitals
  3. AI Observability - LLM/model monitoring via OpenLIT + OTel, token/cost/latency metrics
All three integrate with Grafana Tempo (traces), Loki (logs), and Pyroscope (profiles) for full-stack correlation.

Application Observability (APM)

What It Is

Application Observability is a pre-built APM experience in Grafana Cloud, built on OpenTelemetry. It generates RED (Rate, Error, Duration) metrics from distributed traces via span metrics, then surfaces them in:
  • Service Inventory - table of all services with RED metrics at a glance
  • Service Overview - per-service RED metrics, top operations, error breakdown
  • Service Map - node graph of service dependencies with flow visualization
  • Operations view - per-endpoint RED metrics with p50/p95/p99 latency

How Metrics Are Generated

Application Observability does NOT rely on traditional Prometheus scraping. Metrics come from span metrics - aggregations computed from OTel trace data:
  • Source: OTel traces sent to Grafana Tempo or Grafana Alloy
  • Generation method: Tempo's metrics-generator OR the `spanmetrics` connector in Alloy/OTel Collector
  • Result: Prometheus-compatible metrics stored in Grafana Mimir
Key generated metric names:
  • Via Tempo metrics-generator: `traces_spanmetrics_calls_total`, `traces_spanmetrics_duration_seconds`
  • Via OTel Collector spanmetrics connector: `traces_span_metrics_calls_total`, `traces_span_metrics_duration_seconds`
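Conceptually, span-metrics generation is just an aggregation over finished spans. A minimal Python sketch of the idea (toy data; the real metrics-generator also attaches labels such as span name and status code, and buckets durations into histograms):

```python
from collections import defaultdict

# Toy span records: (service, duration_seconds, is_error)
spans = [
    ("checkout", 0.12, False),
    ("checkout", 0.31, True),
    ("cart", 0.05, False),
]

calls_total = defaultdict(int)      # ~ traces_spanmetrics_calls_total
duration_sum = defaultdict(float)   # ~ traces_spanmetrics_duration_seconds (sum)
errors_total = defaultdict(int)     # ~ calls_total filtered by error status

for service, duration, is_error in spans:
    calls_total[service] += 1
    duration_sum[service] += duration
    if is_error:
        errors_total[service] += 1

print(calls_total["checkout"])   # 2
print(errors_total["checkout"])  # 1
```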

Required OTel Resource Attributes

These attributes MUST be present on all spans for Application Observability to work:

| Attribute | Grafana Label | Purpose |
|---|---|---|
| `service.name` | `service_name` / part of `job` | Identifies the service |
| `service.namespace` | part of `job` label | Groups services; `job = namespace/service.name` |
| `deployment.environment` | `deployment_environment` | Env filter (prod/dev/staging) |

The `job` label is constructed as:
  • `service.namespace/service.name` when namespace is set
  • `service.name` alone when no namespace
Additional recommended attributes:
  • `service.version` - shown in service overview
  • `k8s.cluster.name` - for K8s environments
  • `k8s.namespace.name` - Kubernetes namespace
  • `cloud.region` - for multi-region setups
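The `job`-label rule above can be sketched as a small function (the name `job_label` is illustrative, not a Grafana API):

```python
def job_label(service_name, service_namespace=None):
    # job = "<service.namespace>/<service.name>" when a namespace is set,
    # otherwise just "<service.name>"
    if service_namespace:
        return f"{service_namespace}/{service_name}"
    return service_name

print(job_label("my-api", "myteam"))  # myteam/my-api
print(job_label("my-api"))            # my-api
```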

Setting Environment Variables for OTel SDK

```bash
export OTEL_SERVICE_NAME="my-api"
export OTEL_RESOURCE_ATTRIBUTES="service.namespace=myteam,deployment.environment=production,service.version=1.2.3"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
```
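`OTEL_RESOURCE_ATTRIBUTES` is a comma-separated list of `key=value` pairs; a simplified sketch of how an SDK reads it (real SDKs also handle percent-encoded values, per the OTel environment-variable spec):

```python
import os

def parse_resource_attributes(raw):
    # "k1=v1,k2=v2" -> {"k1": "v1", "k2": "v2"}
    attrs = {}
    for pair in raw.split(","):
        if "=" in pair:
            key, _, value = pair.partition("=")
            attrs[key.strip()] = value.strip()
    return attrs

raw = os.environ.get(
    "OTEL_RESOURCE_ATTRIBUTES",
    "service.namespace=myteam,deployment.environment=production,service.version=1.2.3",
)
print(parse_resource_attributes(raw))
```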

Grafana Alloy Configuration (River syntax)

Alloy acts as a local OTel Collector and forwards data to Grafana Cloud:
```river
// Receive traces, metrics, logs from instrumented apps
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    metrics = [otelcol.processor.resourcedetection.default.input]
    logs    = [otelcol.processor.resourcedetection.default.input]
    traces  = [otelcol.processor.resourcedetection.default.input]
  }
}

// Auto-detect host/cloud metadata
otelcol.processor.resourcedetection "default" {
  detectors = ["env", "system", "gcp", "aws", "azure"]
  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

// Batch for efficiency
otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
    logs    = [otelcol.exporter.otlphttp.grafana_cloud.input]
    traces  = [otelcol.exporter.otlphttp.grafana_cloud.input]
  }
}

// Auth
otelcol.auth.basic "grafana_cloud" {
  username = env("GRAFANA_CLOUD_INSTANCE_ID")
  password = env("GRAFANA_CLOUD_API_KEY")
}

// Export to Grafana Cloud OTLP endpoint
otelcol.exporter.otlphttp "grafana_cloud" {
  client {
    endpoint = env("GRAFANA_CLOUD_OTLP_ENDPOINT")
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}
```

Required environment variables for Alloy:

```bash
GRAFANA_CLOUD_OTLP_ENDPOINT=https://otlp-gateway-<region>.grafana.net/otlp
GRAFANA_CLOUD_INSTANCE_ID=<your-instance-id>
GRAFANA_CLOUD_API_KEY=<your-api-key>
```

Service Map

The Service Map uses Tempo's metrics-generator to produce service graph metrics:
  • Node graph shows services as nodes, HTTP/gRPC calls as edges
  • Edge thickness indicates request rate; color indicates error rate
  • Clicking a node navigates to Service Overview
  • Requires `span.kind` (CLIENT/SERVER) on spans for directional edges
Enable in Tempo (managed by Grafana Cloud automatically):
  • The `service-graphs` metrics generator is enabled by default in Grafana Cloud Tempo
  • Uses the `traces_service_graph_request_total` and `traces_service_graph_request_failed_total` metrics

Integration with Traces, Logs, Profiles

Application Observability provides one-click correlation:
  • Traces: Click any metric spike to open exemplar traces in Grafana Tempo
  • Logs: Service logs shown in Service Overview; correlated via the `service.name` label
  • Profiles: "Go to profiles" button in Service Overview when Pyroscope is configured
  • Frontend: Link from Application Observability to Frontend Observability for the same service

Frontend Observability (Faro)

What It Is

Grafana Faro is an open-source JavaScript/TypeScript SDK for Real User Monitoring (RUM). It instruments browser applications to capture:
  • Web vitals: Core Web Vitals (LCP, CLS, INP) and additional performance metrics
  • Errors: Unhandled exceptions, rejected promises with stack traces
  • Sessions: User journeys, page views, navigation timing
  • Logs: Custom log messages from frontend code
  • Traces: Distributed traces via OpenTelemetry-JS (correlates with backend spans)
  • Session replay: rrweb-based DOM recording for reproducing user issues
Data flow: Faro SDK -> Grafana Alloy (faro receiver) OR Grafana Cloud OTLP endpoint -> Loki (logs) + Tempo (traces) + Mimir (metrics)

Faro SDK Packages

@grafana/faro-core          # Core SDK - signals, transports, API
@grafana/faro-web-sdk       # Web instrumentations + transports
@grafana/faro-web-tracing   # OpenTelemetry-JS distributed tracing
@grafana/faro-react         # React-specific integrations (error boundary, router)

Basic JavaScript Setup (npm)

```bash
npm install @grafana/faro-web-sdk
```

or:

```bash
yarn add @grafana/faro-web-sdk
```

```javascript
import {
  initializeFaro,
  getWebInstrumentations,
} from '@grafana/faro-web-sdk';

const faro = initializeFaro({
  url: 'https://faro-collector-prod-<region>.grafana.net/collect/<app-key>',
  app: {
    name: 'my-frontend-app',
    version: '1.0.0',
    environment: 'production',
  },
  instrumentations: [
    ...getWebInstrumentations({
      captureConsole: true,
    }),
  ],
});

// Manual API usage
faro.api.pushLog(['User clicked checkout button']);
faro.api.pushError(new Error('Payment failed'));
faro.api.pushEvent('button_click', { button: 'checkout' });
```

CDN Setup (no bundler)

```html
<script src="https://unpkg.com/@grafana/faro-web-sdk@latest/dist/library/faro-web-sdk.iife.js"></script>
<script>
  const { initializeFaro, getWebInstrumentations } = GrafanaFaroWebSdk;

  initializeFaro({
    url: 'https://faro-collector-prod-<region>.grafana.net/collect/<app-key>',
    app: { name: 'my-app', version: '1.0.0' },
    instrumentations: [...getWebInstrumentations()],
  });
</script>
```

React Setup with Tracing

```bash
npm install @grafana/faro-react @grafana/faro-web-tracing
```

```javascript
import { initializeFaro, getWebInstrumentations } from '@grafana/faro-web-sdk';
import { TracingInstrumentation } from '@grafana/faro-web-tracing';
import {
  createReactRouterV6DataOptions,
  ReactIntegration,
  withFaroRouterInstrumentation,
} from '@grafana/faro-react';
import { createBrowserRouter, RouterProvider } from 'react-router-dom';

const faro = initializeFaro({
  url: 'https://faro-collector-prod-<region>.grafana.net/collect/<app-key>',
  app: {
    name: 'my-react-app',
    version: '1.0.0',
    environment: 'production',
  },
  instrumentations: [
    ...getWebInstrumentations({ captureConsole: true }),
    new TracingInstrumentation(),
    new ReactIntegration({
      router: createReactRouterV6DataOptions({}),
    }),
  ],
});

const router = withFaroRouterInstrumentation(
  createBrowserRouter([
    { path: '/', element: <Home /> },
    { path: '/about', element: <About /> },
  ])
);

function App() {
  return <RouterProvider router={router} />;
}
```

Session Configuration

```javascript
initializeFaro({
  url: '...',
  app: { name: 'my-app' },
  sessionTracking: {
    enabled: true,
    persistent: true,
    maxSessionPersistenceTime: 4 * 60 * 60 * 1000, // 4 hours in ms
    samplingRate: 1,           // 1 = 100%, 0.5 = 50% of sessions
    onSessionChange: (oldSession, newSession) => {
      console.log('Session changed', newSession.id);
    },
  },
  instrumentations: [...getWebInstrumentations()],
});
```

Getting the Collector URL

  1. In Grafana Cloud, go to Connections (left menu) > search "Frontend Observability"
  2. Click the Frontend Observability card
  3. Navigate to the Web SDK Configuration tab
  4. Copy the `url` value - this is your unique collector endpoint
  5. Paste it into your `initializeFaro({ url: '...' })` call

What Faro Captures Automatically

When using `getWebInstrumentations()`:
  • Page views and navigation timing
  • Core Web Vitals (LCP, CLS, INP - replaces FID in Faro v2)
  • JavaScript errors and unhandled rejections
  • Console errors/warnings (when `captureConsole: true`)
  • Resource loading performance
  • User interactions (clicks, form events)
  • Fetch/XHR request timing

Correlation with Backend Traces

When `TracingInstrumentation` is included, Faro:
  • Injects `traceparent`/`tracestate` headers into outgoing fetch/XHR requests
  • Creates spans for each HTTP call
  • Links the browser session to backend traces in Tempo
  • Enables the "Frontend to Backend" trace waterfall in Grafana
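The injected `traceparent` header follows the W3C Trace Context format `version-traceid-spanid-flags`. A sketch of building and parsing one (helper names are illustrative, not Faro APIs):

```python
import secrets

def make_traceparent(sampled=True):
    # W3C Trace Context: 2-hex version, 32-hex trace ID, 16-hex span ID, 2-hex flags
    trace_id = secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)    # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": flags == "01"}

print(parse_traceparent(make_traceparent()))
```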

AI Observability

What It Is

AI Observability monitors generative AI and LLM applications in production. Built on OTel GenAI semantic conventions and the OpenLIT instrumentation library.
Monitors:
  • LLM API calls (OpenAI, Anthropic, Cohere, Google, etc.)
  • Vector databases (Pinecone, Weaviate, Chroma, etc.)
  • AI frameworks (LangChain, CrewAI, LlamaIndex)
  • Model Context Protocol (MCP) servers
  • GPU utilization
  • AI evaluation quality (hallucination, toxicity, bias)

Key Metrics (OTel GenAI Semantic Conventions)

| Metric | Description |
|---|---|
| `gen_ai_usage_input_tokens_total` | Total input/prompt tokens consumed |
| `gen_ai_usage_output_tokens_total` | Total output/completion tokens consumed |
| `gen_ai_usage_cost_USD_sum` | Total cost in USD |
| `gen_ai_client_operation_duration` | Latency per LLM call (histogram) |
| `gen_ai_client_token_usage` | Token usage histogram |

Trace spans capture:
  • Model name (`gen_ai.request.model`)
  • Temperature and top_p parameters
  • Full prompts and completions (configurable)
  • Provider (`gen_ai.system`: `openai`, `anthropic`, etc.)
  • Time to first token (TTFT)
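Cost is derived from token counts and a per-model price table; a hypothetical sketch of the arithmetic (the prices below are made up for illustration, not OpenLIT's actual pricing data):

```python
# Hypothetical per-1K-token prices in USD - illustrative only
PRICING = {
    "gpt-4": {"input": 0.03, "output": 0.06},
}

def call_cost_usd(model, input_tokens, output_tokens):
    # cost = input_tokens * input_price + output_tokens * output_price
    p = PRICING[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

print(round(call_cost_usd("gpt-4", 1500, 500), 4))  # 0.075
```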

Python Setup with OpenLIT

```bash
pip install openlit openai anthropic cohere
```

```python
import openlit
import openai

# One-line initialization - auto-instruments all supported LLM libraries
openlit.init()
```

Optional parameters:

```python
openlit.init(
    application_name="my-ai-app",
    environment="production",
)
```

Your existing code works unchanged - OpenLIT intercepts all LLM calls:

```python
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

OTel Environment Variables

```bash
export OTEL_SERVICE_NAME="my-ai-app"
export OTEL_DEPLOYMENT_ENVIRONMENT="production"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-<region>.grafana.net/otlp"
# Base64 encode "instanceID:apiToken"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic base64-encoded-instanceid:apitoken"
```

To get the credentials:
1. In Grafana Cloud, go to **My Account** > **Stack** > **OpenTelemetry**
2. Generate a token and copy the OTLP endpoint
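The Basic credential is just the base64 encoding of `<instanceID>:<apiToken>`; a quick way to produce it (the values below are placeholders):

```python
import base64

instance_id = "123456"            # placeholder - your Grafana Cloud instance ID
api_token = "glc_example_token"   # placeholder - your Cloud Access Policy token

credential = base64.b64encode(f"{instance_id}:{api_token}".encode()).decode()
header = f"Authorization=Basic {credential}"
print(header)
```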

AI Evaluations and Guards

Hallucination detection

```python
import os
import openlit

# user_message and llm_answer come from your application code
evals = openlit.evals.Hallucination(
    provider="openai",
    api_key=os.getenv("OPENAI_API_KEY"),
)
result = evals.measure(
    prompt=user_message,
    contexts=["Your knowledge base content here"],
    text=llm_answer,
)
```

Content safety guard

```python
guard = openlit.guard.All(
    provider="openai",
    api_key=os.getenv("OPENAI_API_KEY"),
)
guard.detect(text=user_message)
```

Prebuilt Dashboards

Once metrics arrive, Grafana Cloud auto-populates five dashboards:
  1. GenAI Observability - request rates, latency percentiles, costs
  2. GenAI Evaluations - hallucination, bias, toxicity scores
  3. Vector Database Observability - query latency, index ops
  4. MCP Observability - tool call rates, errors
  5. GPU Monitoring - utilization, memory, temperature

Setup Path

  1. In Grafana Cloud: Connections > search "AI Observability" > click the card
  2. Follow the UI wizard to get your OTLP endpoint and API key
  3. Set the environment variables
  4. Run `pip install openlit` and call `openlit.init()` at app startup
  5. Deploy - dashboards populate automatically within minutes

Full-Stack Correlation Summary

| Signal | Product | Storage | Query Language |
|---|---|---|---|
| Metrics (RED) | App Observability | Mimir | PromQL |
| Traces | Tempo | Tempo | TraceQL |
| Logs | Loki | Loki | LogQL |
| Profiles | Pyroscope | Pyroscope | - |
| Browser RUM | Faro/Frontend Obs | Loki + Tempo | - |
| LLM metrics | AI Observability | Mimir | PromQL |

Correlation keys:
  • `service.name`/`service_name` links all signals for a service
  • Trace exemplars embed trace IDs in metric data points (RED metrics -> traces)
  • A `traceID` in logs enables log-to-trace correlation
  • A `profileID` / time range enables trace-to-profile correlation
  • Faro injects `traceparent` headers to link browser sessions to backend traces

Common Tasks

Find Why a Service Has High Latency

  1. App Observability > Service Inventory > click service
  2. In Service Overview: check p95/p99 latency trend in Operations panel
  3. Click a high-latency operation > "View traces" to open exemplar traces in Tempo
  4. In Tempo trace: use "Go to profiles" to see CPU profile at that time
  5. Check correlated logs in the Logs panel of Service Overview

Debug a Frontend Error

  1. Frontend Observability > Errors panel > click error
  2. View stack trace, browser, OS, session info
  3. Click "View session replay" to see what the user did
  4. Check the correlated backend trace if `TracingInstrumentation` is configured

Monitor LLM Cost Drift

  1. AI Observability dashboard > GenAI Observability
  2. Use the `gen_ai_usage_cost_USD_sum` metric to see cost by model/provider
  3. Set an alert on a cost threshold or a token-usage spike
  4. Drill into traces to see which prompts are consuming the most tokens
