analyzing-mlflow-session

Analyzing an MLflow Chat Session

What is a Session?

A session groups multiple traces that belong to the same chat conversation or user interaction. Each trace in the session represents one turn: the user's input and the system's response. Traces within a session are linked by a shared session ID stored in trace metadata.

The session ID is stored in trace metadata under the key `mlflow.trace.session`. This key contains dots, which affects filter syntax (see below). All traces sharing the same value for this key belong to the same session.

Reconstructing the Conversation

Reconstructing a session's conversation is a multi-step process: discover the input/output schema from the first trace, extract those fields efficiently across all session traces, then inspect specific turns as needed. Do NOT fetch full traces for every turn; use `--extract-fields` on the search command instead.

Step 1: Discover the schema. First, find a trace ID from the session, then fetch its full JSON to inspect the schema:

```bash
# Get the first trace in the session
mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  --filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
  --order-by "timestamp_ms ASC" \
  --extract-fields 'info.trace_id' \
  --output json \
  --max-results 1 > /tmp/first_trace.json

# Fetch the full trace (always outputs JSON, no --output flag needed)
mlflow traces get \
  --trace-id <TRACE_ID_FROM_ABOVE> > /tmp/trace_detail.json
```


Find the **root span** — the span with `parent_span_id` equal to `null` (i.e., it has no parent). This is the top-level operation in the trace:

```bash
# Find the root span
jq '.data.spans[] | select(.parent_span_id == null)' /tmp/trace_detail.json
```


Examine its `attributes` dict to identify which keys hold the user input and system output. These could be:

- **MLflow standard attributes**: `mlflow.spanInputs` and `mlflow.spanOutputs` (set by the MLflow Python client)
- **Custom attributes**: Application-specific keys set via `@mlflow.trace` or `mlflow.start_span()` with custom attribute logging
- **Third-party OTel attributes**: Keys following GenAI Semantic Conventions, OpenInference, or other instrumentation conventions

The structure of these values also varies by application (e.g., a `query` string, a `messages` array, a dict with multiple fields). Inspect the actual attribute values to understand the format.
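As a quick first look at the schema, it can help to list only the attribute keys on the root span before reading the values. The sketch below is illustrative rather than part of the MLflow CLI: the `root_attr_keys` helper and the sample file contents are made up for this example, and the sample assumes the same `.data.spans[]` layout used by the commands above.

```bash
# Illustrative sample standing in for /tmp/trace_detail.json (all values are made up).
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
{"data": {"spans": [
  {"span_id": "a1", "parent_span_id": null, "name": "chat",
   "attributes": {"mlflow.spanInputs": "{\"query\": \"hi\"}",
                  "mlflow.spanOutputs": "\"hello\""}},
  {"span_id": "b2", "parent_span_id": "a1", "name": "llm", "attributes": {}}
]}}
EOF

# Hypothetical helper: print the root span's attribute keys as a compact JSON array.
root_attr_keys() {
  jq -c '.data.spans[] | select(.parent_span_id == null) | .attributes | keys' "$1"
}

root_attr_keys "$SAMPLE"
```

Against a real trace, the same helper would be pointed at `/tmp/trace_detail.json`. If the printed keys do not include anything that looks like an input or output, that is a hint to look at the child spans instead.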

**If the root span has empty or missing inputs/outputs**, it may be a wrapper span (e.g., an orchestrator or middleware) that doesn't directly carry the chat turn data. In that case, look at its immediate children and find the span closest to the top of the hierarchy that has meaningful inputs and outputs corresponding to a chat turn.

The following example assumes the trace comes from the MLflow Python client (which stores inputs/outputs in `mlflow.spanInputs`/`mlflow.spanOutputs`) and that the relevant span is a direct child of root. In practice, the relevant span may be deeper in the hierarchy, and traces from other clients may use different attribute keys — explore the span tree as needed:

```bash
# Get the root span's ID
ROOT_ID=$(jq -r '.data.spans[] | select(.parent_span_id == null) | .span_id' /tmp/trace_detail.json)

# List immediate children of the root span with their inputs/outputs
jq --arg root "$ROOT_ID" '.data.spans[] | select(.parent_span_id == $root) | {name: .name, inputs: .attributes["mlflow.spanInputs"], outputs: .attributes["mlflow.spanOutputs"]}' /tmp/trace_detail.json
```


Also check the first trace's assessments. **Session-level assessments are attached to the first trace in the session** — these evaluate the session as a whole (e.g., overall conversation quality, multi-turn coherence) and can indicate the presence of issues somewhere across the entire session, not just the first turn. The first trace may also have per-turn assessments for that specific turn.

Both types appear in `.info.assessments`. Session-level assessments are identified by the presence of `mlflow.trace.session` in their `metadata` field:

```bash
# Show session-level assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"]) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json

# Show per-turn assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"] == null) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json
```


**Assessment errors are not trace errors.** If an assessment has a `feedback.error` field, it means the scorer or judge failed — not that the trace itself has a problem. Exclude these when using assessments to identify trace issues.
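To see how many assessments on a trace are usable versus scorer failures, a count of each group can be a useful sanity check before drawing conclusions. This is a minimal sketch: the `assessment_counts` helper is hypothetical, the sample data is made up, and the shape of the `error` object in the sample is an assumption (the only thing the sketch relies on is whether `feedback.error` is null).

```bash
# Illustrative sample standing in for /tmp/trace_detail.json (all values are made up).
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
{"info": {"assessments": [
  {"assessment_name": "relevance", "feedback": {"value": "yes"}},
  {"assessment_name": "correctness",
   "feedback": {"value": null, "error": {"error_message": "judge timed out"}}}
]}}
EOF

# Hypothetical helper: count usable assessments versus scorer failures.
assessment_counts() {
  jq -c '[.info.assessments[]?] | {
    valid: (map(select(.feedback.error == null)) | length),
    scorer_errors: (map(select(.feedback.error != null)) | length)
  }' "$1"
}

assessment_counts "$SAMPLE"
```

A high `scorer_errors` count suggests the judges need attention, and says nothing by itself about the quality of the trace.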

**Always consult the rationale when interpreting assessment values.** The `value` alone can be misleading — for example, a `user_frustration` assessment with `value: "no"` could mean "no frustration detected" or "the frustration check did not pass" (i.e., frustration *is* present), depending on how the scorer was configured. The `.rationale` field (a top-level assessment field, **not** nested under `.feedback`) explains what the value means in context. Include rationale when extracting assessments:

```bash
jq '[.info.assessments[] | select(.feedback.error == null) | {name: .assessment_name, value: .feedback.value, rationale: .rationale}]' /tmp/trace_detail.json
```

Step 2: Extract across all session traces. Once you know which attribute keys hold inputs and outputs, search for all traces in the session using `--extract-fields` to pull those fields along with assessments (see Handling CLI Output for why output is written to a file):

```bash
mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  --filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
  --order-by "timestamp_ms ASC" \
  --extract-fields 'info.trace_id,info.state,info.request_time,info.assessments,info.trace_metadata.`mlflow.traceInputs`,info.trace_metadata.`mlflow.traceOutputs`' \
  --output json \
  --max-results 100 > /tmp/session_traces.json
```

Then use bash commands (e.g., `jq`, `wc`, `head`) on the file to analyze it.

The `--extract-fields` example above uses `mlflow.traceInputs` and `mlflow.traceOutputs` from trace metadata; adjust the field paths based on what you discovered in step 1.
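Once the session file exists, a compact one-line-per-turn view makes it easier to scan the conversation. The sketch below is one possible way to do that, not a prescribed method: the `list_turns` helper is hypothetical, the sample data is made up, and it assumes the search output has a top-level `traces` array with the two metadata keys used in the command above.

```bash
# Illustrative sample standing in for /tmp/session_traces.json (all values are made up).
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
{"traces": [
  {"info": {"trace_id": "tr-1", "trace_metadata": {
     "mlflow.traceInputs": "{\"query\": \"hi\"}", "mlflow.traceOutputs": "\"hello\""}}},
  {"info": {"trace_id": "tr-2", "trace_metadata": {
     "mlflow.traceInputs": "{\"query\": \"bye\"}", "mlflow.traceOutputs": "\"goodbye\""}}}
]}
EOF

# Hypothetical helper: one line per turn, inputs and outputs truncated to 80 characters.
list_turns() {
  jq -r '.traces | to_entries[] |
    (.value.info.trace_metadata["mlflow.traceInputs"] // "" | tostring | .[0:80]) as $in |
    (.value.info.trace_metadata["mlflow.traceOutputs"] // "" | tostring | .[0:80]) as $out |
    "turn \(.key + 1) [\(.value.info.trace_id)] in=\($in) out=\($out)"' "$1"
}

list_turns "$SAMPLE"
```

Turn numbers here come from array position, so they are only meaningful if the search was ordered by `timestamp_ms ASC`. If step 1 showed different keys, swap them in for the two metadata keys.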

Assessments contain quality judgments (e.g., correctness, relevance) that can pinpoint which turns had issues without needing to read every trace in detail. To identify which turns have assessment signals (excluding scorer errors):

```bash
# List turns with their valid assessments (scorer errors filtered out)
jq '.traces[] | {
  trace_id: .info.trace_id,
  time: .info.request_time,
  state: .info.state,
  assessments: [.info.assessments[]? | select(.feedback.error == null) | {name: .assessment_name, value: .feedback.value}]
}' /tmp/session_traces.json
```


**CLI syntax notes:**

- **`--experiment-id` is required** for all `mlflow traces search` commands. The command will fail without it.
- Metadata keys containing dots **must** be escaped with backticks in filter strings and extract-fields: `` metadata.`mlflow.trace.session` ``
- **Shell quoting**: Backticks inside **double quotes** are interpreted by bash as command substitution (e.g., bash will try to run `` `mlflow.trace.session` `` as a command). Always use **single quotes** for the outer string when the value contains backticks. For example: `` --filter-string 'metadata.`mlflow.trace.session` = "value"' ``
- `--max-results` defaults to 100, which is sufficient for most sessions. Increase up to 500 (the maximum) for longer conversations. If 500 results are returned, use pagination to retrieve the rest.
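
The quoting rule above can be checked without running MLflow at all. The following is a minimal sketch in plain bash: `session_filter` is a hypothetical helper and `session-123` is a made-up session ID. It builds the filter string inside single quotes, so the backticks stay literal.

```bash
# Hypothetical helper: build the session filter string for a given session ID.
# The printf format is single-quoted, so bash leaves the backticks alone.
session_filter() {
  printf 'metadata.`mlflow.trace.session` = "%s"' "$1"
}

FILTER="$(session_filter "session-123")"   # "session-123" is a made-up ID
printf '%s\n' "$FILTER"
```

Once the string is in a variable, passing it as `--filter-string "$FILTER"` is safe: bash does not re-scan the expanded value for backticks, so only backticks written literally inside double quotes trigger command substitution.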


Handling CLI Output

MLflow trace output can be large, and Claude Code's Bash tool has a ~30KB output limit for piped commands. When output exceeds this threshold, it gets saved to a file instead of being piped, causing silent failures.

Safe approach (always works):

```bash
# Step 1: Save to file
mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  [...] \
  --output json > /tmp/output.json

# Step 2: Process the file
jq '.traces[0].info.trace_id' /tmp/output.json
head -50 /tmp/output.json
wc -l /tmp/output.json
```


**Never pipe MLflow CLI output directly** (e.g., `mlflow traces search ... | jq '.'`). This can silently produce no output. Always redirect to a file first, then run commands on the file.
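One way to make the redirect-first habit hard to forget is a small wrapper that always writes to a file and prints only a short summary. This is a sketch under that assumption: `run_to_file` is a hypothetical helper, not part of MLflow, and the demo call uses `printf` as a stand-in for a real CLI command.

```bash
# Hypothetical helper: run a command with stdout redirected to a file,
# then print only the file path and its size in bytes.
run_to_file() {
  local out="$1"; shift
  "$@" > "$out" || { echo "command failed: $*" >&2; return 1; }
  printf '%s: %s bytes\n' "$out" "$(wc -c < "$out" | tr -d ' ')"
}

# Demo with a stand-in command; a real call would pass the mlflow command instead.
DEMO_OUT="$(mktemp)"
run_to_file "$DEMO_OUT" printf '{"traces": []}\n'
```

With a real search, the call would look like `run_to_file /tmp/output.json mlflow traces search --experiment-id <EXPERIMENT_ID> [...] --output json`, and all further processing would read from the file.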

To inspect a specific turn in detail (e.g., after identifying a problematic turn), fetch its full trace:

```bash
mlflow traces get --trace-id <TRACE_ID> > /tmp/turn_detail.json
```

Codebase Correlation

- **Session ID assignment**: Search the codebase for where `mlflow.trace.session` is set to understand how sessions are created (per user login, per browser tab, per explicit "new conversation" action, etc.).
- **Context window management**: Look for how the application constructs the message history passed to the LLM at each turn. Common patterns include sliding window (last N messages), summarization of older turns, or full history. This implementation determines what context the model sees and is a frequent source of multi-turn failures.
- **Memory and state**: Some applications maintain state across turns beyond message history (e.g., extracted entities, user preferences, accumulated tool results). Search for how this state is stored and passed between turns.

Reference Scripts

The `scripts/` subdirectory contains ready-to-run bash scripts for each analysis step. All scripts follow the output handling rules above (redirect to file, then process).

- `scripts/discover_schema.sh <EXPERIMENT_ID> <SESSION_ID>`: Finds the first trace in the session, fetches its full detail, and prints the root span's attribute keys and input/output values.
- `scripts/inspect_turn.sh <TRACE_ID>`: Fetches a specific trace, lists all spans, highlights error spans, and shows assessments.

Example: Wrong Answer on Chat Turn 5

A user reports that their chatbot gave an incorrect answer on the 5th message of a chat conversation.

1. Discover the schema and reconstruct the conversation.

Fetch the first trace in the session and inspect the root span's attributes to find which keys hold inputs and outputs. In this case, `mlflow.spanInputs` contains the user query and `mlflow.spanOutputs` contains the assistant response. Then search all session traces, extracting those fields in chronological order. Scanning the extracted inputs and outputs confirms that turn 5's response is wrong, and reveals whether earlier turns look correct.

2. Check if the error originated in an earlier turn.

Turn 3's response contains a factual error that the user didn't challenge. Turn 4 builds on that incorrect information, and turn 5 compounds it. The root cause is in turn 3, not turn 5.

3. Analyze the root-cause turn as a single trace.

Fetch the full trace for turn 3 and analyze it — examine assessments (if any), walk the span tree, check retriever results, and correlate with code. The retriever returned an outdated document, causing the wrong answer.

4. Recommendations.

- Fix the retriever's data source to exclude or update outdated documents.
- Add per-turn assessments to detect errors before they propagate across the conversation.
- Consider implementing conversation-level error detection (e.g., checking consistency of answers across turns).
