Analyzing an MLflow Chat Session

What is a Session?

A session groups multiple traces that belong to the same chat conversation or user interaction. Each trace in the session represents one turn: the user's input and the system's response. Traces within a session are linked by a shared session ID stored in trace metadata.

The session ID is stored in trace metadata under the key

mlflow.trace.session

. This key contains dots, which affects filter syntax (see below). All traces sharing the same value for this key belong to the same session.

Reconstructing the Conversation

Reconstructing a session's conversation is a multi-step process: discover the input/output schema from the first trace, extract those fields efficiently across all session traces, then inspect specific turns as needed. Do NOT fetch full traces for every turn — use

--extract-fields

on the search command instead.

Step 1: Discover the schema. First, find a trace ID from the session, then fetch its full JSON to inspect the schema:

bash

# Get the first trace in the session
mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  --filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
  --order-by "timestamp_ms ASC" \
  --extract-fields 'info.trace_id' \
  --output json \
  --max-results 1 > /tmp/first_trace.json

# Fetch the full trace (always outputs JSON, no --output flag needed)
mlflow traces get \
  --trace-id <TRACE_ID_FROM_ABOVE> > /tmp/trace_detail.json

Find the root span — the span with

parent_span_id

equal to

null

(i.e., it has no parent). This is the top-level operation in the trace:

bash

# Find the root span
jq '.data.spans[] | select(.parent_span_id == null)' /tmp/trace_detail.json

Examine its

attributes

dict to identify which keys hold the user input and system output. These could be:

MLflow standard attributes:
```
mlflow.spanInputs
```
and
```
mlflow.spanOutputs
```
(set by the MLflow Python client)
Custom attributes: Application-specific keys set via
```
@mlflow.trace
```
or
```
mlflow.start_span()
```
with custom attribute logging
Third-party OTel attributes: Keys following GenAI Semantic Conventions, OpenInference, or other instrumentation conventions

The structure of these values also varies by application (e.g., a

query

string, a

messages

array, a dict with multiple fields). Inspect the actual attribute values to understand the format.

If the root span has empty or missing inputs/outputs, it may be a wrapper span (e.g., an orchestrator or middleware) that doesn't directly carry the chat turn data. In that case, look at its immediate children — find the closest span to the top of the hierarchy that has meaningful inputs and outputs corresponding to a chat turn:

The following example assumes the trace comes from the MLflow Python client (which stores inputs/outputs in

mlflow.spanInputs

mlflow.spanOutputs

) and that the relevant span is a direct child of root. In practice, the relevant span may be deeper in the hierarchy, and traces from other clients may use different attribute keys — explore the span tree as needed:

bash

# Get the root span's ID
ROOT_ID=$(jq -r '.data.spans[] | select(.parent_span_id == null) | .span_id' /tmp/trace_detail.json)

# List immediate children of the root span with their inputs/outputs
jq --arg root "$ROOT_ID" '.data.spans[] | select(.parent_span_id == $root) | {name: .name, inputs: .attributes["mlflow.spanInputs"], outputs: .attributes["mlflow.spanOutputs"]}' /tmp/trace_detail.json

Also check the first trace's assessments. Session-level assessments are attached to the first trace in the session — these evaluate the session as a whole (e.g., overall conversation quality, multi-turn coherence) and can indicate the presence of issues somewhere across the entire session, not just the first turn. The first trace may also have per-turn assessments for that specific turn.

Both types appear in

.info.assessments

. Session-level assessments are identified by the presence of

mlflow.trace.session

in their

metadata

field:

bash

# Show session-level assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"]) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json

# Show per-turn assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"] == null) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json

Assessment errors are not trace errors. If an assessment has a

feedback.error

field, it means the scorer or judge failed — not that the trace itself has a problem. Exclude these when using assessments to identify trace issues.

Always consult the rationale when interpreting assessment values. The

value

alone can be misleading — for example, a

user_frustration

assessment with

value: "no"

could mean "no frustration detected" or "the frustration check did not pass" (i.e., frustration is present), depending on how the scorer was configured. The

.rationale

field (a top-level assessment field, not nested under

.feedback

) explains what the value means in context. Include rationale when extracting assessments:

bash

jq '[.info.assessments[] | select(.feedback.error == null) | {name: .assessment_name, value: .feedback.value, rationale: .rationale}]' /tmp/trace_detail.json

Step 2: Extract across all session traces. Once you know which attribute keys hold inputs and outputs, search for all traces in the session using

--extract-fields

to pull those fields along with assessments (see Handling CLI Output for why output is written to a file):

bash

mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  --filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
  --order-by "timestamp_ms ASC" \
  --extract-fields 'info.trace_id,info.state,info.request_time,info.assessments,info.trace_metadata.`mlflow.traceInputs`,info.trace_metadata.`mlflow.traceOutputs`' \
  --output json \
  --max-results 100 > /tmp/session_traces.json

Then use bash commands (e.g.,

jq

wc

head

) on the file to analyze it.

The

--extract-fields

example above uses

mlflow.traceInputs

mlflow.traceOutputs

from trace metadata — adjust the field paths based on what you discovered in step 1.

Assessments contain quality judgments (e.g., correctness, relevance) that can pinpoint which turns had issues without needing to read every trace in detail. To identify which turns have assessment signals (excluding scorer errors):

bash

# List turns with their valid assessments (scorer errors filtered out)
jq '.traces[] | {
  trace_id: .info.trace_id,
  time: .info.request_time,
  state: .info.state,
  assessments: [.info.assessments[]? | select(.feedback.error == null) | {
    name: .assessment_name,
    value: .feedback.value
  }]
}' /tmp/session_traces.json

CLI syntax notes:

--experiment-id
is required for all
```
mlflow traces search
```
commands. The command will fail without it.
Metadata keys containing dots must be escaped with backticks in filter strings and extract-fields:
```
metadata.`mlflow.trace.session`
```
Shell quoting: Backticks inside double quotes are interpreted by bash as command substitution (e.g., bash will try to run
```
`mlflow.trace.session`
```
as a command). Always use single quotes for the outer string when the value contains backticks. For example:
```
--filter-string 'metadata.\
```
mlflow.trace.session` = "value"'`
```
--max-results
```
defaults to 100, which is sufficient for most sessions. Increase up to 500 (the maximum) for longer conversations. If 500 results are returned, use pagination to retrieve the rest.

Handling CLI Output

MLflow trace output can be large, and Claude Code's Bash tool has a ~30KB output limit for piped commands. When output exceeds this threshold, it gets saved to a file instead of being piped, causing silent failures.

Safe approach (always works):

bash

# Step 1: Save to file
mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  [...] \
  --output json > /tmp/output.json

# Step 2: Process the file
cat /tmp/output.json | jq '.traces[0].info.trace_id'
head -50 /tmp/output.json
wc -l /tmp/output.json

Never pipe MLflow CLI output directly (e.g.,

mlflow traces search ... | jq '.'

). This can silently produce no output. Always redirect to a file first, then run commands on the file.

To inspect a specific turn in detail (e.g., after identifying a problematic turn), fetch its full trace:

bash

mlflow traces get --trace-id <TRACE_ID> > /tmp/turn_detail.json

Codebase Correlation

Session ID assignment: Search the codebase for where
```
mlflow.trace.session
```
is set to understand how sessions are created — per user login, per browser tab, per explicit "new conversation" action, etc.
Context window management: Look for how the application constructs the message history passed to the LLM at each turn. Common patterns include sliding window (last N messages), summarization of older turns, or full history. This implementation determines what context the model sees and is a frequent source of multi-turn failures.
Memory and state: Some applications maintain state across turns beyond message history (e.g., extracted entities, user preferences, accumulated tool results). Search for how this state is stored and passed between turns.

Reference Scripts

The

scripts/

subdirectory contains ready-to-run bash scripts for each analysis step. All scripts follow the output handling rules above (redirect to file, then process).

scripts/discover_schema.sh <EXPERIMENT_ID> <SESSION_ID>
— Finds the first trace in the session, fetches its full detail, and prints the root span's attribute keys and input/output values.
scripts/inspect_turn.sh <TRACE_ID>
— Fetches a specific trace, lists all spans, highlights error spans, and shows assessments.

Example: Wrong Answer on Chat Turn 5

A user reports that their chatbot gave an incorrect answer on the 5th message of a chat conversation.

1. Discover the schema and reconstruct the conversation.

Fetch the first trace in the session and inspect the root span's attributes to find which keys hold inputs and outputs. In this case,

mlflow.spanInputs

contains the user query and

mlflow.spanOutputs

contains the assistant response. Then search all session traces, extracting those fields in chronological order. Scanning the extracted inputs and outputs confirms that turn 5's response is wrong, and reveals whether earlier turns look correct.

2. Check if the error originated in an earlier turn.

Turn 3's response contains a factual error that the user didn't challenge. Turn 4 builds on that incorrect information, and turn 5 compounds it. The root cause is in turn 3, not turn 5.

3. Analyze the root-cause turn as a single trace.

Fetch the full trace for turn 3 and analyze it — examine assessments (if any), walk the span tree, check retriever results, and correlate with code. The retriever returned an outdated document, causing the wrong answer.

4. Recommendations.

Fix the retriever's data source to exclude or update outdated documents.
Add per-turn assessments to detect errors before they propagate across the conversation.
Consider implementing conversation-level error detection (e.g., checking consistency of answers across turns).

analyzing-mlflow-session

NPX Install

Tags

SKILL.md Content

Analyzing an MLflow Chat Session

What is a Session?

Reconstructing the Conversation

Handling CLI Output

Codebase Correlation

Reference Scripts

Example: Wrong Answer on Chat Turn 5