# Analyzing a Single MLflow Trace

## Trace Structure

A trace captures the full execution of an AI/ML application as a tree of spans. Each span represents one operation (LLM call, tool invocation, retrieval step, etc.) and records its inputs, outputs, timing, and status. Traces also carry assessments — feedback from humans or LLM judges about quality.
It is recommended to read references/trace-structure.md before analyzing a trace — it covers the complete data model, all fields and types, analysis guidance, and OpenTelemetry compatibility notes.

## Handling CLI Output

Traces can be 100KB+ for complex agent executions. Always redirect output to a file — do not pipe `mlflow traces get` directly to `jq`, `head`, or other commands, as piping can silently produce no output.

```bash
# Fetch full trace to a file (traces get always outputs JSON, no --output flag needed)
mlflow traces get --trace-id <ID> > /tmp/trace.json

# Then process the file
jq '.info.state' /tmp/trace.json
jq '.data.spans | length' /tmp/trace.json
```

**Prefer fetching the full trace and parsing the JSON directly** rather than using `--extract-fields`. The `--extract-fields` flag has limited support for nested span data (e.g., span inputs/outputs may return empty objects). Fetch the complete trace once and parse it as needed.

## JSON Structure

The trace JSON has two top-level keys: `info` (metadata, assessments) and `data` (spans).

```
{
  "info": { "trace_id", "state", "request_time", "assessments", ... },
  "data": { "spans": [ { "span_id", "name", "status", "attributes", ... } ] }
}
```

Key paths (verified against actual CLI output):

| What | jq path |
| --- | --- |
| Trace state | `.info.state` |
| All spans | `.data.spans` |
| Root span | `.data.spans[] \| select(.parent_span_id == null)` |
| Span status code | `.data.spans[].status.code` (values: `STATUS_CODE_OK`, `STATUS_CODE_ERROR`, `STATUS_CODE_UNSET`) |
| Span status message | `.data.spans[].status.message` |
| Span inputs | `.data.spans[].attributes["mlflow.spanInputs"]` |
| Span outputs | `.data.spans[].attributes["mlflow.spanOutputs"]` |
| Assessments | `.info.assessments` |
| Assessment name | `.info.assessments[].assessment_name` |
| Feedback value | `.info.assessments[].feedback.value` |
| Feedback error | `.info.assessments[].feedback.error` |
| Assessment rationale | `.info.assessments[].rationale` |

Important: Span inputs and outputs are stored as serialized JSON strings inside `attributes`, not as top-level span fields. Traces from third-party OpenTelemetry clients may use different attribute names (e.g., GenAI Semantic Conventions, OpenInference, or custom keys) — check the raw `attributes` dict to find the equivalent fields.

If paths don't match (structure may vary by MLflow version), discover them:

```bash
# Top-level keys
jq 'keys' /tmp/trace.json

# Span keys
jq '.data.spans[0] | keys' /tmp/trace.json

# Status structure
jq '.data.spans[0].status' /tmp/trace.json
```
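The "serialized JSON string" detail above is easy to trip over when parsing in Python instead of jq: the attribute value needs a second decode. A minimal sketch, using a synthetic trace dict shaped like the CLI output (all field values here are invented for illustration):

```python
import json

# Synthetic trace shaped like `mlflow traces get` output (values invented).
trace = {
    "info": {"trace_id": "tr-abc", "state": "OK", "assessments": []},
    "data": {"spans": [{
        "span_id": "s1",
        "name": "generate_response",
        "parent_span_id": None,
        "status": {"code": "STATUS_CODE_OK", "message": ""},
        # Inputs/outputs live in attributes as serialized JSON *strings*,
        # not as nested objects.
        "attributes": {
            "mlflow.spanInputs": json.dumps({"question": "What is our refund policy?"}),
            "mlflow.spanOutputs": json.dumps({"content": "Our policy is..."}),
        },
    }]},
}

root = next(s for s in trace["data"]["spans"] if s["parent_span_id"] is None)

# A second decode is required for the attribute payloads.
inputs = json.loads(root["attributes"]["mlflow.spanInputs"])
print(inputs["question"])  # -> What is our refund policy?
```

This mirrors jq's `fromjson`: `.attributes["mlflow.spanInputs"] | fromjson` performs the same second decode.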

## Quick Health Check

After fetching a trace to a file, run this to get a summary:

```bash
jq '{
  state: .info.state,
  span_count: (.data.spans | length),
  error_spans: [.data.spans[] | select(.status.code == "STATUS_CODE_ERROR") | .name],
  assessment_errors: [.info.assessments[] | select(.feedback.error) | .assessment_name]
}' /tmp/trace.json
```
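The same summary can be computed in Python if jq is unavailable. A sketch against a synthetic trace dict (real traces carry many more fields, and the invented data here includes one failing span and one failed assessment to exercise both filters):

```python
import json

# Synthetic trace: one errored span, one assessment whose judge failed.
trace = {
    "info": {
        "state": "OK",
        "assessments": [
            {"assessment_name": "relevance", "feedback": {"value": "yes"}},
            {"assessment_name": "groundedness",
             "feedback": {"error": "judge timed out"}},
        ],
    },
    "data": {"spans": [
        {"name": "agent", "status": {"code": "STATUS_CODE_OK"}},
        {"name": "search_tool", "status": {"code": "STATUS_CODE_ERROR"}},
    ]},
}

summary = {
    "state": trace["info"]["state"],
    "span_count": len(trace["data"]["spans"]),
    # Spans that ended in an error status.
    "error_spans": [s["name"] for s in trace["data"]["spans"]
                    if s["status"]["code"] == "STATUS_CODE_ERROR"],
    # Assessments whose scorer itself failed (not trace errors -- see below).
    "assessment_errors": [a["assessment_name"] for a in trace["info"]["assessments"]
                          if a.get("feedback", {}).get("error")],
}
print(json.dumps(summary, indent=2))
```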

## Analysis Insights

- **`state: OK` does not mean correct output.** It only means no unhandled exception. Check assessments for quality signals, and if none exist, analyze the trace's inputs, outputs, and intermediate span data directly for issues.
- **Always consult the `rationale` when interpreting assessment values.** The `value` alone can be misleading — for example, a `user_frustration` assessment with `value: "no"` could mean "no frustration detected" or "the frustration check did not pass" (i.e., frustration is present), depending on how the scorer was configured. The `.rationale` field (a top-level assessment field, not nested under `.feedback`) explains what the value means in context and often describes the issue in plain language before you need to examine any spans.
- **Assessments tell you what went wrong; spans tell you where.** If assessments exist, use feedback/expectations to form a hypothesis, then confirm it in the span tree. If no assessments exist, examine span inputs/outputs to identify where the execution diverged from expected behavior.
- **Assessment errors are not trace errors.** If an assessment has an `error` field, it means the scorer or judge that evaluated the trace failed — not that the trace itself has a problem. The trace may be perfectly fine; the assessment's `value` is just unreliable. This can happen when a scorer crashes (e.g., timed out, returned unparseable output) or when a scorer was applied to a trace type it wasn't designed for (e.g., a retrieval relevance scorer applied to a trace with no retrieval steps). The latter is a scorer configuration issue, not a trace issue.
- **Span timing reveals performance issues.** Gaps between parent and child spans indicate overhead; repeated span names suggest retries; compare individual span durations to find bottlenecks.
- **Token usage explains latency and cost.** Look for token usage in trace metadata (e.g., `mlflow.trace.tokenUsage`) or span attributes (e.g., `mlflow.chat.tokenUsage`). Not all clients set these — check the raw `attributes` dict for equivalent fields. Spikes in input tokens may indicate prompt injection or overly large context.
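The timing heuristics above can be sketched as a small pass over the span list. The timestamp field names `start_time_unix_nano` / `end_time_unix_nano` are an assumption here (OpenTelemetry-style names; verify against your trace's actual keys), and the spans are synthetic:

```python
from collections import Counter

# Synthetic spans with invented OTel-style nanosecond timestamps.
spans = [
    {"name": "agent", "span_id": "a", "parent_span_id": None,
     "start_time_unix_nano": 0, "end_time_unix_nano": 5_000_000_000},
    {"name": "llm_call", "span_id": "b", "parent_span_id": "a",
     "start_time_unix_nano": 1_000_000_000, "end_time_unix_nano": 2_000_000_000},
    {"name": "llm_call", "span_id": "c", "parent_span_id": "a",
     "start_time_unix_nano": 2_500_000_000, "end_time_unix_nano": 4_800_000_000},
]

# Per-span durations in seconds: the slowest spans are bottleneck candidates.
durations = {s["span_id"]: (s["end_time_unix_nano"] - s["start_time_unix_nano"]) / 1e9
             for s in spans}

# Repeated span names often indicate retries.
retries = [name for name, n in Counter(s["name"] for s in spans).items() if n > 1]

# Gap between a parent's start and its first child's start suggests overhead
# (setup, serialization, queueing) outside any instrumented call.
root = next(s for s in spans if s["parent_span_id"] is None)
children = sorted((s for s in spans if s["parent_span_id"] == root["span_id"]),
                  key=lambda s: s["start_time_unix_nano"])
lead_gap = (children[0]["start_time_unix_nano"] - root["start_time_unix_nano"]) / 1e9

print(durations, retries, lead_gap)
```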

## Codebase Correlation

MLflow Tracing captures inputs, outputs, and metadata from different parts of an application's call stack. By correlating trace contents with the source code, issues can be root-caused more precisely than from the trace alone.
- **Span names map to functions.** Span names typically match the function decorated with `@mlflow.trace` or wrapped in `mlflow.start_span()`. For autologged spans (LangChain, OpenAI, etc.), names follow framework conventions instead (e.g., `ChatOpenAI`, `RetrievalQA`).
- **The span tree mirrors the call stack.** If span A is the parent of span B, then function A called function B.
- **Span inputs/outputs correspond to function parameters/return values.** Comparing them against the code logic reveals whether the function behaved as designed or produced an unexpected result.
- **The trace shows what happened; the code shows why.** A retriever returning irrelevant results might trace back to a faulty similarity threshold. Incorrect span inputs might reveal wrong model parameters or missing environment variables set in code.
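The "span tree mirrors the call stack" point can be made concrete: rebuilding the tree from `parent_span_id` links reproduces the call hierarchy, which you can then read alongside the source. A sketch with invented span names:

```python
# Synthetic spans (names invented) linked by parent_span_id, as in a real trace.
spans = [
    {"span_id": "1", "parent_span_id": None, "name": "customer_support_agent"},
    {"span_id": "2", "parent_span_id": "1", "name": "plan_action"},
    {"span_id": "3", "parent_span_id": "1", "name": "search_knowledge_base"},
    {"span_id": "4", "parent_span_id": "1", "name": "generate_response"},
]

# Group spans by parent so the tree can be walked top-down.
children = {}
for s in spans:
    children.setdefault(s["parent_span_id"], []).append(s)

def render(parent_id=None, depth=0, out=None):
    """Depth-first walk: the indentation reproduces the call stack."""
    out = [] if out is None else out
    for s in children.get(parent_id, []):
        out.append("  " * depth + s["name"])
        render(s["span_id"], depth + 1, out)
    return out

tree = render()
print("\n".join(tree))
```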

## Example: Investigating a Wrong Answer

A user reports that their customer support agent gave an incorrect answer for the query "What is our refund policy?" There are no assessments on the trace.

**1. Fetch the trace and check high-level signals.** The trace has `state: OK` — no crash occurred. No assessments are present, so examine the trace's inputs and outputs directly. The `response_preview` says *"Our shipping policy states that orders are delivered within 3-5 business days..."* — this answers a different question than what was asked.

**2. Examine spans to locate the problem.** The span tree shows:

```
customer_support_agent (AGENT) — OK
├── plan_action (LLM) — OK
│   outputs: {"tool_call": "search_knowledge_base", "args": {"query": "refund policy"}}
├── search_knowledge_base (TOOL) — OK
│   inputs: {"query": "refund policy"}
│   outputs: [{"doc": "Shipping takes 3-5 business days...", "score": 0.82}]
├── generate_response (LLM) — OK
│   inputs: {"messages": [..., {"role": "user", "content": "Context: Shipping takes 3-5 business days..."}]}
│   outputs: {"content": "Our shipping policy states..."}
```

The agent correctly decided to search for "refund policy," but the `search_knowledge_base` tool returned a shipping document. The LLM then faithfully answered using the wrong context. The problem is in the tool's retrieval, not the agent's reasoning or the LLM's generation.

**3. Correlate with the codebase.** The span `search_knowledge_base` maps to a function in the application code. Investigating reveals the vector index was built from only the shipping FAQ — the refund policy documents were never indexed.

**4. Recommendations.**

- Re-index the knowledge base to include refund policy documents.
- Add a retrieval relevance scorer to detect when retrieved context doesn't match the query topic.
- Consider adding expectation assessments with correct answers for common queries to enable regression testing.
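Step 2's localization can also be scripted: decode the tool span's inputs/outputs and flag retrieval that looks off-topic relative to the query. The check below is a toy keyword-overlap heuristic on a synthetic span (real relevance scoring would use an LLM judge or embeddings), but it illustrates the shape of such a scorer:

```python
import json

query = "refund policy"

# Synthetic TOOL span mirroring step 2 above (content invented).
tool_span = {
    "name": "search_knowledge_base",
    "attributes": {
        "mlflow.spanInputs": json.dumps({"query": query}),
        "mlflow.spanOutputs": json.dumps(
            [{"doc": "Shipping takes 3-5 business days...", "score": 0.82}]),
    },
}

# Remember: attribute payloads are serialized JSON strings.
docs = json.loads(tool_span["attributes"]["mlflow.spanOutputs"])
query_terms = set(query.lower().split())

# Toy heuristic: does any retrieved doc share a term with the query?
relevant = any(query_terms & set(d["doc"].lower().split()) for d in docs)
print("retrieval looks relevant" if relevant else
      "retrieval off-topic: inspect the index behind search_knowledge_base")
```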