llm-obs-experiment-py-bootstrap
LLM Obs Experiment (Python) Bootstrap: Generate a Python Experiment Using `ddtrace.llmobs`

Produce a single self-contained Python experiment that uses the official `ddtrace.llmobs` SDK. Output is either a `.py` script or an `.ipynb` notebook. The generated code mirrors the patterns shown in Datadog's reference notebooks at https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks.

The SDK handles lazy project/experiment creation, dataset push diffing, the 5 MB / 1000-record bulk threshold, eval metric streaming, and the status state machine on the user's behalf. This skill must therefore never re-implement those primitives; it just imports `LLMObs` and trusts it.
Usage

/llm-obs-experiment-py-bootstrap [--format py|ipynb] [--dataset <path>] [--dataset-name <name>] [--dataset-version <int>] [--project-name <name>] [--evaluator-style function|class|remote] [--jobs <n>] [--output <path>]

Arguments: $ARGUMENTS
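Inputs

All inputs are optional. If the user omits a flag, fall back to the default; never block on prompting for `--jobs`, `--format`, etc.

| Input | Default | Description |
|---|---|---|
| `--format` | `py` | Output format: `py` or `ipynb` |
| `--dataset` | none (emit a sample 3-record inline dataset) | Path to a local JSON or CSV file |
| `--dataset-name` | none | Name of an existing Datadog dataset to fetch at runtime via `LLMObs.pull_dataset(...)` |
| `--dataset-version` | none (latest) | Pin to a specific dataset version when using `--dataset-name` |
| `--project-name` | `experiment-<service-name>` (resolved as described under Workflow) | Datadog project name (visible in the LLM Experiments UI). The SDK's `ml_app` is derived from it when not supplied. |
| `--evaluator-style` | `function` | Evaluator surface: `function`, `class`, or `remote` |
| `--jobs` | | Passed to `experiment.run(jobs=...)` |
| `--output` | | File extension derives from `--format` |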
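The flags listed above could be parsed along these lines. This is only a sketch using the standard-library `argparse`: the skill itself receives `$ARGUMENTS` as text, so `build_parser` and `resolve_output` are hypothetical helper names, the `--jobs` default of 10 is an assumption borrowed from the `jobs=10` in the canonical `run()` call, and the `experiment` base file name is invented for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare the skill's flags with the defaults stated in the Inputs table."""
    parser = argparse.ArgumentParser(prog="llm-obs-experiment-py-bootstrap")
    parser.add_argument("--format", choices=["py", "ipynb"], default="py")
    # The workflow treats --dataset and --dataset-name as mutually exclusive.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--dataset", type=Path, default=None)
    source.add_argument("--dataset-name", default=None)
    parser.add_argument("--dataset-version", type=int, default=None)
    parser.add_argument("--project-name", default=None)
    parser.add_argument(
        "--evaluator-style", choices=["function", "class", "remote"], default="function"
    )
    # Assumption: 10 mirrors the jobs=10 used in the canonical run() call.
    parser.add_argument("--jobs", type=int, default=10)
    parser.add_argument("--output", type=Path, default=None)
    return parser


def resolve_output(args: argparse.Namespace) -> Path:
    """Make the output file extension follow --format."""
    # Assumption: "experiment" is a hypothetical default base name.
    base = args.output if args.output is not None else Path("experiment")
    return base.with_suffix("." + args.format)
```

Using a mutually exclusive group lets the parser reject `--dataset` combined with `--dataset-name` before any file is read.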
SDK Surface (Cited)

These are the public symbols the generated code uses. All come from `ddtrace.llmobs` (the public package; never from `ddtrace.llmobs._experiment` or other underscore-prefixed modules).

| Import | Source | What it gives you |
|---|---|---|
| | | |
| `RemoteEvaluator` | `ddtrace.llmobs` | LLM-as-Judge that runs server-side; preferred over the inline LLM-as-Judge |
| `BaseEvaluator`, `EvaluatorContext` | `ddtrace.llmobs` | Class-based evaluator path (advanced) |
| | `ddtrace.llmobs` | Inline LLM-as-Judge with prompt template support |
Canonical call signatures (must match the generated code exactly):

```python
import os

from ddtrace.llmobs import LLMObs

LLMObs.enable(
    api_key=os.getenv("DD_API_KEY"),
    app_key=os.getenv("DD_APPLICATION_KEY"),
    site=os.getenv("DD_SITE", "datadoghq.com"),  # required for any site other than US1 (e.g. datad0g.com, datadoghq.eu)
    project_name="<project>",
    agentless_enabled=True,  # required when not running behind the dd-agent
)
# Note: ml_app is not a separate input. The SDK derives it from project_name
# when not supplied. If a user really wants to override it later, they can
# add ml_app="..." to enable() themselves.
```
```python
dataset = LLMObs.create_dataset(
    dataset_name="<name>",
    description="<optional>",
    records=[
        # Per-record tags MUST be a list of "key:value" strings (e.g. "env:smoke"),
        # never bare strings; the SDK rejects malformed tags with a ValueError on append.
        {"input_data": {"<k>": "<v>"}, "expected_output": "<v>", "metadata": {}, "tags": ["env:<env>"]},
        # ...
    ],
)
```

OR
```python
dataset = LLMObs.create_dataset_from_csv(
    csv_path="<path>",
    dataset_name="<name>",
    input_data_columns=["<col1>", "<col2>"],
    expected_output_columns=["<col>"],
)
```

OR pull an existing Datadog dataset by name (no local file needed)
```python
dataset = LLMObs.pull_dataset(
    dataset_name="<name>",
    project_name="<project>",  # optional; defaults to the project on enable()
    version=2,                 # optional; pin a version, omit for the latest
)

def task_fn(input_data: dict, config: dict):
    # TODO(user): replace with your actual LLM call
    ...
```

Plain function evaluator (default style)
```python
def exact_match(input_data, output_data, expected_output) -> bool:
    return output_data == expected_output

experiment = LLMObs.experiment(
    name="<experiment_name>",
    dataset=dataset,
    task=task_fn,
    evaluators=[exact_match],
    config={
        "model": "gpt-4o-mini",
        "temperature": 0.0,
        # Provenance also lives in config so it renders in the
        # experiment's Configuration view alongside model/temperature.
        # tags= below only reaches metadata.tags, which the current UI
        # does not surface as chips; config is what users actually see.
        "generated_by": "claude-code",
        "skill": "llm-obs-experiment-py-bootstrap",
    },
    description="<optional>",
    tags={
        # Same provenance, sent to experiment metadata.tags for any future
        # tag-filter UI / API consumers. Always emitted alongside the
        # config copy; never one without the other.
        "generated_by": "claude-code",
        "skill": "llm-obs-experiment-py-bootstrap",
    },
)

experiment.run(jobs=10)
print(experiment.url)
```
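The per-record tag rule in the `create_dataset` signature (every tag is a `"key:value"` string) can be enforced before records are inlined. The following is a minimal sketch, not SDK code: `normalize_tags` is a hypothetical helper, and the `tag` namespace for bare strings follows the `"smoke"` → `"tag:smoke"` convention this skill describes in its operating rules.

```python
from typing import List


def normalize_tags(tags: List[str], namespace: str = "tag") -> List[str]:
    """Return tags where every entry has the "key:value" shape.

    Bare strings are wrapped as "<namespace>:<value>" instead of being dropped.
    """
    normalized = []
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep and key and value:
            normalized.append(tag)  # already "key:value"
        else:
            # Bare or half-empty tag: strip stray colons and add the namespace.
            normalized.append(f"{namespace}:{tag.strip(':')}")
    return normalized
```

Running this over each record's `"tags"` list before emitting `records=[...]` keeps the original labels while avoiding a malformed-tag error at append time.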
---

Evaluator Styles

Generated code uses one of three evaluator surfaces, picked by `--evaluator-style`. Whichever style is chosen, prefer returning `EvaluatorResult` over a bare `bool`/`float` whenever the evaluator has any signal beyond the raw value; see "Return `EvaluatorResult`, not bare values" below.

Return `EvaluatorResult`, not bare values

Plain functions are allowed to return `bool` / `float` / `dict`, and `BaseEvaluator.evaluate()` is allowed to return raw `JSONType`. The SDK accepts both, but `EvaluatorResult` carries fields the Datadog UI surfaces in ways the raw value cannot:

| Field | Type | Used by Datadog UI for |
|---|---|---|
| `value` | | The score itself, shown on the experiment metric. Required. |
| `reasoning` | | Per-record explanation shown in the compare UI; lets reviewers see why an evaluator passed/failed without re-running the LLM. |
| `assessment` | | Determines whether a metric trend going up vs. down is an improvement; the UI uses this to color baseline-vs-candidate comparisons. |
| `metadata` | | Free-form per-record context. |
| `tags` | | Used to slice experiment results in the UI. |

The generated code should default to `EvaluatorResult` for any evaluator richer than a one-line equality check. The trivial `exact_match` and `length_under_500` shown below are the only cases where a bare `bool` is acceptable.
`function` (default: what the notebooks use)

Plain Python functions with the signature `(input_data, output_data, expected_output)`. Always emit at least three: a trivial boolean (returns `bool`), a richer rule-based one (returns `EvaluatorResult`), and an LLM-as-Judge surrogate (a `RemoteEvaluator` reference or a placeholder).

```python
from ddtrace.llmobs import EvaluatorResult

# Trivial check: bare bool is fine here, the result has no extra signal.
def exact_match(input_data, output_data, expected_output) -> bool:
    return output_data == expected_output

# Richer check: use EvaluatorResult so reasoning/assessment surface in the UI.
def response_well_formed(input_data, output_data, expected_output) -> EvaluatorResult:
    if not isinstance(output_data, str):
        return EvaluatorResult(
            value=False,
            reasoning=f"output_data was {type(output_data).__name__}, expected str",
            assessment="fail",
        )
    if len(output_data) > 500:
        return EvaluatorResult(
            value=False,
            reasoning=f"output exceeded 500 chars (was {len(output_data)})",
            assessment="fail",
            metadata={"length": len(output_data)},
        )
    return EvaluatorResult(value=True, assessment="pass")
```
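The text names `length_under_500` as the other trivial bare-`bool` evaluator. A minimal sketch of what such a check might look like, assuming "under 500" means strictly fewer than 500 characters and that non-string outputs should fail (both are assumptions, not taken from the reference notebooks):

```python
from typing import Any


def length_under_500(input_data: Any, output_data: Any, expected_output: Any) -> bool:
    """Trivial check: the output is a string shorter than 500 characters."""
    # Assumption: non-string outputs count as a failure rather than raising.
    return isinstance(output_data, str) and len(output_data) < 500
```

Because the result carries no explanation beyond pass/fail, a bare `bool` is enough here; anything that needs to say why it failed belongs in an `EvaluatorResult`.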
`class` (advanced: for evaluators that need state or async I/O)

Always return `EvaluatorResult` from `evaluate()`, never a bare value. State-bearing evaluators usually have richer reasoning to surface anyway.

```python
from ddtrace.llmobs import BaseEvaluator, EvaluatorContext, EvaluatorResult

class FaithfulnessJudge(BaseEvaluator):
    def __init__(self):
        super().__init__(name="faithfulness")
        # TODO(user): initialize any client or state here

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # context exposes: input_data, output_data, expected_output, metadata
        # TODO(user): replace placeholder logic with your faithfulness check
        passed = context.output_data is not None
        return EvaluatorResult(
            value=1.0 if passed else 0.0,
            reasoning="placeholder: replace with your faithfulness rubric",
            assessment="pass" if passed else "fail",
            metadata={"evaluator_version": "v1"},
        )
```

`remote` (LLM-as-Judge running server-side)
```python
from ddtrace.llmobs import RemoteEvaluator

# Create the judge in Datadog UI first: LLM Observability → Evaluations → New Evaluator
quality_judge = RemoteEvaluator(eval_name="<name-from-datadog-ui>")

# Optional: customize the payload the judge receives
custom_judge = RemoteEvaluator(
    eval_name="<name>",
    transform_fn=lambda ctx: {
        "question": ctx.input_data.get("question"),
        "answer": ctx.output_data,
        "reference": ctx.expected_output,
    },
)
```
---

Generated File Structure

The same section sequence in both formats. In `.py` these become comment banners; in `.ipynb` each becomes one markdown cell + one code cell.

1. Env setup: load_dotenv(), os.getenv reads, hard assert keys present
2. LLMObs.enable(): explicit api_key/app_key/site/project_name/agentless_enabled
3. Dataset: inline records OR create_dataset_from_csv OR pull_dataset
4. Task function: placeholder OpenAI call with # TODO(user) marker
5. Evaluators: 2-3 in the requested style
6. Experiment: LLMObs.experiment(config={..., "generated_by": "claude-code", ...}, tags={"generated_by": "claude-code", ...})
7. Run: experiment.run(jobs=N); print(experiment.url)
8. Results inspection: experiment.as_dataframe() if pandas, else print
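For the `.ipynb` format, the "one markdown cell + one code cell per section" layout can be assembled with the standard-library `json` module alone. The sketch below mirrors the notebook schema given in the Workflow section; `build_notebook` and the two sample sections are hypothetical, and nbformat 4.5 validators may additionally expect a per-cell `id`, which that schema omits.

```python
import json
from typing import Dict, List, Tuple


def build_notebook(sections: List[Tuple[str, str]]) -> Dict:
    """Build nbformat-4 notebook JSON: one markdown + one code cell per section."""
    cells = []
    for index, (title, code) in enumerate(sections, start=1):
        cells.append({
            "cell_type": "markdown",
            "metadata": {},
            "source": [f"## {index}. {title}\n"],
        })
        cells.append({
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            # Notebook sources are stored as a list of lines.
            "source": code.splitlines(keepends=True),
        })
    return {
        "cells": cells,
        "metadata": {
            "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
            "language_info": {"name": "python", "version": "3.10"},
        },
        "nbformat": 4,
        "nbformat_minor": 5,
    }


# Hypothetical sections, only to show the shape of the output.
notebook = build_notebook([
    ("Env setup", "import os\nassert os.getenv('DD_API_KEY'), 'DD_API_KEY is not set'\n"),
    ("Run", "print('placeholder')\n"),
])
notebook_json = json.dumps(notebook, indent=1)
```

Generating the cells from a single list of sections keeps the `.py` banners and the `.ipynb` cells in the same order by construction.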
Workflow
1. Parse arguments. Default `--format py`. Resolve `--output` extension from `--format`.

   If `--project-name` is not provided, resolve a default of the form `experiment-<service-name>` by walking these sources in order, taking the first match:

   - `pyproject.toml` → `[project] name` (PEP 621) or `[tool.poetry] name`.
   - `setup.cfg` → `[metadata] name`.
   - `setup.py` → first `name="..."` argument to `setup(...)`.
   - `package.json` → `"name"` (useful when the LLM app lives in a TS/JS monorepo Python package).
   - The basename of the current working directory, lowercased and slugified (`/^[a-z0-9-]+$/`; replace non-matching chars with `-`).

   The final project name is `experiment-<service-name>`. Strip a leading `experiment-` from `<service-name>` if it already starts with one (so a package literally named `experiment-foo` yields `experiment-foo`, not `experiment-experiment-foo`). If none of the five sources resolve to a non-empty string, fall back to `experiment-sdk-default` and emit a warning in the next-steps output that the user should set `--project-name` explicitly.

   Embed the resolved name as a string literal in the generated `PROJECT_NAME = "..."` line; don't emit runtime `os.getcwd()` lookups, since the user may run the file from a different directory than where the skill resolved it.
2. Resolve the dataset source. Error out if both `--dataset` and `--dataset-name` are passed; they're mutually exclusive.

   - `--dataset <path>` (local file → inline records or CSV loader):
     - Read the file. If JSON, validate top-level array of `DatasetRecordRaw` shape (`input_data`, optional `expected_output`, `metadata`, `tags`). If CSV, parse header and auto-detect columns using the `dataset-bootstrap` heuristics: `prompt|input|query|question` → input, `expected|gold|truth|answer` → expected.
     - Run a PII scrub (email/phone/SSN/API-key regexes) on all string values; replace matches with `<REDACTED:pii-type>` and surface a warning listing affected indices.
     - For JSON datasets, embed the records inline in the generated file (`records=[...]`) so the user has a single self-contained artifact. For CSV datasets, emit `LLMObs.create_dataset_from_csv(csv_path="<absolute path>", ...)` and tell the user the CSV needs to be present at runtime.
   - `--dataset-name <name>` (existing Datadog dataset → runtime pull):
     - Emit `LLMObs.pull_dataset(dataset_name="<name>", project_name="<project>"[, version=<n>])` in place of any `create_dataset*` call. The fetch happens when the generated experiment runs; the skill itself does not call Datadog.
     - Pass `version=<n>` through only if `--dataset-version` was set; otherwise omit it so the SDK resolves the latest.
     - Add a one-line comment above the call documenting what's being pulled, e.g. `# Pulled from Datadog: dataset_name="qa_v3", version=latest`.
     - Skip the PII scrub and the inline-records emission; there are no local records to scrub.
   - Neither flag given:
     - Fall back to the inline 3-record sample described under `--dataset`'s default, so the generated file remains runnable as-is.

   Note on dataset IDs. The public SDK's `LLMObs.pull_dataset(...)` takes a name, not an ID, so there's no `--dataset-id` flag. If a user only has a dataset ID from a Datadog UI URL (`/llm/datasets/<id>`), the workflow is: open that URL in the UI, copy the dataset name, and pass it as `--dataset-name`. The skill must not import `ddtrace.llmobs._experiment` or any other underscore module to work around this.
3. Pick evaluator template based on `--evaluator-style`:

   - `function`: 3 plain functions: one trivial boolean (`exact_match`-style, bare `bool` OK), one richer rule-based check returning `EvaluatorResult` with `reasoning` + `assessment`, and one LLM-as-Judge surrogate. If `--dataset` had structured `expected_output`, add a JSON-shape check (also returning `EvaluatorResult`).
   - `class`: 2 `BaseEvaluator` subclasses with `evaluate(self, context: EvaluatorContext) -> EvaluatorResult`. Always return `EvaluatorResult` (never a bare value); state-bearing evaluators have richer signal to surface.
   - `remote`: 1-2 `RemoteEvaluator(eval_name=...)` instances with a comment instructing the user to create the judge in the Datadog UI first.

   In all styles: any evaluator with non-trivial logic must return `EvaluatorResult` populating at minimum `value` + `reasoning` + `assessment` (see the "Return `EvaluatorResult`, not bare values" section). The compare UI uses `reasoning` for per-record drill-downs and `assessment` to determine whether a metric trend is an improvement.
4. Emit the file.

   For `.py`: single file, one blank line between sections, banner comments like:

   ```python
   # ─── 3. Dataset ───────────────────────────────────────────────────────────────
   ```

   Use `from __future__ import annotations` and `from typing import Any, Dict` at the top. Type-hint task and evaluator function signatures.

   For `.ipynb`: valid Jupyter notebook JSON. Schema:

   ```json
   {
     "cells": [
       {"cell_type": "markdown", "metadata": {}, "source": ["## 1. Env setup\n", "..."]},
       {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["..."]},
       ...
     ],
     "metadata": {
       "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
       "language_info": {"name": "python", "version": "3.10"}
     },
     "nbformat": 4,
     "nbformat_minor": 5
   }
   ```

   One markdown cell + one code cell per section. Keep each code cell self-contained enough that re-running it in isolation makes sense.
5. Best-effort syntax check via Bash. Don't fail the skill if the toolchain is missing; just report.

   - `.py`: `python -m py_compile <path>`
   - `.ipynb`: `python -c "import json; nb = json.load(open('<path>')); assert nb.get('cells'); print(f'cells={len(nb[\"cells\"])}')"`

6. Print next-steps (see Output section).
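The project-name resolution in step 1 can be sketched as below. It assumes the candidate names have already been read from the five sources in order (file parsing is left out), and `slugify`, `resolve_project_name`, and `FALLBACK` are hypothetical names used only for illustration.

```python
import re
from typing import Iterable, Optional

FALLBACK = "experiment-sdk-default"
PREFIX = "experiment-"


def slugify(name: str) -> str:
    """Lowercase and replace every char outside [a-z0-9-] with '-'."""
    return re.sub(r"[^a-z0-9-]", "-", name.lower())


def resolve_project_name(candidates: Iterable[Optional[str]]) -> str:
    """Take the first non-empty candidate and prefix it with 'experiment-'."""
    for candidate in candidates:
        service = (candidate or "").strip()
        # Avoid a doubled prefix such as "experiment-experiment-foo".
        if service.startswith(PREFIX):
            service = service[len(PREFIX):]
        if service:
            return PREFIX + service
    # None of the sources produced a usable name.
    return FALLBACK
```

A caller would pass the names in source order, applying `slugify` only to the working-directory basename, and embed the returned string as a literal in the generated file.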
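The PII scrub from the dataset step could look roughly like this. The regular expressions are illustrative assumptions rather than the skill's actual patterns (the API-key pattern is left out because its format is not specified here), and `scrub_value` and `scrub_records` are hypothetical helper names.

```python
import re
from typing import Any, List, Tuple

# Assumption: illustrative patterns only; a real scrubber needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scrub_value(value: Any) -> Tuple[Any, bool]:
    """Recursively redact PII in strings; return (scrubbed value, changed?)."""
    if isinstance(value, str):
        scrubbed = value
        for kind, pattern in PII_PATTERNS.items():
            scrubbed = pattern.sub(f"<REDACTED:{kind}>", scrubbed)
        return scrubbed, scrubbed != value
    if isinstance(value, dict):
        pairs = {k: scrub_value(v) for k, v in value.items()}
        return {k: v for k, (v, _) in pairs.items()}, any(c for _, c in pairs.values())
    if isinstance(value, list):
        items = [scrub_value(v) for v in value]
        return [v for v, _ in items], any(c for _, c in items)
    return value, False


def scrub_records(records: List[dict]) -> Tuple[List[dict], List[int]]:
    """Scrub every record and report the indices that were modified."""
    scrubbed, affected = [], []
    for index, record in enumerate(records):
        clean, changed = scrub_value(record)
        scrubbed.append(clean)
        if changed:
            affected.append(index)
    return scrubbed, affected
```

The list of affected indices is what feeds the warning shown to the user; the scrubbed records are what get inlined.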
What the Generated Code MUST NOT Do

A reviewer should be able to run these checks against the generated file and get zero matches:

| grep | Why it's wrong |
|---|---|
| | |
| | Status state machine and dataset diff are SDK responsibilities. |
| | Private import paths. Always use `from ddtrace.llmobs import ...`. |
| | The SDK owns them. |
| | Always read from `os.environ`. |
| | The skill produces SDK-only code. Direct HTTP calls bypass the SDK's lazy creation, push-diff, and bulk-threshold handling. |

If any of those slip into the output, the skill is wrong; re-emit.
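A reviewer could automate this kind of zero-match check with a short script. The patterns below are hypothetical stand-ins inferred from the rules stated elsewhere in this skill (no direct HTTP calls, no private imports, no literal credentials); they are not the table's own grep patterns, which should take precedence where they differ.

```python
import re
from typing import Dict, List

# Assumption: hypothetical patterns, not the skill's official grep list.
FORBIDDEN = {
    "direct HTTP call": re.compile(r"\brequests\.(post|get|put|patch)\b"),
    "private import": re.compile(r"ddtrace\.llmobs\._\w+"),
    "hard-coded API key": re.compile(r"api_key\s*=\s*[\"'][^\"']+[\"']"),
}


def find_violations(source: str) -> Dict[str, List[int]]:
    """Map each violated rule to the 1-based line numbers where it matched."""
    violations: Dict[str, List[int]] = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in FORBIDDEN.items():
            if pattern.search(line):
                violations.setdefault(rule, []).append(lineno)
    return violations
```

An empty dictionary means the generated file passed; any entry points at the lines to regenerate.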
Output

After writing, print:

```
Generated SDK experiment: <format>
  Path:  <path>
  Lines: <count>   (or Cells: <count> for .ipynb)

SDK calls used:
  ✓ LLMObs.enable(...)   (line/cell ~<N>)
  ✓ LLMObs.<create_dataset|create_dataset_from_csv|pull_dataset>(...)   (line/cell ~<N>)
  ✓ task_fn(input_data, config)   (line/cell ~<N>)
  ✓ <N> evaluators (style: <function|class|remote>)
  ✓ LLMObs.experiment(...).run(jobs=<N>)   (line/cell ~<N>)
  ✓ Provenance (in config + tags): generated_by=claude-code, skill=llm-obs-experiment-py-bootstrap

Syntax check: <pass | skipped: toolchain missing | fail with details>

Install:
  pip install "ddtrace>=4.7" python-dotenv openai

Environment variables (required at runtime):
  export DD_API_KEY=...
  export DD_APPLICATION_KEY=...
  export DD_SITE=datadoghq.com
  export OPENAI_API_KEY=...   # only if you keep the placeholder task

Run:
  python <path>             # for --format py
  jupyter notebook <path>   # for --format ipynb

Next steps:
  1. Replace the placeholder task_fn with your actual LLM call.
  2. Adjust the evaluators (or wire up RemoteEvaluator names you created in the Datadog UI).
  3. Run it. The script prints experiment.url at the end.
  4. Watch the experiment: https://app.datadoghq.com/llm/experiments
```
Reference Notebook Patterns (use as templates)
The canonical set lives at https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks and serves as the style reference; the generated code should feel like it could have come from this set.

| Notebook | Pattern demonstrated |
|---|---|
| | Dataset create/append/push lifecycle |
| | Minimum viable experiment: inline records, OpenAI task, 2 boolean evaluators |
| | CSV-loaded dataset, multi-value task output, confidence-based evaluators |
| | Two-step LLM pipelines inside a single |
| | |

When `--evaluator-style remote`, lean toward the `07` style. When `--dataset` is a CSV, lean toward `02`. Default (no `--dataset`, `--evaluator-style function`) is the `01` style.
Datadog Documentation
These are the canonical reference pages on https://docs.datadoghq.com/. Use them to ground answers about LLM Observability features and to look up details that aren't covered in this skill.
| Topic | URL | Use when |
|---|---|---|
| LLM Observability overview | https://docs.datadoghq.com/llm_observability/ | Establishing what the product covers, terminology |
| Setup | https://docs.datadoghq.com/llm_observability/setup/ | API/app key creation, project + ml_app setup, region/site selection |
| Instrumentation overview | https://docs.datadoghq.com/llm_observability/instrumentation/ | Auto-instrumentation, manual SDK usage, span model |
| Python SDK reference | https://docs.datadoghq.com/llm_observability/instrumentation/sdk/ | Public symbol list, decorator semantics, span kinds, annotate/enable signatures |
| Experiments | https://docs.datadoghq.com/llm_observability/experiments/ | |
| Evaluations | https://docs.datadoghq.com/llm_observability/evaluations/ | Evaluator concepts, managed vs custom evaluators |
| Custom LLM-as-a-judge evaluations | https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/ | |
| Managed evaluations | https://docs.datadoghq.com/llm_observability/evaluations/managed_evaluations/ | Pre-built judges (faithfulness, toxicity, etc.) |
| Monitoring | https://docs.datadoghq.com/llm_observability/monitoring/ | Alerts, dashboards, span-level monitors |
| Terms / glossary | https://docs.datadoghq.com/llm_observability/terms/ | Span kinds, sessions, traces, ml_app |
| Evaluation developer guide | https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide/ | Writing offline evaluators, validation strategy |
| Claude Code skills guide | https://docs.datadoghq.com/llm_observability/guide/claude_code_skills/ | How this skill fits alongside the rest of the |
| MCP server | https://docs.datadoghq.com/llm_observability/mcp_server/ | Connecting MCP-compatible clients to LLM Obs data |
| Reference notebooks (GitHub) | https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks | Style-of-life examples for the generated |
Researching features the skill does not cover

If the user asks about an LLM Observability feature the skill's body doesn't address (e.g., specific span kinds, dataset versioning semantics, an evaluator type not covered above), fetch the relevant page from `docs.datadoghq.com` rather than guessing:

- Pick the most specific URL from the table above. Most LLM Obs questions resolve under `/llm_observability/{experiments,evaluations,instrumentation,monitoring}/`.
- Use `WebFetch` on that URL with a focused query (e.g., `"How does Dataset.push() handle the 5 MB threshold?"`). Prefer `WebFetch` over generic web search; the canonical page is almost always under `docs.datadoghq.com/llm_observability/`.
- Fall back to `WebSearch` with `site:docs.datadoghq.com/llm_observability` if you don't know which subpage owns the topic.
- Cite the page in the answer with its URL so the user can verify and bookmark.

Never invent symbols or behaviors not present in this skill body or the docs above. If the docs don't cover the question either, say so explicitly and suggest filing an issue on `DataDog/llm-observability` rather than fabricating a workaround.
Operating Rules
- **SDK only.** No `requests.post`, no manual JSON:API envelope construction, no manual ID generation. If a feature seems to require those, you're solving the wrong problem; the SDK already covers it.
- **Public imports only.** `from ddtrace.llmobs import ...`. Never `_experiment`, `_llmobs`, or any underscore-prefixed module.
- **Env vars, not literals.** Credentials are always read from `os.environ`. The generated `main()` (or the env-setup cell) must `assert` they're set, with a clear message.
- **Always pass `site=` to `LLMObs.enable()`.** Read it from `os.getenv("DD_SITE", "datadoghq.com")`. Omitting `site=` silently defaults to US1 prod, which breaks every org outside US1 prod (e.g. staging `datad0g.com`, EU `datadoghq.eu`). The canonical signature already includes it; never drop it.
- **Per-record `tags` are `"key:value"` strings.** When inlining records (whether from `--dataset` JSON, CSV, or the default sample), each entry in a record's `"tags"` list must be a `"key:value"` string like `"env:prod"`, `"source:traces"`, `"category:geography"`. Bare strings (`"smoke"`, `"baseline"`) trigger `ValueError: Tag '<name>' is malformed.` at `Dataset.append()` time. If the source data has bare-string tags, namespace them, e.g. wrap `"smoke"` as `"tag:smoke"` rather than dropping it.
- **`# TODO(user)` markers** on the placeholder task and on at least one evaluator so reviewers can't ship the placeholder by accident.
- **Match notebook conventions.** Plain function evaluators by default; class-based only when the user opts in. Print `experiment.url` at the end of every generated file.
- **Tag every experiment with provenance, in both `config` and `tags`.** Every `LLMObs.experiment(...)` call must carry `"generated_by": "claude-code"` and `"skill": "llm-obs-experiment-py-bootstrap"` as keys in both the `config={...}` dict (so they render in the experiment's Configuration view, which is where users actually look) and the `tags={...}` dict (which the SDK serializes into `metadata.tags` for future tag-filter consumers). The `tags=` path alone is not enough: the current LLM Experiments UI does not surface `metadata.tags` as filterable chips, so users won't see the provenance unless it's also in `config`. If a user later edits the generated file to add their own keys, they extend both dicts and never silently replace the provenance keys.
- **PII scrub at the door.** If `--dataset` is given, scrub the records before inlining them into the generated file. Never embed a record that contains an unmasked email/phone/SSN/API-key pattern.
- **Don't generate `requirements.txt` or `pyproject.toml`.** Print the `pip install` command in the next-steps message instead; most users already have a venv.
- **No silent fallbacks.** If `--format` is unsupported, error out with the valid choices.
- **Python only.** If a user passes `--language typescript` (or any non-Python language flag), error out; this skill produces Python `ddtrace.llmobs` SDK code only.
- **Research, don't invent.** If the user asks about an LLM Observability feature, span kind, evaluator type, or SDK symbol that is not documented in this skill body, `WebFetch` the relevant `docs.datadoghq.com/llm_observability/*` page (see the Datadog Documentation table above for the canonical URLs) before answering. Cite the page URL in the response. If the docs don't cover the topic, say so explicitly; never fabricate symbols, flags, or behaviors.
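
The env-var, `site=`, and provenance rules above can be sketched as three small helpers. This is a minimal illustration, not the skill's actual template: the helper names (`require_env`, `resolve_site`, `with_provenance`) are hypothetical, and `DD_API_KEY` / `DD_APP_KEY` are assumed to be the credential variables the generated file checks. Only `DD_SITE` and the two provenance keys come from the rules themselves.

```python
import os
from typing import Optional

# Provenance keys required by the rules above, for both config= and tags=.
PROVENANCE = {
    "generated_by": "claude-code",
    "skill": "llm-obs-experiment-py-bootstrap",
}


def require_env(*names: str) -> dict:
    """Return the named environment variables, failing loudly if any is unset or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    assert not missing, (
        "Set these environment variables before running: " + ", ".join(missing)
    )
    return {name: os.environ[name] for name in names}


def resolve_site() -> str:
    """Read the Datadog site from DD_SITE, defaulting to US1 prod as the rule describes."""
    return os.getenv("DD_SITE", "datadoghq.com")


def with_provenance(extra: Optional[dict] = None) -> dict:
    """Extend a user dict with the provenance keys, refusing to overwrite them silently."""
    merged = dict(extra or {})
    for key, value in PROVENANCE.items():
        if key in merged and merged[key] != value:
            raise ValueError(f"Refusing to overwrite provenance key {key!r}")
        merged[key] = value
    return merged


# Hypothetical usage inside a generated main():
#   creds = require_env("DD_API_KEY", "DD_APP_KEY")   # assumed variable names
#   site = resolve_site()                              # passed as site= to LLMObs.enable()
#   config = with_provenance({"model": "my-model"})    # passed as config= to LLMObs.experiment(...)
#   tags = with_provenance({"team": "search"})         # passed as tags= to LLMObs.experiment(...)
```

Building `config` and `tags` through the same helper keeps the two dicts from drifting apart when a user adds their own keys later.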
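
The tag-format and PII rules above both act on records before they are inlined, so they can be sketched as one preparation pass. This is a hedged illustration only: the function names, the `tag` namespace default, the redaction placeholders, and the regular expressions are all assumptions, and the patterns are deliberately simple rather than exhaustive. "Well-formed" is assumed here to mean a non-empty key and value separated by a colon; the SDK's exact check is not documented in this skill.

```python
import re

# Illustrative patterns only; real scrubbing would need broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"),
    "api_key": re.compile(r"\b[A-Za-z0-9]{32,}\b"),
}


def scrub(value):
    """Recursively mask PII-looking substrings in strings, dicts, and lists."""
    if isinstance(value, str):
        for label, pattern in PII_PATTERNS.items():
            value = pattern.sub(f"[REDACTED_{label.upper()}]", value)
        return value
    if isinstance(value, dict):
        return {key: scrub(item) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


def normalize_tags(tags, namespace="tag"):
    """Keep "key:value" tags as-is and namespace bare strings instead of dropping them."""
    normalized = []
    for tag in tags:
        key, sep, val = tag.partition(":")
        if sep and key and val:
            normalized.append(tag)
        else:
            normalized.append(f"{namespace}:{tag.strip(':')}")
    return normalized


def prepare_record(record):
    """Return a scrubbed copy of a record with well-formed tags; the input is not mutated."""
    cleaned = scrub(record)
    if "tags" in cleaned:
        cleaned["tags"] = normalize_tags(cleaned["tags"])
    return cleaned
```

For example, a record whose question text is `"Email jane@example.com or call 415-555-0123"` with tags `["env:prod", "smoke"]` would come back with both values masked and the tags rewritten to `["env:prod", "tag:smoke"]`.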