llm-obs-experiment-py-bootstrap

LLM Obs Experiment (Python) Bootstrap: Generate a Python Experiment Using `ddtrace.llmobs`

Produce a single self-contained Python experiment that uses the official `ddtrace.llmobs` SDK. Output is either a `.py` script or an `.ipynb` notebook. The generated code mirrors the patterns shown in DataDog's reference notebooks at https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks.

The SDK handles lazy project/experiment creation, dataset push diffing, the 5 MB / 1000-record bulk threshold, eval metric streaming, and the status state machine on the user's behalf. This skill must therefore never re-implement those primitives; it just imports `LLMObs` and trusts it.

Usage

/llm-obs-experiment-py-bootstrap [--format py|ipynb] [--dataset <path>] [--dataset-name <name>] [--dataset-version <int>] [--project-name <name>] [--evaluator-style function|class|remote] [--jobs <n>] [--output <path>]

Arguments: $ARGUMENTS

Inputs

All inputs are optional. If the user omits a flag, fall back to the default; never block on prompting for `--jobs`, `--format`, etc.

| Input | Default | Description |
| --- | --- | --- |
| `--format` | `py` | `py` (single `.py` file) or `ipynb` (Jupyter notebook with one cell per section). |
| `--dataset` | none; emit a sample 3-record `records=[...]` inline so the file is runnable as-is | Path to a local `DatasetRecordRaw[]` JSON or CSV. JSON → `create_dataset(records=...)`; CSV → `create_dataset_from_csv(...)`. Mutually exclusive with `--dataset-name`. |
| `--dataset-name` | none | Name of an existing Datadog dataset to fetch at runtime via `LLMObs.pull_dataset(...)`. Use this when the dataset already lives in Datadog (e.g. created in the UI or by a prior run); no local file required. Mutually exclusive with `--dataset`. |
| `--dataset-version` | none (latest) | Pin to a specific dataset version when using `--dataset-name`. Passed through as `pull_dataset(version=N)`. Ignored if `--dataset-name` is not set. |
| `--project-name` | `experiment-<service-name>`, derived from the codebase (see Workflow step 1); falls back to `experiment-sdk-default` only if nothing resolves | Datadog project name (visible in the LLM Experiments UI). The SDK's `ml_app` tag falls back to this automatically; no separate flag needed. |
| `--evaluator-style` | `function` | `function` (plain functions, the notebook default), `class` (`BaseEvaluator` subclasses), or `remote` (`RemoteEvaluator` instances). |
| `--jobs` | `10` | Passed to `experiment.run(jobs=N)`. |
| `--output` | `./experiments/experiment.<ext>` | File extension derives from `--format`: `.py` or `.ipynb`. |
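
The defaults in the table can be restated in executable form. The sketch below is only an illustration: the skill is invoked as a slash command rather than a Python CLI, and `parse_bootstrap_args` is a hypothetical helper name, not part of the skill or the SDK.

```python
import argparse


def parse_bootstrap_args(argv):
    """Parse the skill's flags, applying the defaults from the Inputs table."""
    parser = argparse.ArgumentParser(prog="llm-obs-experiment-py-bootstrap")
    parser.add_argument("--format", choices=["py", "ipynb"], default="py")
    # --dataset and --dataset-name are mutually exclusive.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--dataset")
    source.add_argument("--dataset-name")
    parser.add_argument("--dataset-version", type=int)
    parser.add_argument("--project-name")
    parser.add_argument(
        "--evaluator-style",
        choices=["function", "class", "remote"],
        default="function",
    )
    parser.add_argument("--jobs", type=int, default=10)
    parser.add_argument("--output")
    args = parser.parse_args(argv)
    if args.output is None:
        # The default output path takes its extension from --format.
        args.output = f"./experiments/experiment.{args.format}"
    if args.dataset_name is None:
        # --dataset-version is ignored unless --dataset-name is set.
        args.dataset_version = None
    return args
```

Calling `parse_bootstrap_args([])` yields every default without prompting, which matches the rule that a missing flag never blocks the run.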

SDK Surface (Cited)

These are the public symbols the generated code uses. All come from `ddtrace.llmobs` (the public package; never from `ddtrace.llmobs._experiment` or other underscore-prefixed modules).

| Import | Source | What it gives you |
| --- | --- | --- |
| `LLMObs` | `ddtrace/llmobs/__init__.py` re-exports `_llmobs.py` | `.enable()`, `.create_dataset()`, `.create_dataset_from_csv()`, `.pull_dataset(dataset_name, project_name, version)`, `.experiment()`, `.async_experiment()` |
| `RemoteEvaluator`, `EvaluatorContext` | `ddtrace/llmobs/__init__.py` | LLM-as-Judge that runs server-side; preferred over inline `LLMJudge` |
| `BaseEvaluator`, `EvaluatorResult` | `ddtrace/llmobs/__init__.py` | Class-based evaluator path (advanced) |
| `LLMJudge` | `ddtrace/llmobs/_evaluators/llm_judge.py` (re-exported) | Inline LLM-as-Judge with prompt template support |

Canonical call signatures (must match the generated code exactly):

```python
LLMObs.enable(
    api_key=os.getenv("DD_API_KEY"),
    app_key=os.getenv("DD_APPLICATION_KEY"),
    site=os.getenv("DD_SITE", "datadoghq.com"),  # required for non-prod sites (e.g. datad0g.com, datadoghq.eu)
    project_name="<project>",
    agentless_enabled=True,  # required when not running behind the dd-agent
)
```

```python
# Note: ml_app is not a separate input. The SDK derives it from project_name
# when not supplied. If a user really wants to override it later, they can
# add `ml_app="..."` to enable() themselves.

# ---

dataset = LLMObs.create_dataset(
    dataset_name="<name>",
    description="<optional>",
    records=[
        # Per-record `tags` MUST be a list of "key:value" strings (e.g. "env:smoke"),
        # never bare strings; the SDK rejects malformed tags with a ValueError on append.
        {"input_data": {"<k>": "<v>"}, "expected_output": "<v>", "metadata": {}, "tags": ["env:<env>"]},
        # ...
    ],
)

# OR
dataset = LLMObs.create_dataset_from_csv(
    csv_path="<path>",
    dataset_name="<name>",
    input_data_columns=["<col1>", "<col2>"],
    expected_output_columns=["<col>"],
)

# OR pull an existing Datadog dataset by name (no local file needed)
dataset = LLMObs.pull_dataset(
    dataset_name="<name>",
    project_name="<project>",  # optional; defaults to the project on enable()
    version=2,  # optional; pin a version, omit for the latest
)


def task_fn(input_data: dict, config: dict):
    # TODO(user): replace with your actual LLM call
    ...


# Plain function evaluator (default style)
def exact_match(input_data, output_data, expected_output) -> bool:
    return output_data == expected_output


experiment = LLMObs.experiment(
    name="<experiment_name>",
    dataset=dataset,
    task=task_fn,
    evaluators=[exact_match],
    config={
        "model": "gpt-4o-mini",
        "temperature": 0.0,
        # Provenance also lives in `config` so it renders in the
        # experiment's Configuration view alongside model/temperature.
        # `tags=` below only reaches metadata.tags, which the current UI
        # does not surface as chips; config is what users actually see.
        "generated_by": "claude-code",
        "skill": "llm-obs-experiment-py-bootstrap",
    },
    description="<optional>",
    tags={
        # Same provenance, sent to experiment metadata.tags for any future
        # tag-filter UI / API consumers. Always emitted alongside the
        # config copy, never one without the other.
        "generated_by": "claude-code",
        "skill": "llm-obs-experiment-py-bootstrap",
    },
)

experiment.run(jobs=10)
print(experiment.url)
```
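
Because malformed per-record tags only fail once the SDK appends the record, it may help to check them before embedding records in the generated file. The sketch below is a rough pre-check under a stated assumption: it treats a tag as well-formed when it is a string with a non-empty key and value around the first colon. The SDK's own validation rule is not described here, and `find_malformed_tags` is a hypothetical helper.

```python
def find_malformed_tags(records):
    """Return (record_index, tag) pairs for tags that are not 'key:value' strings."""
    problems = []
    for index, record in enumerate(records):
        for tag in record.get("tags", []):
            if isinstance(tag, str):
                key, separator, value = tag.partition(":")
            else:
                key, separator, value = "", "", ""
            # Assumption: both sides of the first colon must be non-empty.
            if not (separator and key and value):
                problems.append((index, tag))
    return problems


sample_records = [
    {"input_data": {"q": "a"}, "tags": ["env:smoke"]},
    {"input_data": {"q": "b"}, "tags": ["smoke"]},
]
print(find_malformed_tags(sample_records))
```

An empty result means every tag has the `key:value` shape the record comment asks for.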

---

Evaluator Styles

Generated code uses one of three evaluator surfaces, picked by `--evaluator-style`. Whichever style is chosen, prefer returning `EvaluatorResult` over a bare `bool` / `float` whenever the evaluator has any signal beyond the raw value; see "Return EvaluatorResult, not bare values" below.

Return `EvaluatorResult`, not bare values

Plain functions are allowed to return `bool`, `float`, or `dict`, and `BaseEvaluator.evaluate()` is allowed to return raw `JSONType`. The SDK accepts both, but `EvaluatorResult` carries fields the Datadog UI surfaces in ways the raw value cannot:

| Field | Type | Used by Datadog UI for |
| --- | --- | --- |
| `value` | `bool` / `float` / `str` / `dict` (JSONType) | The score itself, shown on the experiment metric. Required. |
| `reasoning` | `str` | Per-record explanation shown in the compare UI; lets reviewers see why an evaluator passed/failed without re-running the LLM. |
| `assessment` | `str` (e.g. `"pass"` / `"fail"` / `"partial"`) | Determines whether a metric trend going up vs. down is an improvement; the UI uses this to color baseline-vs-candidate comparisons. |
| `metadata` | `dict[str, JSONType]` | Free-form per-record context (e.g. `{"confidence": 0.95}`); shown in record drill-down. |
| `tags` | `dict[str, JSONType]` | Used to slice experiment results in the UI (e.g. `{"category": "accuracy"}`). |

The generated code should default to `EvaluatorResult` for any evaluator richer than a one-line equality check. The trivial `exact_match` and `length_under_500` shown below are the only cases where a bare `bool` is acceptable.

`function` (default: what the notebooks use)

Plain Python functions with the signature `(input_data, output_data, expected_output)`. Always emit at least three: a trivial boolean (returns `bool`), a richer rule-based one (returns `EvaluatorResult`), and an LLM-as-Judge surrogate (a `RemoteEvaluator` reference or a placeholder).

```python
from ddtrace.llmobs import EvaluatorResult


# Trivial check: bare bool is fine here, the result has no extra signal.
def exact_match(input_data, output_data, expected_output) -> bool:
    return output_data == expected_output


# Richer check: use EvaluatorResult so reasoning/assessment surface in the UI.
def response_well_formed(input_data, output_data, expected_output) -> EvaluatorResult:
    if not isinstance(output_data, str):
        return EvaluatorResult(
            value=False,
            reasoning=f"output_data was {type(output_data).__name__}, expected str",
            assessment="fail",
        )
    if len(output_data) > 500:
        return EvaluatorResult(
            value=False,
            reasoning=f"output exceeded 500 chars (was {len(output_data)})",
            assessment="fail",
            metadata={"length": len(output_data)},
        )
    return EvaluatorResult(value=True, assessment="pass")
```

`class` (advanced: for evaluators that need state or async I/O)

Always return `EvaluatorResult` from `evaluate()`, never a bare value. State-bearing evaluators usually have richer reasoning to surface anyway.

```python
from ddtrace.llmobs import BaseEvaluator, EvaluatorContext, EvaluatorResult


class FaithfulnessJudge(BaseEvaluator):
    def __init__(self):
        super().__init__(name="faithfulness")
        # TODO(user): initialize any client or state here

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # context exposes: input_data, output_data, expected_output, metadata
        # TODO(user): replace placeholder logic with your faithfulness check
        passed = context.output_data is not None
        return EvaluatorResult(
            value=1.0 if passed else 0.0,
            reasoning="placeholder: replace with your faithfulness rubric",
            assessment="pass" if passed else "fail",
            metadata={"evaluator_version": "v1"},
        )
```

`remote` (LLM-as-Judge running server-side)

```python
from ddtrace.llmobs import RemoteEvaluator

# Create the judge in Datadog UI first: LLM Observability → Evaluations → New Evaluator
quality_judge = RemoteEvaluator(eval_name="<name-from-datadog-ui>")

# Optional: customize the payload the judge receives
custom_judge = RemoteEvaluator(
    eval_name="<name>",
    transform_fn=lambda ctx: {
        "question": ctx.input_data.get("question"),
        "answer": ctx.output_data,
        "reference": ctx.expected_output,
    },
)
```

---

Generated File Structure

Both formats use the same section sequence. In `.py` the sections become comment banners; in `.ipynb` each becomes one markdown cell + one code cell.

1. Env setup           — load_dotenv(), os.getenv reads, hard assert keys present
2. LLMObs.enable()     — explicit api_key/app_key/project_name/agentless_enabled
3. Dataset             — inline records OR create_dataset_from_csv
4. Task function       — placeholder OpenAI call with # TODO(user) marker
5. Evaluators          — 2-3 in the requested style
6. Experiment          — LLMObs.experiment(config={..., "generated_by": "claude-code", ...}, tags={"generated_by": "claude-code", ...})
7. Run                 — experiment.run(jobs=N); print(experiment.url)
8. Results inspection  — experiment.as_dataframe() if pandas, else print

Workflow

Parse arguments. Default `--format py`. Resolve the `--output` extension from `--format`.

If `--project-name` is not provided, resolve a default of the form `experiment-<service-name>` by walking these sources in order, taking the first match:

1. `pyproject.toml` → `[project] name` (PEP 621) or `[tool.poetry] name`.
2. `setup.cfg` → `[metadata] name`.
3. `setup.py` → first `name="..."` argument to `setup(...)`.
4. `package.json` → `"name"` (useful when the LLM app lives in a TS/JS monorepo Python package).
5. The basename of the current working directory, lowercased and slugified (`/^[a-z0-9-]+$/`; replace non-matching chars with `-`).

The final project name is `experiment-<service-name>`. Strip a leading `experiment-` from `<service-name>` if it already starts with one (so a package literally named `experiment-foo` yields `experiment-foo`, not `experiment-experiment-foo`). If none of the five sources resolve to a non-empty string, fall back to `experiment-sdk-default` and emit a warning in the next-steps output that the user should set `--project-name` explicitly.

Embed the resolved name as a string literal in the generated `PROJECT_NAME = "..."` line. Don't emit runtime `os.getcwd()` lookups, since the user may run the file from a different directory than where the skill resolved it.

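
The naming rules above can be sketched as two small functions. This is a simplified illustration: it assumes the caller has already read the candidate names from the five sources in order (file parsing is left out), and `slugify` and `project_name_from` are hypothetical helper names.

```python
import re


def slugify(name):
    """Lowercase and replace every character outside [a-z0-9-] with '-'."""
    return re.sub(r"[^a-z0-9-]", "-", name.lower())


def project_name_from(candidates):
    """Build experiment-<service-name> from the first non-empty candidate."""
    for candidate in candidates:
        service = (candidate or "").strip()
        # Avoid experiment-experiment-foo when the package already has the prefix.
        if service.startswith("experiment-"):
            service = service[len("experiment-"):]
        if service:
            return f"experiment-{service}"
    # Nothing resolved: the caller should also warn the user to set --project-name.
    return "experiment-sdk-default"


# Candidates in source order; None or "" means that source did not resolve.
candidates = [None, "", "experiment-foo", slugify("My App")]
print(project_name_from(candidates))  # experiment-foo
```

The returned string is what would be embedded as the `PROJECT_NAME` literal, so nothing about the resolution needs to happen again at runtime.
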
Resolve the dataset source. Error out if both `--dataset` and `--dataset-name` are passed; they're mutually exclusive.

- `--dataset <path>` (local file → inline records or CSV loader):
  - Read the file. If JSON, validate top-level array of `DatasetRecordRaw` shape (`input_data`, optional `expected_output`, `metadata`, `tags`). If CSV, parse header and auto-detect columns using the `dataset-bootstrap` heuristics: `prompt|input|query|question` → input, `expected|gold|truth|answer` → expected.
  - Run a PII scrub (email/phone/SSN/API-key regexes) on all string values; replace matches with `<REDACTED:pii-type>` and surface a warning listing affected indices.
  - For JSON datasets, embed the records inline in the generated file (`records=[...]`) so the user has a single self-contained artifact. For CSV datasets, emit `LLMObs.create_dataset_from_csv(csv_path="<absolute path>", ...)` and tell the user the CSV needs to be present at runtime.
- `--dataset-name <name>` (existing Datadog dataset → runtime pull):
  - Emit `LLMObs.pull_dataset(dataset_name="<name>", project_name="<project>"[, version=<n>])` in place of any `create_dataset*` call. The fetch happens when the generated experiment runs; the skill itself does not call Datadog.
  - Pass `version=<n>` through only if `--dataset-version` was set; otherwise omit it so the SDK resolves the latest.
  - Add a one-line comment above the call documenting what's being pulled, e.g. `# Pulled from Datadog: dataset_name="qa_v3", version=latest`.
  - Skip the PII scrub and the inline-records emission; there are no local records to scrub.
- Neither flag given:
  - Fall back to the inline 3-record sample described under `--dataset`'s default, so the generated file remains runnable as-is.

Note on dataset IDs. The public SDK's `LLMObs.pull_dataset(...)` takes a name, not an ID, so there's no `--dataset-id` flag. If a user only has a dataset ID from a Datadog UI URL (`/llm/datasets/<id>`), the workflow is: open that URL in the UI, copy the dataset name, and pass it as `--dataset-name`. The skill must not import `ddtrace.llmobs._experiment` or any other underscore module to work around this.

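
For the local-file path described above, the column heuristics and the PII scrub could look roughly like the sketch below. Several things here are assumptions: the two PII patterns are illustrative and far from complete (phone and API-key patterns are omitted), the heuristics are applied as case-insensitive substring matches, and a column matching both hints is treated as input. The helper names are hypothetical.

```python
import re

INPUT_HINT = re.compile(r"prompt|input|query|question", re.IGNORECASE)
EXPECTED_HINT = re.compile(r"expected|gold|truth|answer", re.IGNORECASE)

# Illustrative patterns only; real PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_columns(header):
    """Split CSV header names into (input columns, expected-output columns)."""
    input_cols = [c for c in header if INPUT_HINT.search(c)]
    expected_cols = [
        c for c in header if EXPECTED_HINT.search(c) and c not in input_cols
    ]
    return input_cols, expected_cols


def scrub(value):
    """Recursively redact strings; return (scrubbed_value, was_redacted)."""
    if isinstance(value, str):
        redacted = False
        for kind, pattern in PII_PATTERNS.items():
            value, count = pattern.subn(f"<REDACTED:{kind}>", value)
            redacted = redacted or count > 0
        return value, redacted
    if isinstance(value, dict):
        pairs = {key: scrub(item) for key, item in value.items()}
        return (
            {key: pair[0] for key, pair in pairs.items()},
            any(pair[1] for pair in pairs.values()),
        )
    if isinstance(value, list):
        pairs = [scrub(item) for item in value]
        return [pair[0] for pair in pairs], any(pair[1] for pair in pairs)
    return value, False


def scrub_records(records):
    """Return (scrubbed records, indices of records that had a redaction)."""
    scrubbed, affected = [], []
    for index, record in enumerate(records):
        clean, redacted = scrub(record)
        scrubbed.append(clean)
        if redacted:
            affected.append(index)
    return scrubbed, affected
```

The `affected` list is what would feed the warning that names the record indices touched by the scrub.
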
Pick evaluator template based on `--evaluator-style`:

- `function`: 3 plain functions: one trivial boolean (`exact_match`-style, bare `bool` OK), one richer rule-based check returning `EvaluatorResult` with `reasoning` + `assessment`, and one LLM-as-Judge surrogate. If `--dataset` had structured `expected_output`, add a JSON-shape check (also returning `EvaluatorResult`).
- `class`: 2 `BaseEvaluator` subclasses with `evaluate(self, context: EvaluatorContext) -> EvaluatorResult`. Always return `EvaluatorResult` (never a bare value); state-bearing evaluators have richer signal to surface.
- `remote`: 1-2 `RemoteEvaluator(eval_name=...)` instances with a comment instructing the user to create the judge in the Datadog UI first.

In all styles: any evaluator with non-trivial logic must return `EvaluatorResult` populating at minimum `value` + `reasoning` + `assessment` (see the "Return `EvaluatorResult`, not bare values" section). The compare UI uses `reasoning` for per-record drill-downs and `assessment` to determine whether a metric trend is an improvement.

Emit the file.

For `.py`: single file, one blank line between sections, banner comments like:

```python
# ─── 3. Dataset ───────────────────────────────────────────────────────────────
```

Use `from __future__ import annotations` and `from typing import Any, Dict` at the top. Type-hint task and evaluator function signatures.

For `.ipynb`: valid Jupyter notebook JSON. Schema:

```json
{
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["## 1. Env setup\n", "..."]},
    {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["..."]},
    ...
  ],
  "metadata": {
    "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
    "language_info": {"name": "python", "version": "3.10"}
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
```

One markdown cell + one code cell per section. Keep each code cell self-contained enough that re-running it in isolation makes sense.
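
One way to assemble a notebook of that shape is to build the dictionary from a list of sections and serialize it with the standard `json` module. The sketch below mirrors the schema as given; `build_notebook` is a hypothetical helper, and the section titles and code strings are placeholders.

```python
import json


def build_notebook(sections):
    """Turn (title, code) pairs into a notebook dict with two cells per section."""
    cells = []
    for number, (title, code) in enumerate(sections, start=1):
        cells.append(
            {
                "cell_type": "markdown",
                "metadata": {},
                "source": [f"## {number}. {title}\n"],
            }
        )
        cells.append(
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                # Notebook sources are stored as a list of lines.
                "source": code.splitlines(keepends=True),
            }
        )
    return {
        "cells": cells,
        "metadata": {
            "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
            "language_info": {"name": "python", "version": "3.10"},
        },
        "nbformat": 4,
        "nbformat_minor": 5,
    }


notebook = build_notebook([("Env setup", "import os\n"), ("Dataset", "records = []\n")])
notebook_json = json.dumps(notebook, indent=2)
```

Writing `notebook_json` to the `--output` path would give a file with one markdown cell and one code cell per section.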

Best-effort syntax check via Bash. Don't fail the skill if the toolchain is missing; just report.

- `.py`: `python -m py_compile <path>`
- `.ipynb`: `python -c "import json; nb = json.load(open('<path>')); assert nb.get('cells'); print(f'cells={len(nb[\"cells\"])}')"`
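
The same two checks can also be expressed as a single Python function, which may be convenient when a shell is not available. This is a sketch under that assumption, not what the skill runs; `syntax_check` is a hypothetical helper that reports a result string instead of raising.

```python
import json
import py_compile


def syntax_check(path):
    """Return 'pass' or 'fail: <details>' for a generated .py or .ipynb file."""
    try:
        if path.endswith(".ipynb"):
            with open(path, encoding="utf-8") as handle:
                notebook = json.load(handle)
            if not notebook.get("cells"):
                return "fail: notebook has no cells"
        else:
            # doraise=True turns compile errors into PyCompileError.
            py_compile.compile(path, doraise=True)
    except (py_compile.PyCompileError, json.JSONDecodeError, OSError) as exc:
        return f"fail: {exc}"
    return "pass"
```

Returning a string keeps the check best-effort: the caller can print the result in the next-steps output without aborting.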

Print next-steps (see Output section).

What the Generated Code MUST NOT Do

A reviewer should be able to run these `grep` checks against the generated file and get zero matches:

| `grep` pattern | Why it's wrong |
| --- | --- |
| `uuid4`, `uuid.uuid4` | `record_id` is minted by the SDK on `dataset.append()`; never client-generate. |
| `PATCH`, `batch_update`, `records/upload` | Status state machine and dataset diff are SDK responsibilities. |
| `from ddtrace.llmobs._` | Private import paths. Always use `from ddtrace.llmobs import ...`. |
| `"record_id"`, `"canonical_id"` (as dict keys in records) | The SDK owns them. |
| `DD_API_KEY = "<actual key>"` | Always read from `os.environ`. |
| `requests.post`, `httpx.post` | The skill produces SDK-only code. Direct HTTP calls bypass the SDK's lazy creation, push-diff, and bulk-threshold handling. |

If any of those slip into the output, the skill is wrong — re-emit.
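
The table's checks can be approximated in Python when `grep` is not at hand. The sketch below is one possible translation: the regular expressions are my own rendering of the table rows (in particular the hard-coded-key pattern is an assumption about what a literal key assignment looks like), and `find_violations` is a hypothetical helper.

```python
import re

# Regex renderings of the table rows; the reasons are paraphrased from the table.
FORBIDDEN = {
    r"uuid4": "record_id is minted by the SDK",
    r"PATCH|batch_update|records/upload": "state machine and dataset diff belong to the SDK",
    r"from ddtrace\.llmobs\._": "private import path",
    r"\"record_id\"|\"canonical_id\"": "SDK-owned record keys",
    r"DD_API_KEY\s*=\s*[\"']": "hard-coded API key",
    r"requests\.post|httpx\.post": "direct HTTP call bypasses the SDK",
}


def find_violations(source):
    """Return (line_number, pattern, reason) for every forbidden match."""
    violations = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for pattern, reason in FORBIDDEN.items():
            if re.search(pattern, line):
                violations.append((line_number, pattern, reason))
    return violations
```

Like the `grep` checks, this matches comments and strings as well as code, so an empty list is the only passing result.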

Output

After writing, print:

Generated SDK experiment: <format>
Path: <path>
Lines: <count>   (or Cells: <count> for .ipynb)

SDK calls used:
  ✓ LLMObs.enable(...)                       (line/cell ~<N>)
  ✓ LLMObs.<create_dataset|create_dataset_from_csv|pull_dataset>(...)  (line/cell ~<N>)
  ✓ task_fn(input_data, config)              (line/cell ~<N>)
  ✓ <N> evaluators (style: <function|class|remote>)
  ✓ LLMObs.experiment(...).run(jobs=<N>)     (line/cell ~<N>)
  ✓ Provenance (in config + tags): generated_by=claude-code, skill=llm-obs-experiment-py-bootstrap

Syntax check: <pass | skipped: toolchain missing | fail with details>

Install:
  pip install "ddtrace>=4.7" python-dotenv openai

Environment variables (required at runtime):
  export DD_API_KEY=...
  export DD_APPLICATION_KEY=...
  export DD_SITE=datadoghq.com
  export OPENAI_API_KEY=...   # only if you keep the placeholder task

Run:
  python <path>                  # for --format py
  jupyter notebook <path>        # for --format ipynb

Next steps:
1. Replace the placeholder task_fn with your actual LLM call.
2. Adjust the evaluators (or wire up RemoteEvaluator names you created in the Datadog UI).
3. Run it. The script prints experiment.url at the end.
4. Watch the experiment: https://app.datadoghq.com/llm/experiments

Reference Notebook Patterns (use as templates)

参考笔记本模式（用作模板）

The canonical set lives at https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks and serves as the style reference — the generated code should feel like it could have come from this set.

Notebook	Pattern demonstrated
`00-basic-datasets.ipynb`	Dataset create/append/push lifecycle
`01-basic-experiments.ipynb`	Minimum viable experiment — inline records, OpenAI task, 2 boolean evaluators
`02-extra-data.ipynb`	CSV-loaded dataset, multi-value task output, confidence-based evaluators
`04-multi-span-experiments.ipynb`	Two-step LLM pipelines inside a single `task_fn`
`07-remote-evaluators.ipynb`	`RemoteEvaluator` with custom `transform_fn`

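When `--evaluator-style remote` is set, lean toward the style of the remote-evaluator notebook in the table above. When `--dataset` is a CSV, lean toward the CSV-loaded dataset notebook. The default (no `--dataset`, `--evaluator-style function`) follows the style of the minimum viable experiment notebook.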

标准参考笔记本集位于https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks，是风格参考——生成的代码应看起来像是来自该集合。

笔记本	演示的模式
`00-basic-datasets.ipynb`	数据集创建/追加/推送生命周期
`01-basic-experiments.ipynb`	最简可行实验——内置记录、OpenAI任务、2个布尔评估器
`02-extra-data.ipynb`	CSV加载的数据集、多值任务输出、基于置信度的评估器
`04-multi-span-experiments.ipynb`	单个 `task_fn` 内的两步LLM流水线
`07-remote-evaluators.ipynb`	带有自定义 `transform_fn` 的 `RemoteEvaluator`

当设置 `--evaluator-style remote` 时，参考上表中远程评估器笔记本的风格。当 `--dataset` 是CSV文件时，参考CSV加载数据集的笔记本。默认情况（无 `--dataset`，`--evaluator-style function`）参考最简可行实验笔记本的风格。

Datadog Documentation

Datadog文档

These are the canonical reference pages on https://docs.datadoghq.com/. Use them to ground answers about LLM Observability features and to look up details that aren't covered in this skill.

Topic	URL	Use when
LLM Observability overview	https://docs.datadoghq.com/llm_observability/	Establishing what the product covers, terminology
Setup	https://docs.datadoghq.com/llm_observability/setup/	API/app key creation, project + ml_app setup, region/site selection
Instrumentation overview	https://docs.datadoghq.com/llm_observability/instrumentation/	Auto-instrumentation, manual SDK usage, span model
Python SDK reference	https://docs.datadoghq.com/llm_observability/instrumentation/sdk/	Public symbol list, decorator semantics, span kinds, annotate/enable signatures
Experiments	https://docs.datadoghq.com/llm_observability/experiments/	`LLMObs.experiment(...)` , dataset lifecycle, eval streaming, status states
Evaluations	https://docs.datadoghq.com/llm_observability/evaluations/	Evaluator concepts, managed vs custom evaluators
Custom LLM-as-a-judge evaluations	https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/	`RemoteEvaluator` payload shape and rubric design
Managed evaluations	https://docs.datadoghq.com/llm_observability/evaluations/managed_evaluations/	Pre-built judges (faithfulness, toxicity, etc.)
Monitoring	https://docs.datadoghq.com/llm_observability/monitoring/	Alerts, dashboards, span-level monitors
Terms / glossary	https://docs.datadoghq.com/llm_observability/terms/	Span kinds, sessions, traces, ml_app
Evaluation developer guide	https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide/	Writing offline evaluators, validation strategy
Claude Code skills guide	https://docs.datadoghq.com/llm_observability/guide/claude_code_skills/	How this skill fits alongside the rest of the `dd-llmo` set
MCP server	https://docs.datadoghq.com/llm_observability/mcp_server/	Connecting MCP-compatible clients to LLM Obs data
Reference notebooks (GitHub)	https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks	Style examples for the generated `.py` / `.ipynb`

以下是https://docs.datadoghq.com/上的标准参考页面。使用这些页面来解答LLM可观测性功能相关问题，并查找本工具未涵盖的细节。

主题	URL	使用场景
LLM可观测性概述	https://docs.datadoghq.com/llm_observability/	了解产品覆盖范围、术语
设置	https://docs.datadoghq.com/llm_observability/setup/	API/app密钥创建、项目+ml_app设置、区域/站点选择
instrumentation概述	https://docs.datadoghq.com/llm_observability/instrumentation/	自动instrumentation、手动SDK使用、Span模型
Python SDK参考	https://docs.datadoghq.com/llm_observability/instrumentation/sdk/	公共符号列表、装饰器语义、Span类型、annotate/enable签名
实验	https://docs.datadoghq.com/llm_observability/experiments/	`LLMObs.experiment(...)` 、数据集生命周期、评估流式传输、状态
评估	https://docs.datadoghq.com/llm_observability/evaluations/	评估器概念、托管与自定义评估器
自定义LLM-as-a-judge评估	https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/	`RemoteEvaluator` 负载格式和评估规则设计
托管评估	https://docs.datadoghq.com/llm_observability/evaluations/managed_evaluations/	预构建评估器（忠实度、毒性等）
监控	https://docs.datadoghq.com/llm_observability/monitoring/	告警、仪表板、Span级监控
术语/词汇表	https://docs.datadoghq.com/llm_observability/terms/	Span类型、会话、Trace、ml_app
评估开发者指南	https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide/	编写离线评估器、验证策略
Claude Code技能指南	https://docs.datadoghq.com/llm_observability/guide/claude_code_skills/	本技能如何与其他 `dd-llmo` 技能配合使用
MCP服务器	https://docs.datadoghq.com/llm_observability/mcp_server/	将MCP兼容客户端连接到LLM Obs数据
参考笔记本（GitHub）	https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks	生成的 `.py` / `.ipynb` 文件的风格示例

Researching features the skill does not cover

研究本工具未涵盖的功能

If the user asks about an LLM Observability feature the skill's body doesn't address (e.g., specific span kinds, dataset versioning semantics, an evaluator type not covered above), fetch the relevant page from `docs.datadoghq.com` rather than guessing:

1. **Pick the most specific URL** from the table above. Most LLM Obs questions resolve under `/llm_observability/{experiments,evaluations,instrumentation,monitoring}/`.
2. **Use `WebFetch`** on that URL with a focused query (e.g., `"How does Dataset.push() handle the 5 MB threshold?"`). Prefer `WebFetch` over generic web search; the canonical page is almost always under `docs.datadoghq.com/llm_observability/`.
3. **Fall back to `WebSearch`** with `site:docs.datadoghq.com/llm_observability` if you don't know which subpage owns the topic.
4. **Cite the page** in the answer with its URL so the user can verify and bookmark.

Never invent symbols or behaviors not present in this skill body or the docs above. If the docs don't cover the question either, say so explicitly and suggest filing an issue on `DataDog/llm-observability` rather than fabricating a workaround.

如果用户询问本工具未涉及的LLM可观测性功能（例如特定Span类型、数据集版本控制语义、上述未涵盖的评估器类型），请从 `docs.datadoghq.com` 获取相关页面，而非猜测：

1. **从上述表格中选择最具体的URL。** 大多数LLM Obs问题可在 `/llm_observability/{experiments,evaluations,instrumentation,monitoring}/` 下找到答案。
2. **使用 `WebFetch`** 对该URL进行聚焦查询（例如 `"Dataset.push()如何处理5MB阈值？"`）。优先使用 `WebFetch` 而非通用网络搜索，因为标准页面几乎都在 `docs.datadoghq.com/llm_observability/` 下。
3. **如果不知道哪个子页面涵盖该主题，回退到 `WebSearch`**，并使用 `site:docs.datadoghq.com/llm_observability`。
4. **在答案中引用页面URL**，以便用户验证和收藏。

切勿发明本工具或上述文档中未提及的符号或行为。如果文档也未涵盖该问题，请明确说明，并建议在 `DataDog/llm-observability` 上提交issue，而非编造解决方案。

Operating Rules

操作规则

1. **SDK only.** No `requests.post`, no manual JSON:API envelope construction, no manual ID generation. If a feature seems to require those, you're solving the wrong problem: the SDK already covers it.
2. **Public imports only.** `from ddtrace.llmobs import ...`. Never `_experiment`, `_llmobs`, or any underscore-prefixed module.
3. **Env vars, not literals.** Credentials are always read from `os.environ`. The generated `main()` (or the env-setup cell) must `assert` that they are set, with a clear message.
4. **Always pass `site=` to `LLMObs.enable()`.** Read it from `os.getenv("DD_SITE", "datadoghq.com")`. Omitting `site=` silently defaults to US1 prod, which breaks every org on another site (e.g. staging `datad0g.com` or EU `datadoghq.eu`). The canonical signature already includes it; never drop it.
5. **Per-record `tags` are `"key:value"` strings.** When inlining records (whether from `--dataset` JSON, CSV, or the default sample), each entry in a record's `"tags"` list must be a `"key:value"` string like `"env:prod"`, `"source:traces"`, `"category:geography"`. Bare strings (`"smoke"`, `"baseline"`) trigger `ValueError: Tag '<name>' is malformed.` at `Dataset.append()` time. If the source data has bare-string tags, namespace them (e.g. wrap `"smoke"` as `"tag:smoke"`) rather than dropping them.
6. **`# TODO(user)` markers** go on the placeholder task and on at least one evaluator so reviewers can't ship the placeholder by accident.
7. **Match notebook conventions.** Plain function evaluators by default; class-based only when the user opts in. Print `experiment.url` at the end of every generated file.
8. **Tag every experiment with provenance, in both `config` and `tags`.** Every `LLMObs.experiment(...)` call must carry `"generated_by": "claude-code"` and `"skill": "llm-obs-experiment-py-bootstrap"` as keys in both the `config={...}` dict (so they render in the experiment's Configuration view, which is where users actually look) and the `tags={...}` dict (which the SDK serializes into `metadata.tags` for future tag-filter consumers). The `tags=` path alone is not enough: the current LLM Experiments UI does not surface `metadata.tags` as filterable chips, so users won't see the provenance unless it's also in `config`. If a user later edits the generated file to add their own keys, they extend both dicts and never silently replace the provenance keys.
9. **PII scrub at the door.** If `--dataset` is given, scrub the records before inlining them into the generated file. Never embed a record that contains an unmasked email/phone/SSN/API-key pattern.
10. **Don't generate `requirements.txt` or `pyproject.toml`.** Print the `pip install` command in the next-steps message instead; most users already have a venv.
11. **No silent fallbacks.** If `--format` is unsupported, error out with the valid choices.
12. **Python only.** If a user passes `--language typescript` (or any non-Python language flag), error out: this skill produces Python `ddtrace.llmobs` SDK code only.
13. **Research, don't invent.** If the user asks about an LLM Observability feature, span kind, evaluator type, or SDK symbol that is not documented in this skill body, `WebFetch` the relevant `docs.datadoghq.com/llm_observability/*` page (see the Datadog Documentation table above for the canonical URLs) before answering. Cite the page URL in the response. If the docs don't cover the topic, say so explicitly; never fabricate symbols, flags, or behaviors.
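
The "Env vars, not literals" and `site=` rules above could be satisfied with a small settings loader like the one below. This is a sketch under stated assumptions: the helper name `load_datadog_settings` and the dictionary keys `api_key` and `app_key` are hypothetical, so check them against the canonical `LLMObs.enable(...)` signature earlier in this skill before passing them through.

```python
import os
from typing import Mapping

# Variable names taken from the "Environment variables" list in the Output section.
REQUIRED_ENV_VARS = ("DD_API_KEY", "DD_APPLICATION_KEY")


def load_datadog_settings(environ: Mapping[str, str] = os.environ) -> dict:
    """Collect credentials and site from the environment, failing loudly if any are unset.

    Hypothetical helper: the returned key names are assumptions for illustration.
    """
    missing = [name for name in REQUIRED_ENV_VARS if not environ.get(name)]
    assert not missing, (
        "Missing required environment variables: "
        + ", ".join(missing)
        + ". Export them before running this experiment."
    )
    return {
        "api_key": environ["DD_API_KEY"],
        "app_key": environ["DD_APPLICATION_KEY"],
        # Always resolve the site explicitly so non-US1 orgs are not silently misrouted.
        "site": environ.get("DD_SITE", "datadoghq.com"),
    }
```

Taking the mapping as a parameter keeps the check easy to exercise in a notebook cell without mutating the real process environment.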
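
The per-record tag rule above can be applied before records are inlined. The sketch below namespaces bare strings under `tag:` as the rule suggests; the function name `normalize_record_tags` is hypothetical, and the check for what counts as well-formed is a simplifying assumption, not the SDK's own validation logic.

```python
def normalize_record_tags(tags: list[str]) -> list[str]:
    """Namespace bare-string tags so every entry has the "key:value" shape.

    Hypothetical helper: the well-formedness test is an assumption, not the SDK's.
    """
    normalized = []
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep and key.strip() and value.strip():
            # Already "key:value"; keep it untouched.
            normalized.append(tag)
        else:
            # Bare string (or a dangling colon): wrap it instead of dropping it.
            normalized.append(f"tag:{tag.strip(': ')}")
    return normalized


if __name__ == "__main__":
    print(normalize_record_tags(["env:prod", "smoke", "baseline"]))
```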
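
For the provenance rule above, one way to keep the two required keys in both dicts is to merge user-supplied values underneath them. This is an illustrative sketch: `with_provenance` is a hypothetical helper, and the `model`, `temperature`, and `team` keys are placeholder values, not fields the SDK requires.

```python
PROVENANCE = {
    "generated_by": "claude-code",
    "skill": "llm-obs-experiment-py-bootstrap",
}


def with_provenance(user_values: dict) -> dict:
    """Merge user keys with the provenance keys; provenance wins on any conflict."""
    return {**user_values, **PROVENANCE}


# Both dicts carry the same provenance; these would be passed as config= and tags=.
config = with_provenance({"model": "placeholder-model", "temperature": 0.0})
tags = with_provenance({"team": "placeholder-team"})
```

Because the provenance dict is unpacked last, a user key named `skill` or `generated_by` cannot overwrite it.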
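
The "PII scrub at the door" rule above does not prescribe a mechanism. The sketch below shows one possible regex-based pass over nested record values; the patterns, placeholder strings, and function names are all assumptions for illustration, and real datasets will likely need broader or stricter rules.

```python
import re

# Illustrative patterns only (assumptions): US-style SSN and phone formats,
# and API keys that start with "sk-" or "pk-".
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("[API_KEY]", re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{16,}\b")),
]


def scrub_text(text: str) -> str:
    """Replace each matched pattern with its placeholder."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def scrub_value(value):
    """Recursively scrub strings inside dicts and lists; leave other types alone."""
    if isinstance(value, str):
        return scrub_text(value)
    if isinstance(value, dict):
        return {key: scrub_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub_value(item) for item in value]
    return value
```

Running `scrub_value` over each record before it is written into the generated file keeps the masking step in one place.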

1. **仅使用SDK。** 禁止使用 `requests.post`、手动构建JSON:API信封、手动生成ID。如果某个功能似乎需要这些操作，说明你解决问题的方式有误：SDK已经涵盖了这些功能。
2. **仅使用公共导入。** `from ddtrace.llmobs import ...`。切勿使用 `_experiment`、`_llmobs` 或任何以下划线开头的模块。
3. **使用环境变量，而非字面量。** 凭据始终从 `os.environ` 读取。生成的 `main()`（或环境设置单元格）必须通过 `assert` 确保它们已设置，并给出清晰的提示信息。
4. **始终向 `LLMObs.enable()` 传递 `site=` 参数。** 从 `os.getenv("DD_SITE", "datadoghq.com")` 读取。省略 `site=` 会静默默认使用US1生产环境，这会导致所有位于其他站点的组织（例如 staging `datad0g.com` 或欧盟 `datadoghq.eu`）功能失效。标准签名已包含此参数，切勿省略。
5. **每条记录的 `tags` 是 `"key:value"` 格式的字符串。** 当内置记录时（无论是来自 `--dataset` JSON、CSV还是默认示例），记录的 `"tags"` 列表中的每个条目必须是 `"key:value"` 格式的字符串，例如 `"env:prod"`、`"source:traces"`、`"category:geography"`。纯字符串（`"smoke"`、`"baseline"`）会在 `Dataset.append()` 时触发 `ValueError: Tag '<name>' is malformed.`。如果源数据包含纯字符串标签，请为其添加命名空间（例如将 `"smoke"` 包装为 `"tag:smoke"`），而非删除它。
6. **在占位符任务和至少一个评估器上添加 `# TODO(user)` 标记**，确保审核者不会意外发布占位符代码。
7. **匹配笔记本惯例。** 默认使用普通函数评估器；仅当用户选择时才使用基于类的评估器。在每个生成文件的末尾打印 `experiment.url`。
8. **为每个实验添加来源标签，同时写入 `config` 和 `tags`。** 每个 `LLMObs.experiment(...)` 调用必须在 `config={...}` 字典（以便在实验的配置视图中显示，这是用户实际查看的地方）和 `tags={...}` 字典（SDK会将其序列化为 `metadata.tags`，供未来的标签筛选消费者使用）中包含 `"generated_by": "claude-code"` 和 `"skill": "llm-obs-experiment-py-bootstrap"` 键。仅使用 `tags=` 路径是不够的：当前LLM实验UI不会将 `metadata.tags` 显示为可筛选标签，因此除非来源信息同时出现在 `config` 中，否则用户看不到它。如果用户稍后编辑生成的文件添加自己的键，他们应扩展两个字典，切勿静默替换来源键。
9. **在入口处进行PII清理。** 如果提供了 `--dataset`，在将记录内置到生成文件前进行清理。切勿嵌入包含未掩码邮箱/电话/SSN/API密钥模式的记录。
10. **不要生成 `requirements.txt` 或 `pyproject.toml`。** 在后续步骤消息中打印 `pip install` 命令即可，大多数用户已经有虚拟环境。
11. **无静默回退。** 如果 `--format` 不受支持，则报错并列出有效选项。
12. **仅支持Python。** 如果用户传递 `--language typescript`（或任何非Python语言标志），则报错：本工具仅生成Python `ddtrace.llmobs` SDK代码。
13. **研究，而非发明。** 如果用户询问本工具未记录的LLM可观测性功能、Span类型、评估器类型或SDK符号，请先 `WebFetch` 相关的 `docs.datadoghq.com/llm_observability/*` 页面（请参阅上述Datadog文档表格中的标准URL），然后再回答。在响应中引用页面URL。如果文档未涵盖该主题，请明确说明，切勿编造符号、标志或行为。
