llm-obs-experiment-py-bootstrap

LLM Obs Experiment (Python) Bootstrap: Generate a Python Experiment Using `ddtrace.llmobs`

Produce a single self-contained Python experiment that uses the official `ddtrace.llmobs` SDK. Output is either a `.py` script or an `.ipynb` notebook. The generated code mirrors the patterns shown in DataDog's reference notebooks at https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks.

The SDK handles lazy project/experiment creation, dataset push diffing, the 5 MB / 1000-record bulk threshold, eval metric streaming, and the status state machine on the user's behalf. This skill must therefore never re-implement those primitives; it just imports `LLMObs` and trusts it.

Usage

/llm-obs-experiment-py-bootstrap [--format py|ipynb] [--dataset <path>] [--dataset-name <name>] [--dataset-version <int>] [--project-name <name>] [--evaluator-style function|class|remote] [--jobs <n>] [--output <path>]
Arguments: $ARGUMENTS

Inputs

All inputs are optional. If the user omits a flag, fall back to the default; never block on prompting for `--jobs`, `--format`, etc.

| Input | Default | Description |
| --- | --- | --- |
| `--format` | `py` | `py` (single `.py` file) or `ipynb` (Jupyter notebook with one cell per section). |
| `--dataset` | none: emit a sample 3-record `records=[...]` inline so the file is runnable as-is | Path to a local `DatasetRecordRaw[]` JSON or CSV. JSON → `create_dataset(records=...)`; CSV → `create_dataset_from_csv(...)`. Mutually exclusive with `--dataset-name`. |
| `--dataset-name` | none | Name of an existing Datadog dataset to fetch at runtime via `LLMObs.pull_dataset(...)`. Use this when the dataset already lives in Datadog (e.g. created in the UI or by a prior run); no local file required. Mutually exclusive with `--dataset`. |
| `--dataset-version` | none (latest) | Pin to a specific dataset version when using `--dataset-name`. Passed through as `pull_dataset(version=N)`. Ignored if `--dataset-name` is not set. |
| `--project-name` | `experiment-<service-name>`, derived from the codebase (see Workflow step 1); falls back to `experiment-sdk-default` only if nothing resolves | Datadog project name (visible in the LLM Experiments UI). The SDK's `ml_app` tag falls back to this automatically, so no separate flag is needed. |
| `--evaluator-style` | `function` | `function` (plain functions, the notebook default), `class` (`BaseEvaluator` subclasses), or `remote` (`RemoteEvaluator` instances). |
| `--jobs` | `10` | Passed to `experiment.run(jobs=N)`. |
| `--output` | `./experiments/experiment.<ext>` | File extension derives from `--format`: `.py` or `.ipynb`. |
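
The flag defaults and the `--dataset` / `--dataset-name` exclusivity could be expressed roughly as follows. This is a minimal sketch using the standard-library `argparse`; the function name `parse_args` is hypothetical, and the skill itself may parse `$ARGUMENTS` differently.

```python
import argparse


def parse_args(argv):
    """Parse the skill's flags, applying the documented defaults (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="llm-obs-experiment-py-bootstrap")
    parser.add_argument("--format", choices=["py", "ipynb"], default="py")
    # --dataset and --dataset-name are mutually exclusive.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--dataset")
    source.add_argument("--dataset-name")
    parser.add_argument("--dataset-version", type=int)
    parser.add_argument("--project-name")
    parser.add_argument(
        "--evaluator-style", choices=["function", "class", "remote"], default="function"
    )
    parser.add_argument("--jobs", type=int, default=10)
    parser.add_argument("--output")
    args = parser.parse_args(argv)
    if args.output is None:
        # The extension follows --format.
        args.output = f"./experiments/experiment.{args.format}"
    if args.dataset_name is None:
        # --dataset-version is ignored unless --dataset-name is set.
        args.dataset_version = None
    return args
```

Calling `parse_args([])` yields every default without prompting, which matches the rule that a missing flag never blocks the run.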

SDK Surface (Cited)

These are the public symbols the generated code uses. All come from `ddtrace.llmobs` (the public package; never from `ddtrace.llmobs._experiment` or other underscore-prefixed modules).

| Import | Source | What it gives you |
| --- | --- | --- |
| `LLMObs` | `ddtrace/llmobs/__init__.py` re-exports `_llmobs.py` | `.enable()`, `.create_dataset()`, `.create_dataset_from_csv()`, `.pull_dataset(dataset_name, project_name, version)`, `.experiment()`, `.async_experiment()` |
| `RemoteEvaluator`, `EvaluatorContext` | `ddtrace/llmobs/__init__.py` | LLM-as-Judge that runs server-side; preferred over inline `LLMJudge` |
| `BaseEvaluator`, `EvaluatorResult` | `ddtrace/llmobs/__init__.py` | Class-based evaluator path (advanced) |
| `LLMJudge` | `ddtrace/llmobs/_evaluators/llm_judge.py` (re-exported) | Inline LLM-as-Judge with prompt template support |

Canonical call signatures (the generated code must match these exactly):
python
LLMObs.enable(
    api_key=os.getenv("DD_API_KEY"),
    app_key=os.getenv("DD_APPLICATION_KEY"),
    site=os.getenv("DD_SITE", "datadoghq.com"),  # set DD_SITE for any site other than the default (e.g. datad0g.com, datadoghq.eu)
    project_name="<project>",
    agentless_enabled=True,  # required when not running behind the dd-agent
)

Note: `ml_app` is not a separate input. The SDK derives it from `project_name` when not supplied. If a user really wants to override it later, they can add `ml_app="..."` to `enable()` themselves.

dataset = LLMObs.create_dataset(
    dataset_name="<name>",
    description="<optional>",
    records=[
        # Per-record `tags` MUST be a list of "key:value" strings (e.g. "env:smoke"),
        # never bare strings; the SDK rejects malformed tags with a ValueError on append.
        {"input_data": {"<k>": "<v>"}, "expected_output": "<v>", "metadata": {}, "tags": ["env:<env>"]},
        # ...
    ],
)

OR

dataset = LLMObs.create_dataset_from_csv(
    csv_path="<path>",
    dataset_name="<name>",
    input_data_columns=["<col1>", "<col2>"],
    expected_output_columns=["<col>"],
)

OR pull an existing Datadog dataset by name (no local file needed)

dataset = LLMObs.pull_dataset(
    dataset_name="<name>",
    project_name="<project>",  # optional; defaults to the project set in enable()
    version=2,  # optional; pin a version, or omit for the latest
)

def task_fn(input_data: dict, config: dict):
    # TODO(user): replace with your actual LLM call
    ...

Plain function evaluator (default style)

def exact_match(input_data, output_data, expected_output) -> bool:
    return output_data == expected_output

experiment = LLMObs.experiment(
    name="<experiment_name>",
    dataset=dataset,
    task=task_fn,
    evaluators=[exact_match],
    config={
        "model": "gpt-4o-mini",
        "temperature": 0.0,
        # Provenance also lives in `config` so it renders in the
        # experiment's Configuration view alongside model/temperature.
        # `tags=` below only reaches metadata.tags, which the current UI
        # does not surface as chips; config is what users actually see.
        "generated_by": "claude-code",
        "skill": "llm-obs-experiment-py-bootstrap",
    },
    description="<optional>",
    tags={
        # Same provenance, sent to experiment metadata.tags for any future
        # tag-filter UI / API consumers. Always emitted alongside the
        # config copy, never one without the other.
        "generated_by": "claude-code",
        "skill": "llm-obs-experiment-py-bootstrap",
    },
)

experiment.run(jobs=10)
print(experiment.url)
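
Because per-record `tags` must be `"key:value"` strings, a pre-flight check before embedding records can catch malformed tags early. The sketch below is illustrative only: `find_malformed_tags` is a hypothetical helper, and it assumes "well-formed" means a non-empty key and a non-empty value separated by the first colon, which may not match the SDK's own validation exactly.

```python
def find_malformed_tags(records):
    """Return (record_index, tag) pairs whose tag is not a "key:value" string."""
    bad = []
    for index, record in enumerate(records):
        for tag in record.get("tags", []):
            if not isinstance(tag, str):
                bad.append((index, tag))
                continue
            # partition splits on the first colon only.
            key, sep, value = tag.partition(":")
            if not (sep and key and value):
                bad.append((index, tag))
    return bad
```

An empty return value means every tag passed this assumed rule; anything else can be reported to the user before the file is written.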

---

Evaluator Styles

Generated code uses one of three evaluator surfaces, picked by `--evaluator-style`. Whichever style is chosen, prefer returning `EvaluatorResult` over a bare `bool`/`float` whenever the evaluator has any signal beyond the raw value; see "Return EvaluatorResult, not bare values" below.

Return `EvaluatorResult`, not bare values

Plain functions are allowed to return `bool`/`float`/`dict`, and `BaseEvaluator.evaluate()` is allowed to return raw `JSONType`. The SDK accepts both, but `EvaluatorResult` carries fields the Datadog UI surfaces in ways the raw value cannot:

| Field | Type | Used by Datadog UI for |
| --- | --- | --- |
| `value` | `bool` / `float` / `str` / `dict` (JSONType) | The score itself, shown on the experiment metric. Required. |
| `reasoning` | `str` | Per-record explanation shown in the compare UI; lets reviewers see why an evaluator passed/failed without re-running the LLM. |
| `assessment` | `str` (e.g. `"pass"` / `"fail"` / `"partial"`) | Determines whether a metric trend going up vs. down is an improvement; the UI uses this to color baseline-vs-candidate comparisons. |
| `metadata` | `dict[str, JSONType]` | Free-form per-record context (e.g. `{"confidence": 0.95}`); shown in record drill-down. |
| `tags` | `dict[str, JSONType]` | Used to slice experiment results in the UI (e.g. `{"category": "accuracy"}`). |

The generated code should default to `EvaluatorResult` for any evaluator richer than a one-line equality check. Trivial checks such as `exact_match` (shown below) or a simple `length_under_500` are the only cases where a bare `bool` is acceptable.
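
To illustrate how the fields fit together, here is a sketch of a JSON-shape check that fills in `value`, `reasoning`, `assessment`, `metadata`, and `tags`. So that it runs without `ddtrace` installed, it uses a local stand-in dataclass (`StandInResult`, hypothetical) in place of `EvaluatorResult`; generated code would import the real class from `ddtrace.llmobs` instead. It also assumes `expected_output` is a dict whose top-level keys define the expected shape.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class StandInResult:
    """Local stand-in with the same field names as the table; not the SDK class."""
    value: Any
    reasoning: Optional[str] = None
    assessment: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    tags: Dict[str, Any] = field(default_factory=dict)


def json_keys_match(input_data, output_data, expected_output) -> StandInResult:
    """Check that output_data is a JSON object containing every expected top-level key."""
    try:
        parsed = json.loads(output_data) if isinstance(output_data, str) else output_data
    except json.JSONDecodeError as exc:
        return StandInResult(
            value=False,
            reasoning=f"output is not valid JSON: {exc.msg}",
            assessment="fail",
        )
    if not isinstance(parsed, dict):
        return StandInResult(
            value=False,
            reasoning=f"expected a JSON object, got {type(parsed).__name__}",
            assessment="fail",
        )
    missing = sorted(set(expected_output) - set(parsed))
    if missing:
        return StandInResult(
            value=False,
            reasoning=f"missing keys: {missing}",
            assessment="fail",
            metadata={"missing": missing},
            tags={"category": "format"},
        )
    return StandInResult(
        value=True,
        reasoning="all expected keys present",
        assessment="pass",
        tags={"category": "format"},
    )
```

Every failure branch carries a `reasoning` string, so a reviewer can see why a record failed without re-running the task.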

`function` (default, what the notebooks use)

Plain Python functions with the signature `(input_data, output_data, expected_output)`. Always emit at least three: a trivial boolean (returns `bool`), a richer rule-based one (returns `EvaluatorResult`), and an LLM-as-Judge surrogate (a `RemoteEvaluator` reference or a placeholder).
python
from ddtrace.llmobs import EvaluatorResult

Trivial check — bare bool is fine here, the result has no extra signal.

def exact_match(input_data, output_data, expected_output) -> bool:
    return output_data == expected_output

Richer check — use EvaluatorResult so reasoning/assessment surface in the UI.

def response_well_formed(input_data, output_data, expected_output) -> EvaluatorResult:
    if not isinstance(output_data, str):
        return EvaluatorResult(
            value=False,
            reasoning=f"output_data was {type(output_data).__name__}, expected str",
            assessment="fail",
        )
    if len(output_data) > 500:
        return EvaluatorResult(
            value=False,
            reasoning=f"output exceeded 500 chars (was {len(output_data)})",
            assessment="fail",
            metadata={"length": len(output_data)},
        )
    return EvaluatorResult(value=True, assessment="pass")

`class` (advanced, for evaluators that need state or async I/O)

Always return `EvaluatorResult` from `evaluate()`, never a bare value. State-bearing evaluators usually have richer reasoning to surface anyway.
python
from ddtrace.llmobs import BaseEvaluator, EvaluatorContext, EvaluatorResult

class FaithfulnessJudge(BaseEvaluator):
    def __init__(self):
        super().__init__(name="faithfulness")
        # TODO(user): initialize any client or state here

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # context exposes: input_data, output_data, expected_output, metadata
        # TODO(user): replace placeholder logic with your faithfulness check
        passed = context.output_data is not None
        return EvaluatorResult(
            value=1.0 if passed else 0.0,
            reasoning="placeholder — replace with your faithfulness rubric",
            assessment="pass" if passed else "fail",
            metadata={"evaluator_version": "v1"},
        )

`remote` (LLM-as-Judge running server-side)

python
from ddtrace.llmobs import RemoteEvaluator

Create the judge in Datadog UI first: LLM Observability → Evaluations → New Evaluator

quality_judge = RemoteEvaluator(eval_name="<name-from-datadog-ui>")

Optional: customize the payload the judge receives

custom_judge = RemoteEvaluator(
    eval_name="<name>",
    transform_fn=lambda ctx: {
        "question": ctx.input_data.get("question"),
        "answer": ctx.output_data,
        "reference": ctx.expected_output,
    },
)

---

Generated File Structure

The same section sequence applies in both formats. In `.py` these become comment banners; in `.ipynb` each becomes one markdown cell + one code cell.

1. Env setup: load_dotenv(), os.getenv reads, hard assert keys present
2. LLMObs.enable(): explicit api_key/app_key/project_name/agentless_enabled
3. Dataset: inline records, create_dataset_from_csv, OR pull_dataset
4. Task function: placeholder OpenAI call with # TODO(user) marker
5. Evaluators: 2-3 in the requested style
6. Experiment: LLMObs.experiment(config={..., "generated_by": "claude-code", ...}, tags={"generated_by": "claude-code", ...})
7. Run: experiment.run(jobs=N); print(experiment.url)
8. Results inspection: experiment.as_dataframe() if pandas, else print
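
The "hard assert keys present" part of section 1 might look like the sketch below. It is a minimal illustration: `read_datadog_env` and `REQUIRED_ENV` are hypothetical names, and the `load_dotenv()` call from `python-dotenv` is left out so the block runs with the standard library alone.

```python
import os

# Keys the generated file cannot run without.
REQUIRED_ENV = ("DD_API_KEY", "DD_APPLICATION_KEY")


def read_datadog_env(environ=os.environ):
    """Return Datadog settings, failing fast if a required key is missing or empty."""
    missing = [name for name in REQUIRED_ENV if not environ.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {
        "api_key": environ["DD_API_KEY"],
        "app_key": environ["DD_APPLICATION_KEY"],
        # DD_SITE is optional and falls back to the default site.
        "site": environ.get("DD_SITE", "datadoghq.com"),
    }
```

Failing before `LLMObs.enable()` is called keeps a missing key from surfacing later as a less obvious authentication error.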

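
For the `.ipynb` format, pairing one markdown cell with one code cell per section can be sketched as follows. `build_notebook` is a hypothetical helper, and the metadata values simply copy the schema given in Workflow step 4; this is not necessarily how the skill assembles its output.

```python
import json


def build_notebook(sections):
    """Build notebook JSON from (title, code) pairs: one markdown + one code cell each."""
    cells = []
    for number, (title, code) in enumerate(sections, start=1):
        cells.append(
            {"cell_type": "markdown", "metadata": {}, "source": [f"## {number}. {title}\n"]}
        )
        cells.append(
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                # Notebook sources are stored as a list of lines.
                "source": code.splitlines(keepends=True),
            }
        )
    return {
        "cells": cells,
        "metadata": {
            "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
            "language_info": {"name": "python", "version": "3.10"},
        },
        "nbformat": 4,
        "nbformat_minor": 5,
    }
```

Writing the result with `json.dump(notebook, fh, indent=1)` produces a file the syntax check in Workflow step 5 can load.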

Workflow

  1. Parse arguments. Default `--format py`. Resolve `--output` extension from `--format`.

     If `--project-name` is not provided, resolve a default of the form `experiment-<service-name>` by walking these sources in order, taking the first match:

     1. `pyproject.toml` → `[project] name` (PEP 621) or `[tool.poetry] name`.
     2. `setup.cfg` → `[metadata] name`.
     3. `setup.py` → first `name="..."` argument to `setup(...)`.
     4. `package.json` → `"name"` (useful when the Python package lives inside a TS/JS monorepo).
     5. The basename of the current working directory, lowercased and slugified (`/^[a-z0-9-]+$/`; replace non-matching chars with `-`).

     The final project name is `experiment-<service-name>`. Strip a leading `experiment-` from `<service-name>` if it already starts with one (so a package literally named `experiment-foo` yields `experiment-foo`, not `experiment-experiment-foo`). If none of the five sources resolve to a non-empty string, fall back to `experiment-sdk-default` and emit a warning in the next-steps output that the user should set `--project-name` explicitly.

     Embed the resolved name as a string literal in the generated `PROJECT_NAME = "..."` line. Don't emit runtime `os.getcwd()` lookups, since the user may run the file from a different directory than where the skill resolved it.
  2. Resolve the dataset source. Error out if both `--dataset` and `--dataset-name` are passed; they're mutually exclusive.

     • `--dataset <path>` (local file → inline records or CSV loader):
       • Read the file. If JSON, validate top-level array of `DatasetRecordRaw` shape (`input_data`, optional `expected_output`, `metadata`, `tags`). If CSV, parse header and auto-detect columns using the `dataset-bootstrap` heuristics: `prompt|input|query|question` → input, `expected|gold|truth|answer` → expected.
       • Run a PII scrub (email/phone/SSN/API-key regexes) on all string values; replace matches with `<REDACTED:pii-type>` and surface a warning listing affected indices.
       • For JSON datasets, embed the records inline in the generated file (`records=[...]`) so the user has a single self-contained artifact. For CSV datasets, emit `LLMObs.create_dataset_from_csv(csv_path="<absolute path>", ...)` and tell the user the CSV needs to be present at runtime.
     • `--dataset-name <name>` (existing Datadog dataset → runtime pull):
       • Emit `LLMObs.pull_dataset(dataset_name="<name>", project_name="<project>"[, version=<n>])` in place of any `create_dataset*` call. The fetch happens when the generated experiment runs; the skill itself does not call Datadog.
       • Pass `version=<n>` through only if `--dataset-version` was set; otherwise omit it so the SDK resolves the latest.
       • Add a one-line comment above the call documenting what's being pulled, e.g. `# Pulled from Datadog: dataset_name="qa_v3", version=latest`.
       • Skip the PII scrub and the inline-records emission; there are no local records to scrub.
     • Neither flag given:
       • Fall back to the inline 3-record sample described under `--dataset`'s default, so the generated file remains runnable as-is.

     Note on dataset IDs. The public SDK's `LLMObs.pull_dataset(...)` takes a name, not an ID, so there's no `--dataset-id` flag. If a user only has a dataset ID from a Datadog UI URL (`/llm/datasets/<id>`), the workflow is: open that URL in the UI, copy the dataset name, and pass it as `--dataset-name`. The skill must not import `ddtrace.llmobs._experiment` or any other underscore module to work around this.
  3. Pick evaluator template based on `--evaluator-style`:

     • `function`: 3 plain functions: one trivial boolean (`exact_match`-style, bare `bool` OK), one richer rule-based check returning `EvaluatorResult` with `reasoning` + `assessment`, and one LLM-as-Judge surrogate. If `--dataset` had structured `expected_output`, add a JSON-shape check (also returning `EvaluatorResult`).
     • `class`: 2 `BaseEvaluator` subclasses with `evaluate(self, context: EvaluatorContext) -> EvaluatorResult`. Always return `EvaluatorResult` (never a bare value); state-bearing evaluators have richer signal to surface.
     • `remote`: 1-2 `RemoteEvaluator(eval_name=...)` instances with a comment instructing the user to create the judge in the Datadog UI first.

     In all styles: any evaluator with non-trivial logic must return `EvaluatorResult` populating at minimum `value` + `reasoning` + `assessment` (see the "Return `EvaluatorResult`, not bare values" section). The compare UI uses `reasoning` for per-record drill-downs and `assessment` to determine whether a metric trend is an improvement.
  4. Emit the file.

     For `.py`: single file, one blank line between sections, banner comments like:

     python
     # ─── 3. Dataset ───────────────────────────────────────────────────────────────

     Use `from __future__ import annotations` and `from typing import Any, Dict` at the top. Type-hint task and evaluator function signatures.

     For `.ipynb`: valid Jupyter notebook JSON. Schema:

     json
     {
       "cells": [
         {"cell_type": "markdown", "metadata": {}, "source": ["## 1. Env setup\n", "..."]},
         {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["..."]},
         ...
       ],
       "metadata": {
         "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
         "language_info": {"name": "python", "version": "3.10"}
       },
       "nbformat": 4,
       "nbformat_minor": 5
     }

     One markdown cell + one code cell per section. Keep each code cell self-contained enough that re-running it in isolation makes sense.
  5. Best-effort syntax check via Bash. Don't fail the skill if the toolchain is missing; just report.
     • `.py`: `python -m py_compile <path>`
     • `.ipynb`: `python -c "import json; nb = json.load(open('<path>')); assert nb.get('cells'); print(f'cells={len(nb[\"cells\"])}')"`
  6. Print next-steps (see Output section).
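
The project-name resolution in step 1 can be sketched as a chain of lookups that stops at the first non-empty result. All function names here are hypothetical, the `setup.py` lookup is a plain regex rather than a real parse, and `pyproject.toml` is skipped on interpreters without the standard-library `tomllib` (Python 3.11+).

```python
import configparser
import json
import re
from pathlib import Path

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    tomllib = None  # older interpreters: the pyproject.toml source is skipped


def _from_pyproject(root):
    path = root / "pyproject.toml"
    if tomllib is None or not path.is_file():
        return None
    data = tomllib.loads(path.read_text(encoding="utf-8"))
    return (data.get("project", {}).get("name")
            or data.get("tool", {}).get("poetry", {}).get("name"))


def _from_setup_cfg(root):
    path = root / "setup.cfg"
    if not path.is_file():
        return None
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path, encoding="utf-8")
    return parser.get("metadata", "name", fallback=None)


def _from_setup_py(root):
    path = root / "setup.py"
    if not path.is_file():
        return None
    # Approximation: first name="..." literal anywhere in the file.
    match = re.search(r"""name\s*=\s*["']([^"']+)["']""", path.read_text(encoding="utf-8"))
    return match.group(1) if match else None


def _from_package_json(root):
    path = root / "package.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8")).get("name")


def _from_dirname(root):
    # Lowercase, then replace anything outside [a-z0-9-] with "-".
    return re.sub(r"[^a-z0-9-]", "-", root.resolve().name.lower())


def resolve_project_name(root):
    """Return experiment-<service-name>, or experiment-sdk-default if nothing resolves."""
    root = Path(root)
    sources = (_from_pyproject, _from_setup_cfg, _from_setup_py,
               _from_package_json, _from_dirname)
    for source in sources:
        name = source(root)
        if name and name.startswith("experiment-"):
            name = name[len("experiment-"):]  # avoid experiment-experiment-foo
        if name:
            return f"experiment-{name}"
    return "experiment-sdk-default"
```

The returned string would then be embedded as the `PROJECT_NAME = "..."` literal, so the generated file never depends on the directory it is later run from.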

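
The PII scrub from step 2 might be approached as below. The regexes are deliberately simple placeholders (the `sk-` API-key shape in particular is an assumption), and `scrub_value`, `scrub_records`, and `PII_PATTERNS` are hypothetical names; the patterns the skill actually applies are not specified in this document.

```python
import re

# Illustrative patterns only; real coverage would need to be broader.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "api-key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
}


def scrub_value(value, hits):
    """Recursively redact PII in strings nested inside dicts and lists."""
    if isinstance(value, str):
        for pii_type, pattern in PII_PATTERNS.items():
            value, count = pattern.subn(f"<REDACTED:{pii_type}>", value)
            if count:
                hits.add(pii_type)
        return value
    if isinstance(value, dict):
        return {key: scrub_value(item, hits) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub_value(item, hits) for item in value]
    return value


def scrub_records(records):
    """Return (scrubbed_records, {record_index: sorted PII types found})."""
    scrubbed, affected = [], {}
    for index, record in enumerate(records):
        hits = set()
        scrubbed.append(scrub_value(record, hits))
        if hits:
            affected[index] = sorted(hits)
    return scrubbed, affected
```

The second return value gives the affected indices for the warning, while the first is what would be embedded as `records=[...]`.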

What the Generated Code MUST NOT Do

A reviewer should be able to run these `grep` checks against the generated file and get zero matches:

| `grep` pattern | Why it's wrong |
| --- | --- |
| `uuid4`, `uuid.uuid4` | `record_id` is minted by the SDK on `dataset.append()`; never client-generate. |
| `PATCH `, `batch_update`, `records/upload` | Status state machine and dataset diff are SDK responsibilities. |
| `from ddtrace.llmobs._` | Private import paths. Always use `from ddtrace.llmobs import ...`. |
| `"record_id"`, `"canonical_id"` (as dict keys in records) | The SDK owns them. |
| `DD_API_KEY = "<actual key>"` | Always read from `os.environ`. |
| `requests.post`, `httpx.post` | The skill produces SDK-only code. Direct HTTP calls bypass the SDK's lazy creation, push-diff, and bulk-threshold handling. |

If any of those slip into the output, the skill is wrong; re-emit.
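
The same checks can be run programmatically before the file is handed to the user. This is a sketch under assumptions: the regexes approximate the listed patterns, and `find_violations` and the rule labels are hypothetical names, not part of the skill.

```python
import re

# Regex approximations of the forbidden patterns; labels are illustrative.
FORBIDDEN = {
    "client-generated ids": re.compile(r"uuid4"),
    "raw dataset/status endpoints": re.compile(r"PATCH |batch_update|records/upload"),
    "private import path": re.compile(r"from ddtrace\.llmobs\._"),
    "SDK-owned record keys": re.compile(r"\"(record_id|canonical_id)\""),
    "hard-coded API key": re.compile(r"DD_API_KEY\s*=\s*[\"']"),
    "direct HTTP call": re.compile(r"(requests|httpx)\.post"),
}


def find_violations(source):
    """Return (line_number, rule_name) for every forbidden pattern found in source."""
    violations = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for rule_name, pattern in FORBIDDEN.items():
            if pattern.search(line):
                violations.append((line_number, rule_name))
    return violations
```

An empty list corresponds to the "zero matches" condition; any entry is a signal to regenerate the file.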


Output

After writing, print:
Generated SDK experiment: <format>
Path: <path>
Lines: <count>   (or Cells: <count> for .ipynb)

SDK calls used:
  ✓ LLMObs.enable(...)                       (line/cell ~<N>)
  ✓ LLMObs.<create_dataset|create_dataset_from_csv|pull_dataset>(...)  (line/cell ~<N>)
  ✓ task_fn(input_data, config)              (line/cell ~<N>)
  ✓ <N> evaluators (style: <function|class|remote>)
  ✓ LLMObs.experiment(...).run(jobs=<N>)     (line/cell ~<N>)
  ✓ Provenance (in config + tags): generated_by=claude-code, skill=llm-obs-experiment-py-bootstrap

Syntax check: <pass | skipped: toolchain missing | fail with details>

Install:
  pip install "ddtrace>=4.7" python-dotenv openai

Environment variables (required at runtime):
  export DD_API_KEY=...
  export DD_APPLICATION_KEY=...
  export DD_SITE=datadoghq.com
  export OPENAI_API_KEY=...   # only if you keep the placeholder task

Run:
  python <path>                  # for --format py
  jupyter notebook <path>        # for --format ipynb

Next steps:
1. Replace the placeholder task_fn with your actual LLM call.
2. Adjust the evaluators (or wire up RemoteEvaluator names you created in the Datadog UI).
3. Run it. The script prints experiment.url at the end.
4. Watch the experiment: https://app.datadoghq.com/llm/experiments


Reference Notebook Patterns (use as templates)

The canonical set lives at https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks and serves as the style reference — the generated code should feel like it could have come from this set.
| Notebook | Pattern demonstrated |
|---|---|
| `00-basic-datasets.ipynb` | Dataset create/append/push lifecycle |
| `01-basic-experiments.ipynb` | Minimum viable experiment: inline records, OpenAI task, 2 boolean evaluators |
| `02-extra-data.ipynb` | CSV-loaded dataset, multi-value task output, confidence-based evaluators |
| `04-multi-span-experiments.ipynb` | Two-step LLM pipelines inside a single `task_fn` |
| `07-remote-evaluators.ipynb` | `RemoteEvaluator` with custom `transform_fn` |

When `--evaluator-style remote`, lean toward the `07` style. When `--dataset` is a CSV, lean toward `02`. Default (no `--dataset`, `--evaluator-style function`) is the `01` style.
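
The selection rule above can be written down as a short function. This is only a sketch: the name `pick_reference_notebook` is hypothetical, and the precedence when both conditions hold (remote evaluators with a CSV dataset) is an assumption, since the text does not say which style wins.

```python
from typing import Optional


def pick_reference_notebook(evaluator_style: str = "function",
                            dataset: Optional[str] = None) -> str:
    """Map the skill's flags to the reference notebook whose style to follow."""
    # Assumption: the remote-evaluator style takes precedence over the CSV style.
    if evaluator_style == "remote":
        return "07-remote-evaluators.ipynb"
    if dataset is not None and dataset.lower().endswith(".csv"):
        return "02-extra-data.ipynb"
    # Default: no --dataset and --evaluator-style function.
    return "01-basic-experiments.ipynb"
```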

Datadog Documentation

These are the canonical reference pages on https://docs.datadoghq.com/. Use them to ground answers about LLM Observability features and to look up details that aren't covered in this skill.
| Topic | URL | Use when |
|---|---|---|
| LLM Observability overview | https://docs.datadoghq.com/llm_observability/ | Establishing what the product covers, terminology |
| Setup | https://docs.datadoghq.com/llm_observability/setup/ | API/app key creation, project + ml_app setup, region/site selection |
| Instrumentation overview | https://docs.datadoghq.com/llm_observability/instrumentation/ | Auto-instrumentation, manual SDK usage, span model |
| Python SDK reference | https://docs.datadoghq.com/llm_observability/instrumentation/sdk/ | Public symbol list, decorator semantics, span kinds, annotate/enable signatures |
| Experiments | https://docs.datadoghq.com/llm_observability/experiments/ | `LLMObs.experiment(...)`, dataset lifecycle, eval streaming, status states |
| Evaluations | https://docs.datadoghq.com/llm_observability/evaluations/ | Evaluator concepts, managed vs custom evaluators |
| Custom LLM-as-a-judge evaluations | https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/ | `RemoteEvaluator` payload shape and rubric design |
| Managed evaluations | https://docs.datadoghq.com/llm_observability/evaluations/managed_evaluations/ | Pre-built judges (faithfulness, toxicity, etc.) |
| Monitoring | https://docs.datadoghq.com/llm_observability/monitoring/ | Alerts, dashboards, span-level monitors |
| Terms / glossary | https://docs.datadoghq.com/llm_observability/terms/ | Span kinds, sessions, traces, ml_app |
| Evaluation developer guide | https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide/ | Writing offline evaluators, validation strategy |
| Claude Code skills guide | https://docs.datadoghq.com/llm_observability/guide/claude_code_skills/ | How this skill fits alongside the rest of the `dd-llmo` set |
| MCP server | https://docs.datadoghq.com/llm_observability/mcp_server/ | Connecting MCP-compatible clients to LLM Obs data |
| Reference notebooks (GitHub) | https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks | Style-of-life examples for the generated `.py` / `.ipynb` |

Researching features the skill does not cover

If the user asks about an LLM Observability feature the skill's body doesn't address (e.g., specific span kinds, dataset versioning semantics, an evaluator type not covered above), fetch the relevant page from `docs.datadoghq.com` rather than guessing:
  1. Pick the most specific URL from the table above. Most LLM Obs questions resolve under `/llm_observability/{experiments,evaluations,instrumentation,monitoring}/`.
  2. Use `WebFetch` on that URL with a focused query (e.g., `"How does Dataset.push() handle the 5 MB threshold?"`). Prefer `WebFetch` over generic web search, since the canonical page is almost always under `docs.datadoghq.com/llm_observability/`.
  3. Fall back to `WebSearch` with `site:docs.datadoghq.com/llm_observability` if you don't know which subpage owns the topic.
  4. Cite the page in the answer with its URL so the user can verify and bookmark.
Never invent symbols or behaviors not present in this skill body or the docs above. If the docs don't cover the question either, say so explicitly and suggest filing an issue on `DataDog/llm-observability` rather than fabricating a workaround.

Operating Rules

  • SDK only. No `requests.post`, no manual JSON:API envelope construction, no manual ID generation. If a feature seems to require those, you're solving the wrong problem: the SDK already covers it.
  • Public imports only. `from ddtrace.llmobs import ...`. Never `_experiment`, `_llmobs`, or any underscore-prefixed module.
  • Env vars, not literals. Credentials always read from `os.environ`. The generated `main()` (or the env-setup cell) must `assert` they're set with a clear message.
  • Always pass `site=` to `LLMObs.enable()`. Read it from `os.getenv("DD_SITE", "datadoghq.com")`. Omitting `site=` silently defaults to US1 prod, which breaks every org that is not on US1 prod (e.g. staging `datad0g.com`, EU `datadoghq.eu`). The canonical signature already includes it, so never drop it.
  • Per-record `tags` are `"key:value"` strings. When inlining records (whether from `--dataset` JSON, CSV, or the default sample), each entry in a record's `"tags"` list must be a `"key:value"` string like `"env:prod"`, `"source:traces"`, `"category:geography"`. Bare strings (`"smoke"`, `"baseline"`) trigger `ValueError: Tag '<name>' is malformed.` at `Dataset.append()` time. If the source data has bare-string tags, namespace them, e.g. wrap `"smoke"` as `"tag:smoke"` rather than dropping it.
  • `# TODO(user)` markers on the placeholder task and on at least one evaluator so reviewers can't ship the placeholder by accident.
  • Match notebook conventions. Plain function evaluators by default; class-based only when the user opts in. Print `experiment.url` at the end of every generated file.
  • Tag every experiment with provenance, in both `config` and `tags`. Every `LLMObs.experiment(...)` call must carry `"generated_by": "claude-code"` and `"skill": "llm-obs-experiment-py-bootstrap"` as keys in both the `config={...}` dict (so they render in the experiment's Configuration view, which is where users actually look) and the `tags={...}` dict (which the SDK serializes into `metadata.tags` for future tag-filter consumers). The `tags=` path alone is not enough: the current LLM Experiments UI does not surface `metadata.tags` as filterable chips, so users won't see the provenance unless it's also in `config`. If a user later edits the generated file to add their own keys, they extend both dicts; never replace the provenance keys silently.
  • PII scrub at the door. If `--dataset` is given, scrub before inlining into the generated file. Never embed a record that contains an unmasked email/phone/SSN/API-key pattern.
  • Don't generate `requirements.txt` or `pyproject.toml`. Print the `pip install` command in the next-steps message instead, since most users already have a venv.
  • No silent fallbacks. If `--format` is unsupported, error out with the valid choices.
  • Python only. If a user passes `--language typescript` (or any non-Python language flag), error out: this skill produces Python `ddtrace.llmobs` SDK code only.
  • Research, don't invent. If the user asks about an LLM Observability feature, span kind, evaluator type, or SDK symbol that is not documented in this skill body, `WebFetch` the relevant `docs.datadoghq.com/llm_observability/*` page (see the Datadog Documentation table above for the canonical URLs) before answering. Cite the page URL in the response. If the docs don't cover the topic, say so explicitly; never fabricate symbols, flags, or behaviors.
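
Two of the rules above, the `"key:value"` tag format and the provenance keys, lend themselves to small helpers. The sketch below is illustrative only: `normalize_tags`, `with_provenance`, and the sample user keys (`model`, `team`) are hypothetical names, and the "non-empty key and value around a colon" test is this sketch's reading of the rule, not a statement of how the SDK validates tags.

```python
# Provenance keys that every LLMObs.experiment(...) call must carry.
PROVENANCE = {
    "generated_by": "claude-code",
    "skill": "llm-obs-experiment-py-bootstrap",
}


def normalize_tags(tags):
    """Namespace bare-string tags as 'tag:<value>' so every entry is 'key:value'."""
    normalized = []
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep and key and value:
            normalized.append(tag)  # already in key:value form
        else:
            normalized.append("tag:" + tag.strip(":"))  # keep the tag, add a namespace
    return normalized


def with_provenance(user_values=None):
    """Merge user-supplied keys with the provenance keys; provenance is never replaced."""
    merged = dict(user_values or {})
    merged.update(PROVENANCE)
    return merged


# The same helper extends both dicts passed to LLMObs.experiment(config=..., tags=...).
config = with_provenance({"model": "example-model"})  # hypothetical user key
tags = with_provenance({"team": "example-team"})      # hypothetical user key
record_tags = normalize_tags(["env:prod", "smoke"])
```

Because `with_provenance` applies the provenance keys last, a user key named `generated_by` cannot silently overwrite them.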
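
The "Env vars, not literals" and "Always pass `site=`" rules can be combined into one environment-reading step. This is a minimal sketch, assuming the environment variable names listed in the output summary; the helper name `read_datadog_env` is hypothetical, and the keyword names `api_key` and `app_key` in the returned dict are an assumption about `LLMObs.enable()` that should be checked against the Python SDK reference.

```python
import os

REQUIRED_ENV_VARS = ("DD_API_KEY", "DD_APPLICATION_KEY")


def read_datadog_env(environ=os.environ):
    """Assert credentials are set and return the values needed to enable the SDK."""
    for name in REQUIRED_ENV_VARS:
        assert environ.get(name), f"{name} is not set; export it before running this experiment."
    return {
        "api_key": environ["DD_API_KEY"],          # assumed keyword name
        "app_key": environ["DD_APPLICATION_KEY"],  # assumed keyword name
        # site is always passed explicitly, falling back to US1 only when DD_SITE is unset.
        "site": environ.get("DD_SITE", "datadoghq.com"),
    }


# In a generated file the result would feed the SDK, for example:
#   LLMObs.enable(**read_datadog_env(), ...)
```

Passing the mapping as a parameter keeps the check testable without touching the real process environment.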