LLM Obs Experiment (Python) Bootstrap: Generate a Python Experiment Using `ddtrace.llmobs`
Produce a single self-contained Python experiment that uses the official `ddtrace.llmobs` SDK. Output is either a `.py` script or an `.ipynb` notebook. The generated code mirrors the patterns shown in Datadog's reference notebooks at
https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks.
The SDK handles lazy project/experiment creation, dataset push diffing, the 5 MB / 1000-record bulk threshold, eval metric streaming, and the status state machine on the user's behalf. This skill must therefore never re-implement those primitives; it just imports `ddtrace.llmobs` and trusts it.
Usage
/llm-obs-experiment-py-bootstrap [--format py|ipynb] [--dataset <path>] [--dataset-name <name>] [--dataset-version <int>] [--project-name <name>] [--evaluator-style function|class|remote] [--jobs <n>] [--output <path>]
Arguments: $ARGUMENTS
Inputs
All inputs are optional. If the user omits a flag, fall back to the default; never block on prompting for a missing value.
| Input | Default | Description |
|---|---|---|
| `--format` | | `py` (single file) or `ipynb` (Jupyter notebook with one cell per section). |
| `--dataset` | none; emit a sample 3-record dataset inline so the file is runnable as-is | Path to a local JSON or CSV. JSON → `create_dataset(records=...)`; CSV → `create_dataset_from_csv(...)`. Mutually exclusive with `--dataset-name`. |
| `--dataset-name` | none | Name of an existing Datadog dataset to fetch at runtime via `LLMObs.pull_dataset(...)`. Use this when the dataset already lives in Datadog (e.g. created in the UI or by a prior run); no local file required. Mutually exclusive with `--dataset`. |
| `--dataset-version` | none (latest) | Pin to a specific dataset version when using `--dataset-name`. Passed through as `version=`. Ignored if `--dataset-name` is not set. |
| `--project-name` | `experiment-<service-name>`, derived from the codebase (see Workflow step 1); falls back to a fixed default name only if nothing resolves | Datadog project name (visible in the LLM Experiments UI). The SDK's `ml_app` tag falls back to this automatically; no separate flag needed. |
| `--evaluator-style` | `function` | `function` (plain functions, the notebook default), `class` (`BaseEvaluator` subclasses), or `remote` (`RemoteEvaluator` instances). |
| `--jobs` | | Passed to `experiment.run(jobs=...)`. |
| `--output` | `./experiments/experiment.<ext>` | File extension derives from `--format`: `.py` or `.ipynb`. |
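If the flag handling were expressed in code rather than left to the agent, it could look roughly like the sketch below. It uses only the flag names from the Usage line; the `build_parser` and `parse_flags` helpers are hypothetical, and defaults are deliberately left as `None` so the caller applies whatever the Inputs table specifies.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the flags listed in the Usage line (defaults left unset)."""
    parser = argparse.ArgumentParser(prog="llm-obs-experiment-py-bootstrap")
    parser.add_argument("--format", choices=["py", "ipynb"])
    # --dataset and --dataset-name are mutually exclusive.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--dataset")
    source.add_argument("--dataset-name")
    parser.add_argument("--dataset-version", type=int)
    parser.add_argument("--project-name")
    parser.add_argument("--evaluator-style", choices=["function", "class", "remote"])
    parser.add_argument("--jobs", type=int)
    parser.add_argument("--output")
    return parser


def parse_flags(argv):
    """Parse argv and drop --dataset-version when --dataset-name is absent."""
    args = build_parser().parse_args(argv)
    if args.dataset_name is None:
        args.dataset_version = None
    return args
```

Passing both `--dataset` and `--dataset-name` makes `argparse` exit with a usage error, which matches the "error out" behavior described for mutually exclusive inputs.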
SDK Surface (Cited)
These are the public symbols the generated code uses. All come from `ddtrace.llmobs` (the public package; never from `ddtrace.llmobs._experiment` or other underscore-prefixed modules).
| Import | Source | What it gives you |
|---|---|---|
| `LLMObs` | `ddtrace/llmobs/__init__.py` re-exports | `.enable()`, `.create_dataset()`, `.create_dataset_from_csv()`, `.pull_dataset(dataset_name, project_name, version)`, `.experiment()` |
| `RemoteEvaluator` | `ddtrace/llmobs/__init__.py` | LLM-as-Judge that runs server-side; preferred over an inline judge |
| `BaseEvaluator`, `EvaluatorContext` | `ddtrace/llmobs/__init__.py` | Class-based evaluator path (advanced) |
| | `ddtrace/llmobs/_evaluators/llm_judge.py` (re-exported) | Inline LLM-as-Judge with prompt template support |
Canonical call signatures (must match the generated code exactly):
```python
import os

from ddtrace.llmobs import LLMObs

LLMObs.enable(
    api_key=os.getenv("DD_API_KEY"),
    app_key=os.getenv("DD_APPLICATION_KEY"),
    site=os.getenv("DD_SITE", "datadoghq.com"),  # required for non-prod sites (e.g. datad0g.com, datadoghq.eu)
    project_name="<project>",
    agentless_enabled=True,  # required when not running behind the dd-agent
)
# Note: ml_app is not a separate input. The SDK derives it from project_name
# when not supplied. If a user really wants to override it later, they can
# add `ml_app="..."` to enable() themselves.

dataset = LLMObs.create_dataset(
    dataset_name="<name>",
    description="<optional>",
    records=[
        # Per-record `tags` MUST be a list of "key:value" strings (e.g. "env:smoke"),
        # never bare strings: the SDK rejects malformed tags with a ValueError on append.
        {"input_data": {"<k>": "<v>"}, "expected_output": "<v>", "metadata": {}, "tags": ["env:<env>"]},
        # ...
    ],
)

# OR
dataset = LLMObs.create_dataset_from_csv(
    csv_path="<path>",
    dataset_name="<name>",
    input_data_columns=["<col1>", "<col2>"],
    expected_output_columns=["<col>"],
)

# OR pull an existing Datadog dataset by name (no local file needed)
dataset = LLMObs.pull_dataset(
    dataset_name="<name>",
    project_name="<project>",  # optional; defaults to the project on enable()
    version=2,  # optional; pin a version, omit for the latest
)


def task_fn(input_data: dict, config: dict):
    # TODO(user): replace with your actual LLM call
    ...


# Plain function evaluator (default style)
def exact_match(input_data, output_data, expected_output) -> bool:
    return output_data == expected_output


experiment = LLMObs.experiment(
    name="<experiment_name>",
    dataset=dataset,
    task=task_fn,
    evaluators=[exact_match],
    config={
        "model": "gpt-4o-mini",
        "temperature": 0.0,
        # Provenance also lives in `config` so it renders in the
        # experiment's Configuration view alongside model/temperature.
        # `tags=` below only reaches metadata.tags, which the current UI
        # does not surface as chips; config is what users actually see.
        "generated_by": "claude-code",
        "skill": "llm-obs-experiment-py-bootstrap",
    },
    description="<optional>",
    tags={
        # Same provenance, sent to experiment metadata.tags for any future
        # tag-filter UI / API consumers. Always emitted alongside the
        # config copy, never one without the other.
        "generated_by": "claude-code",
        "skill": "llm-obs-experiment-py-bootstrap",
    },
)
experiment.run(jobs=10)
print(experiment.url)
```
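Because the provenance keys have to appear in both `config` and `tags`, a generator could build the two dicts from one shared constant. The sketch below is illustrative only: `with_provenance` is a hypothetical helper, not part of the SDK, and it assumes user-supplied keys should never override the provenance values.

```python
PROVENANCE = {
    "generated_by": "claude-code",
    "skill": "llm-obs-experiment-py-bootstrap",
}


def with_provenance(config: dict, tags: dict) -> tuple:
    """Return copies of config and tags that both carry the provenance keys."""
    # Provenance is merged last so a user-supplied key cannot silently replace it.
    return {**config, **PROVENANCE}, {**tags, **PROVENANCE}
```

The returned pair can be passed straight through as the `config=` and `tags=` arguments shown in the canonical call.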
Evaluator Styles
Generated code uses one of three evaluator surfaces, picked by `--evaluator-style`. Whichever style is chosen, prefer returning `EvaluatorResult` over a bare value whenever the evaluator has any signal beyond the raw value; see "Return EvaluatorResult, not bare values" below.
Return `EvaluatorResult`, not bare values
Evaluators are allowed to return bare values, and the SDK accepts them, but `EvaluatorResult` carries fields the Datadog UI surfaces in ways the raw value cannot:
| Field | Type | Used by Datadog UI for |
|---|---|---|
| `value` | `JSONType` | The score itself, shown on the experiment metric. Required. |
| `reasoning` | `str` | Per-record explanation shown in the compare UI; lets reviewers see why an evaluator passed/failed without re-running the LLM. |
| `assessment` | `str` (e.g. `"pass"` / `"fail"`) | Determines whether a metric trend going up vs. down is an improvement; the UI uses this to color baseline-vs-candidate comparisons. |
| `metadata` | `dict` | Free-form per-record context; shown in record drill-down. |
| `tags` | | Used to slice experiment results in the UI. |
The generated code should default to `EvaluatorResult` for any evaluator richer than a one-line equality check. Trivial checks such as the `exact_match` shown below are the only cases where a bare `bool` is acceptable.
`function` (default; what the notebooks use)
Plain Python functions with the signature `(input_data, output_data, expected_output)`. Always emit at least three: a trivial boolean (returns `bool`), a richer rule-based one (returns `EvaluatorResult`), and an LLM-as-Judge surrogate (a `RemoteEvaluator` reference or a placeholder).
```python
from ddtrace.llmobs import EvaluatorResult


# Trivial check: a bare bool is fine here, the result has no extra signal.
def exact_match(input_data, output_data, expected_output) -> bool:
    return output_data == expected_output


# Richer check: use EvaluatorResult so reasoning/assessment surface in the UI.
def response_well_formed(input_data, output_data, expected_output) -> EvaluatorResult:
    if not isinstance(output_data, str):
        return EvaluatorResult(
            value=False,
            reasoning=f"output_data was {type(output_data).__name__}, expected str",
            assessment="fail",
        )
    if len(output_data) > 500:
        return EvaluatorResult(
            value=False,
            reasoning=f"output exceeded 500 chars (was {len(output_data)})",
            assessment="fail",
            metadata={"length": len(output_data)},
        )
    return EvaluatorResult(value=True, assessment="pass")
```
`class` (advanced; for evaluators that need state or async I/O)
Always return `EvaluatorResult` from `evaluate()`, never a bare value. State-bearing evaluators usually have richer reasoning to surface anyway.
```python
from ddtrace.llmobs import BaseEvaluator, EvaluatorContext, EvaluatorResult


class FaithfulnessJudge(BaseEvaluator):
    def __init__(self):
        super().__init__(name="faithfulness")
        # TODO(user): initialize any client or state here

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # context exposes: input_data, output_data, expected_output, metadata
        # TODO(user): replace placeholder logic with your faithfulness check
        passed = context.output_data is not None
        return EvaluatorResult(
            value=1.0 if passed else 0.0,
            reasoning="placeholder: replace with your faithfulness rubric",
            assessment="pass" if passed else "fail",
            metadata={"evaluator_version": "v1"},
        )
```
`remote` (LLM-as-Judge running server-side)
```python
from ddtrace.llmobs import RemoteEvaluator

# Create the judge in Datadog UI first: LLM Observability → Evaluations → New Evaluator
quality_judge = RemoteEvaluator(eval_name="<name-from-datadog-ui>")

# Optional: customize the payload the judge receives
custom_judge = RemoteEvaluator(
    eval_name="<name>",
    transform_fn=lambda ctx: {
        "question": ctx.input_data.get("question"),
        "answer": ctx.output_data,
        "reference": ctx.expected_output,
    },
)
```
Generated File Structure
Both formats use the same section sequence. In `py` the sections become comment banners; in `ipynb` each becomes one markdown cell + one code cell.
1. Env setup — load_dotenv(), os.getenv reads, hard assert keys present
2. LLMObs.enable() — explicit api_key/app_key/project_name/agentless_enabled
3. Dataset — inline records OR create_dataset_from_csv
4. Task function — placeholder OpenAI call with # TODO(user) marker
5. Evaluators — 2-3 in the requested style
6. Experiment — LLMObs.experiment(config={..., "generated_by": "claude-code", ...}, tags={"generated_by": "claude-code", ...})
7. Run — experiment.run(jobs=N); print(experiment.url)
8. Results inspection — experiment.as_dataframe() if pandas, else print
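Section 1 of the generated file asserts that the credentials are present before anything else runs. A minimal sketch of that check is shown below; the `missing_env` and `assert_env` helper names are hypothetical, and the `load_dotenv()` call from `python-dotenv` is left out so the snippet stays standard-library only.

```python
import os

# The two credentials read by LLMObs.enable() in the canonical signature.
REQUIRED_ENV = ("DD_API_KEY", "DD_APPLICATION_KEY")


def missing_env(names=REQUIRED_ENV, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def assert_env(names=REQUIRED_ENV, env=None):
    """Fail fast with a clear message when a credential is missing."""
    missing = missing_env(names, env)
    assert not missing, f"Missing required environment variables: {', '.join(missing)}"
```

Calling `assert_env()` at the top of the script (or in the first notebook cell) stops the run before `LLMObs.enable()` is reached with empty keys.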
Workflow
1. Parse arguments and apply the defaults from the Inputs table. Resolve the `--output` extension from `--format`.
   If `--project-name` is not provided, resolve a default of the form `experiment-<service-name>` by walking these sources in order, taking the first match:
- → (PEP 621) or .
- → .
- → first argument to .
- → (useful when the LLM app lives in a TS/JS monorepo Python package).
- The basename of the current working directory, lowercased and slugified ( — replace non-matching chars with ).
   The final project name is `experiment-<service-name>`. Strip a leading `experiment-` from `<service-name>` if it already starts with one (so a package literally named `experiment-foo` yields `experiment-foo`, not `experiment-experiment-foo`). If none of the five sources resolve to a non-empty string, fall back to the fixed default name and emit a warning in the next-steps output that the user should set `--project-name` explicitly.
   Embed the resolved name as a string literal in the generated `LLMObs.enable(...)` line. Don't emit runtime lookups, since the user may run the file from a different directory than where the skill resolved it.
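The prefixing and de-duplication rule above can be sketched as follows. This is an illustration under stated assumptions: the `slugify` rule (lowercase, replace anything outside `[a-z0-9-]` with `-`) is a guess at the intended pattern, it is applied to every candidate rather than only the directory basename, and `resolve_project_name` with its `fallback` parameter is a hypothetical helper.

```python
import re
from typing import Iterable, Optional

PREFIX = "experiment-"


def slugify(name: str) -> str:
    # Assumed rule: lowercase, turn runs of other characters into "-", trim "-".
    return re.sub(r"[^a-z0-9-]+", "-", name.lower()).strip("-")


def resolve_project_name(candidates: Iterable[Optional[str]], fallback: str) -> str:
    """Take the first non-empty candidate and return experiment-<service-name>."""
    for candidate in candidates:
        slug = slugify(candidate or "")
        # Avoid experiment-experiment-foo when the name already carries the prefix.
        if slug.startswith(PREFIX):
            slug = slug[len(PREFIX):]
        if slug:
            return PREFIX + slug
    return fallback
```

The caller would pass the values found in each source, in priority order, with `None` or an empty string for sources that did not resolve.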
2. Resolve the dataset source. Error out if both `--dataset` and `--dataset-name` are passed; they're mutually exclusive.
   - `--dataset` (local file → inline records or CSV loader):
     - Read the file. If JSON, validate that the top level is an array of records of shape (`input_data`, optional `expected_output`, `metadata`, `tags`). If CSV, parse the header and auto-detect columns using the heuristics: `prompt|input|query|question` → input, `expected|gold|truth|answer` → expected.
     - Run a PII scrub (email/phone/SSN/API-key regexes) on all string values; replace matches with a redaction placeholder and surface a warning listing affected indices.
     - For JSON datasets, embed the records inline in the generated file (`LLMObs.create_dataset(records=[...])`) so the user has a single self-contained artifact. For CSV datasets, emit `LLMObs.create_dataset_from_csv(csv_path="<absolute path>", ...)` and tell the user the CSV needs to be present at runtime.
   - `--dataset-name` (existing Datadog dataset → runtime pull):
     - Emit `LLMObs.pull_dataset(dataset_name="<name>", project_name="<project>"[, version=<n>])` in place of any `create_dataset` call. The fetch happens when the generated experiment runs; the skill itself does not call Datadog.
     - Pass `version=` through only if `--dataset-version` was set; otherwise omit it so the SDK resolves the latest.
     - Add a one-line comment above the call documenting what's being pulled, e.g. `# Pulled from Datadog: dataset_name="qa_v3", version=latest`.
     - Skip the PII scrub and the inline-records emission; there are no local records to scrub.
   - Neither flag given:
     - Fall back to the inline 3-record sample described under the `--dataset` default, so the generated file remains runnable as-is.
   Note on dataset IDs. The public SDK's `pull_dataset` takes a name, not an ID, so there is no ID-based flag. If a user only has a dataset ID from a Datadog UI URL, the workflow is: open that URL in the UI, copy the dataset name, and pass it as `--dataset-name`. The skill must not import `ddtrace.llmobs._experiment` or any other underscore module to work around this.
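The CSV header heuristics and the PII scrub from the local-file branch could be sketched like this. The two header regexes are the ones given in the workflow; everything else is an assumption made for illustration: the PII patterns are deliberately narrow examples, `[REDACTED]` is a stand-in for whatever placeholder the skill uses, and `detect_columns`, `scrub`, and `scrub_records` are hypothetical helper names.

```python
import re

INPUT_HEADER = re.compile(r"prompt|input|query|question", re.IGNORECASE)
EXPECTED_HEADER = re.compile(r"expected|gold|truth|answer", re.IGNORECASE)

# Illustrative patterns only; real scrubbing needs broader coverage.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),        # email address
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),           # US SSN
    re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),   # US phone number
    re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),         # API-key-like token
]
MASK = "[REDACTED]"  # assumed placeholder


def detect_columns(header):
    """Split CSV header names into (input columns, expected-output columns)."""
    inputs, expected = [], []
    for col in header:
        # Expected is checked first so "expected_answer" is not taken as input.
        if EXPECTED_HEADER.search(col):
            expected.append(col)
        elif INPUT_HEADER.search(col):
            inputs.append(col)
    return inputs, expected


def scrub(value):
    """Recursively mask PII-like substrings; returns (clean_value, hit_count)."""
    if isinstance(value, str):
        hits = 0
        for pattern in PII_PATTERNS:
            value, count = pattern.subn(MASK, value)
            hits += count
        return value, hits
    if isinstance(value, dict):
        cleaned = {key: scrub(item) for key, item in value.items()}
        return {k: v for k, (v, _) in cleaned.items()}, sum(n for _, n in cleaned.values())
    if isinstance(value, list):
        cleaned = [scrub(item) for item in value]
        return [v for v, _ in cleaned], sum(n for _, n in cleaned)
    return value, 0


def scrub_records(records):
    """Return (scrubbed records, indices of records that had matches)."""
    cleaned, affected = [], []
    for index, record in enumerate(records):
        clean, hits = scrub(record)
        cleaned.append(clean)
        if hits:
            affected.append(index)
    return cleaned, affected
```

The `affected` list is what would feed the warning that names the scrubbed record indices.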
3. Pick the evaluator template based on `--evaluator-style`:
   - `function`: 3 plain functions: one trivial boolean (`exact_match`-style, bare `bool` OK), one richer rule-based check returning `EvaluatorResult` with `reasoning` + `assessment`, and one LLM-as-Judge surrogate. If `--dataset` had structured `expected_output`, add a JSON-shape check (also returning `EvaluatorResult`).
   - `class`: 2 `BaseEvaluator` subclasses with `evaluate(self, context: EvaluatorContext) -> EvaluatorResult`. Always return `EvaluatorResult` (never a bare value); state-bearing evaluators have richer signal to surface.
   - `remote`: 1-2 `RemoteEvaluator(eval_name=...)` instances with a comment instructing the user to create the judge in the Datadog UI first.
   In all styles: any evaluator with non-trivial logic must return `EvaluatorResult`, populating at minimum `value` + `reasoning` + `assessment` (see the "Return `EvaluatorResult`, not bare values" section). The compare UI uses `reasoning` for per-record drill-downs and `assessment` to determine whether a metric trend is an improvement.
4. Emit the file.
   For `py`: single file, one blank line between sections, banner comments like:
   ```python
   # ─── 3. Dataset ───────────────────────────────────────────────────────────────
   ```
   Use `from __future__ import annotations` and `from typing import Any, Dict` at the top. Type-hint task and evaluator function signatures.
   For `ipynb`: valid Jupyter notebook JSON. Schema:
   ```json
   {
     "cells": [
       {"cell_type": "markdown", "metadata": {}, "source": ["## 1. Env setup\n", "..."]},
       {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["..."]},
       ...
     ],
     "metadata": {
       "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
       "language_info": {"name": "python", "version": "3.10"}
     },
     "nbformat": 4,
     "nbformat_minor": 5
   }
   ```
   One markdown cell + one code cell per section. Keep each code cell self-contained enough that re-running it in isolation makes sense.
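A notebook in the shape above can be assembled with the standard library alone. The sketch below mirrors the schema shown in this step and nothing more; `build_notebook` is a hypothetical helper, and it takes the section titles and code as plain strings.

```python
import json


def build_notebook(sections):
    """Build an nbformat-4 notebook dict from (title, code) pairs."""
    cells = []
    for title, code in sections:
        # One markdown cell followed by one code cell per section.
        cells.append({"cell_type": "markdown", "metadata": {}, "source": [f"## {title}\n"]})
        cells.append({
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": code.splitlines(keepends=True),
        })
    return {
        "cells": cells,
        "metadata": {
            "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
            "language_info": {"name": "python", "version": "3.10"},
        },
        "nbformat": 4,
        "nbformat_minor": 5,
    }


def write_notebook(path, sections):
    """Serialize the notebook to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_notebook(sections), handle, indent=1)
```

Keeping `source` as a list of lines matches how Jupyter itself stores cells, which keeps diffs of the generated notebook readable.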
5. Best-effort syntax check via Bash. Don't fail the skill if the toolchain is missing; just report.
   - `py`: `python -m py_compile <path>`
   - `ipynb`: `python -c "import json; nb = json.load(open('<path>')); assert nb.get('cells'); print(f'cells={len(nb[\"cells\"])}')"`
6. Print next-steps (see Output section).
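The two syntax checks in step 5 are shell one-liners; the same best-effort logic can be sketched in Python as below. `syntax_check` is a hypothetical helper, and the mapping of I/O errors to a "skipped" status is an assumption about how "don't fail, just report" might be applied.

```python
import json
import py_compile


def syntax_check(path: str) -> str:
    """Return 'pass...', 'fail: ...', or 'skipped: ...' without raising."""
    try:
        if path.endswith(".ipynb"):
            with open(path, encoding="utf-8") as handle:
                notebook = json.load(handle)
            if not notebook.get("cells"):
                return "fail: notebook has no cells"
            return f"pass: cells={len(notebook['cells'])}"
        # doraise=True turns syntax errors into PyCompileError instead of printing them.
        py_compile.compile(path, doraise=True)
        return "pass"
    except (py_compile.PyCompileError, json.JSONDecodeError) as exc:
        return f"fail: {exc}"
    except OSError as exc:
        return f"skipped: {exc}"
```

The returned string maps onto the `Syntax check:` line of the output summary.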
What the Generated Code MUST NOT Do
A reviewer should be able to run these pattern checks against the generated file and get zero matches:
| Pattern | Why it's wrong |
|---|---|
| | The ID is minted by the SDK; never generate it client-side. |
| | Status state machine and dataset diff are SDK responsibilities. |
| `ddtrace.llmobs._experiment` and other underscore-prefixed modules | Private import paths. Always use `from ddtrace.llmobs import ...`. |
| (as dict keys in records) | The SDK owns them. |
| `DD_API_KEY = "<actual key>"` | Always read from `os.getenv`. |
| | The skill produces SDK-only code. Direct HTTP calls bypass the SDK's lazy creation, push-diff, and bulk-threshold handling. |
If any of those slip into the output, the skill is wrong — re-emit.
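A zero-match check of this kind can be automated with a small scanner. The sketch below covers only three of the rules, and its regexes are assumptions written for illustration (the `requests`/`httpx` names in particular are examples of direct HTTP clients, not a list taken from this skill); `scan` is a hypothetical helper.

```python
import re

# Illustrative patterns; extend to cover the full table above.
FORBIDDEN = {
    "private ddtrace.llmobs import": re.compile(r"ddtrace\.llmobs\._\w+"),
    "hardcoded DD_API_KEY": re.compile(r"""DD_API_KEY\s*=\s*["'][^"']+["']"""),
    "direct HTTP client import": re.compile(r"^\s*(?:import|from)\s+(?:requests|httpx)\b"),
}


def scan(source: str):
    """Return (line number, rule label) for every forbidden pattern found."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in FORBIDDEN.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

An empty result means the file passes these three checks; any finding is a signal to re-emit the file.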
Output
After writing, print:
Generated SDK experiment: <format>
Path: <path>
Lines: <count> (or Cells: <count> for .ipynb)
SDK calls used:
✓ LLMObs.enable(...) (line/cell ~<N>)
✓ LLMObs.<create_dataset|create_dataset_from_csv|pull_dataset>(...) (line/cell ~<N>)
✓ task_fn(input_data, config) (line/cell ~<N>)
✓ <N> evaluators (style: <function|class|remote>)
✓ LLMObs.experiment(...).run(jobs=<N>) (line/cell ~<N>)
✓ Provenance (in config + tags): generated_by=claude-code, skill=llm-obs-experiment-py-bootstrap
Syntax check: <pass | skipped: toolchain missing | fail with details>
Install:
pip install "ddtrace>=4.7" python-dotenv openai
Environment variables (required at runtime):
export DD_API_KEY=...
export DD_APPLICATION_KEY=...
export DD_SITE=datadoghq.com
export OPENAI_API_KEY=... # only if you keep the placeholder task
Run:
python <path> # for --format py
jupyter notebook <path> # for --format ipynb
Next steps:
1. Replace the placeholder task_fn with your actual LLM call.
2. Adjust the evaluators (or wire up RemoteEvaluator names you created in the Datadog UI).
3. Run it. The script prints experiment.url at the end.
4. Watch the experiment: https://app.datadoghq.com/llm/experiments
Reference Notebook Patterns (use as templates)
The canonical set lives at
https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks and serves as the style reference — the generated code should feel like it could have come from this set.
| Notebook | Pattern demonstrated |
|---|---|
| | Dataset create/append/push lifecycle |
| `01-basic-experiments.ipynb` | Minimum viable experiment: inline records, OpenAI task, 2 boolean evaluators |
| | CSV-loaded dataset, multi-value task output, confidence-based evaluators |
| `04-multi-span-experiments.ipynb` | Two-step LLM pipelines inside a single task |
| `07-remote-evaluators.ipynb` | `RemoteEvaluator` with custom `transform_fn` |
When `--evaluator-style remote` is set, lean toward the `07-remote-evaluators.ipynb` style. When `--dataset` is a CSV, lean toward the CSV-loaded dataset notebook. The default (no `--dataset`, `--evaluator-style function`) is the `01-basic-experiments.ipynb` style.
Datadog Documentation
These are the canonical reference pages on
https://docs.datadoghq.com/. Use them to ground answers about LLM Observability features and to look up details that aren't covered in this skill.
| Topic | URL | Use when |
|---|---|---|
| LLM Observability overview | https://docs.datadoghq.com/llm_observability/ | Establishing what the product covers, terminology |
| Setup | https://docs.datadoghq.com/llm_observability/setup/ | API/app key creation, project + ml_app setup, region/site selection |
| Instrumentation overview | https://docs.datadoghq.com/llm_observability/instrumentation/ | Auto-instrumentation, manual SDK usage, span model |
| Python SDK reference | https://docs.datadoghq.com/llm_observability/instrumentation/sdk/ | Public symbol list, decorator semantics, span kinds, annotate/enable signatures |
| Experiments | https://docs.datadoghq.com/llm_observability/experiments/ | , dataset lifecycle, eval streaming, status states |
| Evaluations | https://docs.datadoghq.com/llm_observability/evaluations/ | Evaluator concepts, managed vs custom evaluators |
| Custom LLM-as-a-judge evaluations | https://docs.datadoghq.com/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/ | payload shape and rubric design |
| Managed evaluations | https://docs.datadoghq.com/llm_observability/evaluations/managed_evaluations/ | Pre-built judges (faithfulness, toxicity, etc.) |
| Monitoring | https://docs.datadoghq.com/llm_observability/monitoring/ | Alerts, dashboards, span-level monitors |
| Terms / glossary | https://docs.datadoghq.com/llm_observability/terms/ | Span kinds, sessions, traces, ml_app |
| Evaluation developer guide | https://docs.datadoghq.com/llm_observability/guide/evaluation_developer_guide/ | Writing offline evaluators, validation strategy |
| Claude Code skills guide | https://docs.datadoghq.com/llm_observability/guide/claude_code_skills/ | How this skill fits alongside the rest of the set |
| MCP server | https://docs.datadoghq.com/llm_observability/mcp_server/ | Connecting MCP-compatible clients to LLM Obs data |
| Reference notebooks (GitHub) | https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks | Style examples for the generated `.py` / `.ipynb` |
Researching features the skill does not cover
If the user asks about an LLM Observability feature the skill's body doesn't address (e.g., specific span kinds, dataset versioning semantics, an evaluator type not covered above), fetch the relevant page from docs.datadoghq.com rather than guessing:
- Pick the most specific URL from the table above. Most LLM Obs questions resolve under `/llm_observability/{experiments,evaluations,instrumentation,monitoring}/`.
- Fetch that URL with a focused query (e.g., "How does Dataset.push() handle the 5 MB threshold?"). Prefer fetching the page directly over generic web search; the canonical page is almost always under `docs.datadoghq.com/llm_observability/`.
- Fall back to a web search restricted to `site:docs.datadoghq.com/llm_observability` if you don't know which subpage owns the topic.
- Cite the page in the answer with its URL so the user can verify and bookmark.
Never invent symbols or behaviors not present in this skill body or the docs above. If the docs don't cover the question either, say so explicitly and suggest filing an issue on `DataDog/llm-observability` rather than fabricating a workaround.
Operating Rules
- SDK only. No direct HTTP calls, no manual JSON:API envelope construction, no manual ID generation. If a feature seems to require those, you're solving the wrong problem; the SDK already covers it.
- Public imports only. `from ddtrace.llmobs import ...`. Never `ddtrace.llmobs._experiment` or any other underscore-prefixed module.
- Env vars, not literals. Credentials are always read from `os.getenv`. The generated script (or the env-setup cell) must assert they're set, with a clear message.
- Always pass `site` to `LLMObs.enable()`. Read it from `os.getenv("DD_SITE", "datadoghq.com")`. Omitting `site` silently defaults to US1 prod, which breaks every non-prod org (e.g. staging `datad0g.com`, `datadoghq.eu`). The canonical signature already includes it; never drop it.
- Per-record `tags` are `key:value` strings. When inlining records (whether from JSON, CSV, or the default sample), each entry in a record's `tags` list must be a string like `env:smoke`. Bare strings trigger `ValueError: Tag '<name>' is malformed.` at append time. If the source data has bare-string tags, namespace them rather than dropping them.
- `# TODO(user)` markers on the placeholder task and on at least one evaluator so reviewers can't ship the placeholder by accident.
- Match notebook conventions. Plain function evaluators by default; class-based only when the user opts in. Print `experiment.url` at the end of every generated file.
- Tag every experiment with provenance, in both `config` and `tags`. Every `LLMObs.experiment(...)` call must carry `"generated_by": "claude-code"` and `"skill": "llm-obs-experiment-py-bootstrap"` as keys in both the `config` dict (so they render in the experiment's Configuration view, which is where users actually look) and the `tags` dict (which the SDK serializes into `metadata.tags` for future tag-filter consumers). The `tags` path alone is not enough: the current LLM Experiments UI does not surface `metadata.tags` as filterable chips, so users won't see the provenance unless it's also in `config`. If a user later edits the generated file to add their own keys, they extend both dicts and never replace the provenance keys silently.
- PII scrub at the door. If `--dataset` is given, scrub before inlining into the generated file. Never embed a record that contains an unmasked email/phone/SSN/API-key pattern.
- Don't generate dependency or environment setup files. Print the `pip install` command in the next-steps message instead; most users already have a venv.
- No silent fallbacks. If a flag value is unsupported, error out with the valid choices.
- Python only. If a user passes a flag requesting another language, error out; this skill produces Python SDK code only.
- Research, don't invent. If the user asks about an LLM Observability feature, span kind, evaluator type, or SDK symbol that is not documented in this skill body, consult the relevant `docs.datadoghq.com/llm_observability/*` page (see the Datadog Documentation table above for the canonical URLs) before answering. Cite the page URL in the response. If the docs don't cover the topic, say so explicitly; never fabricate symbols, flags, or behaviors.
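The tag rule above (keep `key:value` strings, namespace bare strings instead of dropping them) can be sketched as a small normalizer. Both the `normalize_tags` helper and the `label` namespace key are hypothetical choices made for this example; the skill may use a different key.

```python
def normalize_tags(tags, namespace="label"):
    """Keep well-formed key:value tags; wrap bare strings as <namespace>:<tag>."""
    normalized = []
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep and key and value:
            normalized.append(tag)
        else:
            # Bare or half-empty tag: namespace it rather than drop it.
            normalized.append(f"{namespace}:{tag.strip(':')}")
    return normalized
```

Running every record's `tags` list through a step like this before inlining is one way to keep malformed tags out of the generated file.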