nemo-data-designer-plugin
Before You Start
Do not explore the workspace first. The workflow's Learn step gives you everything you need.
Goal
Build a synthetic dataset using the Data Designer library that matches this description:
$ARGUMENTS
Workflow
Use Autopilot mode if the user implies they don't want to answer questions — e.g., they say something like "be opinionated", "you decide", "make reasonable assumptions", "just build it", "surprise me", etc. Otherwise, use Interactive mode (default).
Read only the workflow file that matches the selected mode, then follow it:
- Interactive → read `workflows/interactive.md`
- Autopilot → read `workflows/autopilot.md`
Rules
- Keep all columns in the output by default. The only exceptions for dropping a column are: (1) the user explicitly asks, or (2) it is a helper column that exists solely to derive other columns (e.g., a sampled person object used to extract name, city, etc.). When in doubt, keep the column.
- Do not suggest or ask about seed datasets. Only use one when the user explicitly provides seed data or asks to build from existing records. When using a seed, read `references/seed-datasets.md`.
- When the dataset requires person data (names, demographics, addresses), read `references/person-sampling.md`.
- If a dataset script that matches the dataset description already exists, ask the user whether to edit it or create a new one.
- For commands and context specific to this NeMo Platform plugin (e.g., sourcing model configs from IGW providers or in-script `ModelConfig`s, installing or publishing Nemotron Personas locales, platform-side resource pointers), read `references/nemo-platform-plugin-additions.md`.
Usage Tips and Common Pitfalls
- **Sampler and validation columns need both a type and params.** E.g., `sampler_type="category"` with `params=dd.CategorySamplerParams(...)`.
- **Jinja2 templates** in `prompt`, `system_prompt`, and `expr` fields: reference columns with `{{ column_name }}`, nested fields with `{{ column_name.field }}`.
- **`SamplerColumnConfig`:** Takes `params`, not `sampler_params`.
- **LLM judge score access:** `LLMJudgeColumnConfig` produces a nested dict where each score name maps to `{reasoning: str, score: int}`. To get the numeric score, use the `.score` attribute. For example, for a judge column named `quality` with a score named `correctness`, use `{{ quality.correctness.score }}`. Using `{{ quality.correctness }}` returns the full dict, not the numeric score.
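The difference between the two judge expressions can be sketched in plain Python. This is only an illustration under stated assumptions: the values in `row` are made up, and `resolve` is a hypothetical stand-in for a template engine's dotted lookup, not part of Data Designer.

```python
# Hypothetical row: a judge column named "quality" with one score named "correctness".
row = {
    "quality": {
        "correctness": {"reasoning": "Answer matches the reference.", "score": 4},
    }
}


def resolve(path: str, context: dict):
    """Walk a dotted path such as 'quality.correctness.score' through nested dicts."""
    value = context
    for key in path.split("."):
        value = value[key]
    return value


full = resolve("quality.correctness", row)           # the whole {reasoning, score} dict
numeric = resolve("quality.correctness.score", row)  # just the integer score

print(full)
print(numeric)
```

The same distinction applies inside a template: stopping at the score name yields the whole dict, while appending `.score` yields the number.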
Troubleshooting
- **`nemo data-designer` CLI not found:** Tell the user that `nemo data-designer` is not installed in this environment (requires Python >= 3.11). Ask if they would like you to create a virtual environment and install it, or if they prefer to do it themselves. Do not install anything without the user's permission.
- **Network errors during preview:** A sandbox environment may be blocking outbound requests. Ask the user for permission to retry the command with the sandbox disabled. Only as a last resort, if retrying outside the sandbox also fails, tell the user to run the command themselves.
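A read-only check along the following lines can help diagnose the "CLI not found" case before asking the user anything. It is a minimal sketch that assumes the `nemo data-designer` command is provided by an executable named `nemo`; the helper name `check_environment` is hypothetical, and nothing is installed or modified.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the troubleshooting note


def check_environment(executable: str = "nemo") -> dict:
    """Report whether Python is new enough and whether the CLI is on PATH.

    Assumption: the `nemo data-designer` command ships as an executable
    called `nemo`. This function only inspects the environment.
    """
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "cli_found": shutil.which(executable) is not None,
    }


status = check_environment()
if not status["cli_found"]:
    print("nemo CLI not found; ask the user before creating a venv or installing anything.")
if not status["python_ok"]:
    print("Python is older than 3.11; the CLI requires a newer interpreter.")
```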
Output Template
Write a Python file to the current directory with a `load_config_builder()` function returning a `DataDesignerConfigBuilder`. Name the file descriptively (e.g., `customer_reviews.py`). Use PEP 723 inline metadata for dependencies.

```python
# /// script
# dependencies = [
#   "data-designer",  # always required
#   "pydantic",  # only if this script imports from pydantic
#   # add additional dependencies here
# ]
# ///
import data_designer.config as dd
from pydantic import BaseModel, Field


# Use Pydantic models when the output needs to conform to a specific schema
class MyStructuredOutput(BaseModel):
    field_one: str = Field(description="...")
    field_two: int = Field(description="...")


# Use custom generators when built-in column types aren't enough
@dd.custom_column_generator(
    required_columns=["col_a"],
    side_effect_columns=["extra_col"],
)
def generator_function(row: dict) -> dict:
    # add custom logic here that depends on "col_a" and update row in place
    row["name_in_custom_column_config"] = "custom value"
    row["extra_col"] = "extra value"
    return row


def load_config_builder() -> dd.DataDesignerConfigBuilder:
    config_builder = dd.DataDesignerConfigBuilder(
        # Declaring model configs programmatically here is the portable path:
        # it works for both local `run` and cluster `submit`, while the local
        # YAML registry alternative only works for `run`. The provider below
        # is a common default created during `nemo setup`; confirm it (or
        # discover others) with `nemo inference providers list`. See
        # references/nemo-platform-plugin-additions.md for the local-YAML alternative.
        model_configs=[
            dd.ModelConfig(
                alias="text",
                model="...",
                provider="default/nvidia-build",
                inference_parameters=dd.ChatCompletionInferenceParams(),
            ),
        ],
    )

    # Seed dataset (only if the user explicitly mentions a seed dataset path)
    # config_builder.with_seed_dataset(dd.LocalFileSeedSource(path="path/to/seed.parquet"))

    # config_builder.add_column(...)
    # config_builder.add_processor(...)

    return config_builder
```

Only include Pydantic models, custom generators, seed datasets, and extra dependencies when the task requires them. Prefer including `model_configs` when the dataset uses LLM columns: declaring it in the script keeps the config portable between local `run` and cluster `submit`, while the local YAML registry alternative only works for `run`.