nemo-data-designer-plugin

Before You Start

Do not explore the workspace first. The workflow's Learn step gives you everything you need.

Goal

Build a synthetic dataset using the Data Designer library that matches this description:
$ARGUMENTS

Workflow

Use Autopilot mode if the user implies they don't want to answer questions — e.g., they say something like "be opinionated", "you decide", "make reasonable assumptions", "just build it", "surprise me", etc. Otherwise, use Interactive mode (default).
Read only the workflow file that matches the selected mode, then follow it:
  • Interactive → read `workflows/interactive.md`
  • Autopilot → read `workflows/autopilot.md`
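
The mode choice above can be approximated in code. This is only a rough sketch: the function name `select_workflow` and the `AUTOPILOT_HINTS` tuple are hypothetical, and plain substring matching is a crude stand-in for judging whether the user "implies" they don't want questions.

```python
# Hypothetical helper: the phrase list and function name are illustrative only.
AUTOPILOT_HINTS = (
    "be opinionated",
    "you decide",
    "make reasonable assumptions",
    "just build it",
    "surprise me",
)


def select_workflow(user_message: str) -> str:
    """Return the workflow file to read for the given user message."""
    text = user_message.lower()
    if any(hint in text for hint in AUTOPILOT_HINTS):
        return "workflows/autopilot.md"
    # Interactive is the default when no autopilot signal is present.
    return "workflows/interactive.md"


print(select_workflow("Just build it, you decide the columns"))
print(select_workflow("I need a dataset of support tickets"))
```

Only one of the two files is returned, which mirrors the instruction to read only the workflow file for the selected mode.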

Rules

  • Keep all columns in the output by default. The only exceptions for dropping a column are: (1) the user explicitly asks, or (2) it is a helper column that exists solely to derive other columns (e.g., a sampled person object used to extract name, city, etc.). When in doubt, keep the column.
  • Do not suggest or ask about seed datasets. Only use one when the user explicitly provides seed data or asks to build from existing records. When using a seed, read `references/seed-datasets.md`.
  • When the dataset requires person data (names, demographics, addresses), read `references/person-sampling.md`.
  • If a dataset script that matches the dataset description already exists, ask the user whether to edit it or create a new one.
  • For commands and context specific to this NeMo Platform plugin (e.g., sourcing model configs from IGW providers or in-script `ModelConfig`s, installing or publishing Nemotron Personas locales, platform-side resource pointers), read `references/nemo-platform-plugin-additions.md`.

Usage Tips and Common Pitfalls

  • Sampler and validation columns need both a type and params. E.g., `sampler_type="category"` with `params=dd.CategorySamplerParams(...)`.
  • Jinja2 templates in `prompt`, `system_prompt`, and `expr` fields: reference columns with `{{ column_name }}`, nested fields with `{{ column_name.field }}`.
  • `SamplerColumnConfig`: Takes `params`, not `sampler_params`.
  • LLM judge score access: `LLMJudgeColumnConfig` produces a nested dict where each score name maps to `{reasoning: str, score: int}`. To get the numeric score, use the `.score` attribute. For example, for a judge column named `quality` with a score named `correctness`, use `{{ quality.correctness.score }}`. Using `{{ quality.correctness }}` returns the full dict, not the numeric score.
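
The nested shape described in the judge-score tip can be mimicked with a plain dictionary. The sketch below does not call Data Designer; the row contents and the helper name `get_score` are made up for illustration, and it only shows why stopping at the score name yields a dict rather than a number.

```python
# Illustrative row only: the reasoning text and score value are invented.
row = {
    "quality": {
        "correctness": {"reasoning": "Answer matches the reference.", "score": 4},
    }
}


def get_score(record: dict, judge_column: str, score_name: str) -> int:
    """Return the numeric score, the analogue of {{ quality.correctness.score }}."""
    return record[judge_column][score_name]["score"]


# Analogue of {{ quality.correctness }}: the whole dict, reasoning included.
full_entry = row["quality"]["correctness"]

print(type(full_entry).__name__)
print(get_score(row, "quality", "correctness"))
```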

Troubleshooting

  • `nemo data-designer` CLI not found: Tell the user that `nemo data-designer` is not installed in this environment (requires Python >= 3.11). Ask if they would like you to create a virtual environment and install it, or if they prefer to do it themselves. Do not install anything without the user's permission.
  • Network errors during preview: A sandbox environment may be blocking outbound requests. Ask the user for permission to retry the command with the sandbox disabled. Only as a last resort, if retrying outside the sandbox also fails, tell the user to run the command themselves.
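
A quick pre-flight check along the lines of the first troubleshooting item might look like the sketch below. It assumes the CLI is exposed as an executable named `nemo` on `PATH` (inferred from the `nemo data-designer` command name, not confirmed here), and the helper name `check_environment` is hypothetical. It only reports status; it installs nothing.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the troubleshooting note


def check_environment(executable: str = "nemo") -> dict:
    """Report whether the CLI is on PATH and whether Python is new enough."""
    return {
        "cli_found": shutil.which(executable) is not None,
        "python_ok": sys.version_info >= MIN_PYTHON,
    }


status = check_environment()
if not status["cli_found"]:
    print("CLI not found on PATH; ask the user before installing anything.")
if not status["python_ok"]:
    print("Python is older than 3.11.")
```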

Output Template

Write a Python file to the current directory with a `load_config_builder()` function returning a `DataDesignerConfigBuilder`. Name the file descriptively (e.g., `customer_reviews.py`). Use PEP 723 inline metadata for dependencies.

```python
# /// script
# dependencies = [
#   "data-designer", # always required
#   "pydantic", # only if this script imports from pydantic
#   # add additional dependencies here
# ]
# ///

import data_designer.config as dd
from pydantic import BaseModel, Field


# Use Pydantic models when the output needs to conform to a specific schema
class MyStructuredOutput(BaseModel):
    field_one: str = Field(description="...")
    field_two: int = Field(description="...")


# Use custom generators when built-in column types aren't enough
@dd.custom_column_generator(
    required_columns=["col_a"],
    side_effect_columns=["extra_col"],
)
def generator_function(row: dict) -> dict:
    # add custom logic here that depends on "col_a" and update row in place
    row["name_in_custom_column_config"] = "custom value"
    row["extra_col"] = "extra value"
    return row


def load_config_builder() -> dd.DataDesignerConfigBuilder:
    config_builder = dd.DataDesignerConfigBuilder(
        # Declaring model configs programmatically here is the portable path:
        # it works for both local `run` and cluster `submit`, while the local
        # YAML registry alternative only works for `run`. The provider below
        # is a common default created during `nemo setup`; confirm it (or
        # discover others) with `nemo inference providers list`. See
        # references/nemo-platform-plugin-additions.md for the local-YAML alternative.
        model_configs=[
            dd.ModelConfig(
                alias="text",
                model="...",
                provider="default/nvidia-build",
                inference_parameters=dd.ChatCompletionInferenceParams(),
            ),
        ],
    )

    # Seed dataset (only if the user explicitly mentions a seed dataset path)
    # config_builder.with_seed_dataset(dd.LocalFileSeedSource(path="path/to/seed.parquet"))

    # config_builder.add_column(...)
    # config_builder.add_processor(...)

    return config_builder
```

Only include Pydantic models, custom generators, seed datasets, and extra dependencies when the task requires them. Prefer including `model_configs` when the dataset uses LLM columns — declaring it in the script keeps the config portable between local `run` and cluster `submit`, while the local YAML registry alternative only works for `run`.