model-evaluation


Model Evaluation Code Generator


Generate a Jupyter notebook that evaluates a SageMaker fine-tuned model using LLM-as-Judge via sagemaker-python-sdk v3.

Principles


  1. One thing at a time. Each response advances exactly one decision. Never combine multiple questions or recommendations in a single turn.
  2. Confirm before proceeding. Wait for the user to agree before moving to the next step. You are a guide, not a runaway train.
  3. Don't read files until you need them. Only read reference files when you've reached the workflow step that requires them and the user has confirmed the direction. Never read ahead.
  4. No narration. Don't explain what you're about to do or what you just did. Share outcomes and ask questions. Keep responses short and focused.
  5. No repetition. If you said something before a tool call, don't repeat it after. Only share new information.

Workflow


Step 0: Check for prior context


Before starting the conversation, silently check for `workflow_state.json` in the project directory. If it exists, read it and remember any useful information (such as the model package ARN, model package group name, training job name, and dataset paths).
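The prior-context check can be sketched as a small helper. The filename comes from the step above; the function name and return convention are illustrative:

```python
import json
from pathlib import Path

def load_workflow_state(project_dir):
    """Return the parsed workflow_state.json from project_dir, or {} if absent.

    Useful keys, if present: model package ARN, model package group name,
    training job name, dataset paths.
    """
    path = Path(project_dir) / "workflow_state.json"
    if not path.exists():
        return {}
    return json.loads(path.read_text())
```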

Step 1: Understand the task


For this step, you need: what task the model is trained to do. If you know this already, skip this step. If not, ask the user:
"What task is this model trained to do?"
⏸ Wait for user.

Step 2: Get evaluation dataset


For this step, you need: the evaluation dataset S3 path. If you know this already, skip this step. If not, ask the user:
"Where's your evaluation dataset stored in S3?"
⏸ Wait for user.

Step 3: Understand the data


For this step, you need: to understand what the data looks like to inform metric recommendations. If you already know what the data looks like, skip this step. If not, ask the user:
"Can you tell me a bit about your evaluation dataset — what format is it in, and what do the input/output fields look like?"
If the user isn't sure, offer to peek at the data:
"May I read a few records of your dataset to help inform my recommendations?"
If they say yes, use the AWS tool to call `s3api get-object` with a `Range` header to read the first few KB. If you fail to get a sample, move on and rely on the user's description.
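A minimal sketch of that ranged read, assuming boto3 credentials are configured; `parse_s3_uri` and `peek_s3_object` are illustrative helper names:

```python
def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

def peek_s3_object(uri, nbytes=4096):
    """Fetch the first nbytes of an S3 object via a ranged GET."""
    import boto3  # deferred so parse_s3_uri works without AWS dependencies
    bucket, key = parse_s3_uri(uri)
    resp = boto3.client("s3").get_object(
        Bucket=bucket, Key=key, Range=f"bytes=0-{nbytes - 1}"
    )
    return resp["Body"].read().decode("utf-8", errors="replace")
```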

Step 4: Validate dataset format


If the evaluation dataset was already validated via the dataset-evaluation skill earlier in the conversation, skip this step.
Otherwise, activate the dataset-evaluation skill to validate it. If it fails, offer to activate the dataset-transformation skill to convert it. Do not proceed until the dataset is valid.

Step 5: Check for custom metrics


For this step, you need: whether the user has predefined custom metrics.
"Do you have predefined custom metrics you'd like to use? If so, they must follow the Bedrock custom metrics format: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html
If not, no worries — I can recommend built-in metrics for your task."
⏸ Wait for user.
  • If the user has custom metrics → Read `references/llmaaj-custom-evaluation.md` and follow its instructions to collect and validate the metrics JSON.
  • If the user does not have custom metrics → Move to Step 6.

Step 6: Select built-in metrics

步骤6:选择内置指标

For this step, you need: user agreement on which built-in metrics to use (if any).
If the user provided custom metrics in Step 5, ask whether they also want built-in metrics:
"Would you also like to include any built-in metrics alongside your custom ones?"
If they say no, skip to Step 7.
For built-in metric selection, read `references/llmaaj-builtin-evaluation.md` and follow its instructions.
⏸ Wait for user to confirm metrics.

Step 7: Resolve Model Package ARN


For this step, you need: the Model Package ARN of the fine-tuned model.
Use this priority order:
  1. Model Package ARN from workflow state or conversation: If you already have a model package ARN from Step 0 (workflow state) or from earlier in the conversation, confirm it with the user and move on.
  2. Ask the user: If you don't have the ARN, ask:
    "What's the Model Package ARN (or group name) of your fine-tuned model?" If they provide a group name, resolve the ARN by calling `list-model-packages` via the AWS tool with the group name. Use the latest version's `ModelPackageArn` from the response.
Validate the resolved ARN (whether from API lookup, workflow state, or user input):
  • A valid versioned model package ARN looks like: `arn:aws:sagemaker:REGION:ACCOUNT:model-package/NAME/VERSION`
  • If the ARN contains `:model-package-group/`, the user provided a group ARN, not a package ARN. Resolve it using the lookup in #2.
  • If the ARN contains `:model-package/` but does NOT end with a version number (e.g., `/1`), resolve it: extract the group name from the ARN and use the lookup in #2.
  • If it contains `/DataSet/`, `/TrainingJob/`, or other non-model-package resource types, flag it: "That looks like a [Dataset/TrainingJob] ARN, not a model package ARN. Could you double-check?"
  • Verify the ARN exists before proceeding by calling `describe-model-package` via the AWS tool. If this fails, tell the user the ARN wasn't found and ask them to double-check.
⏸ Wait for confirmation before proceeding.
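The validation rules above can be expressed as a small classifier. This is a sketch; the exact ARN shapes should be checked against the SageMaker documentation:

```python
import re

# A versioned model package ARN: region, 12-digit account, name, numeric version.
_PACKAGE_VERSIONED = re.compile(
    r"^arn:aws:sagemaker:[a-z0-9-]+:\d{12}:model-package/[^/]+/\d+$"
)

def classify_model_package_arn(arn):
    """Return one of: 'valid', 'group', 'unversioned', 'wrong-resource'."""
    if ":model-package-group/" in arn:
        return "group"          # group ARN: resolve via list-model-packages
    if ":model-package/" in arn:
        # package ARN, but only usable if it ends in a version number
        return "valid" if _PACKAGE_VERSIONED.match(arn) else "unversioned"
    return "wrong-resource"     # e.g. a DataSet or TrainingJob ARN
```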

Step 8: Select judge model

步骤8:选择裁判模型

For this step, you need: which judge model to use for evaluation. This step always runs — both built-in and custom metrics require a judge model.
Read `references/supported-judge-models.md` for the canonical list, selection guidance, and validation steps.
Before presenting options, run the validation checks from the reference doc against the user's account and region. Only include models that pass all checks.
Present the available models as a numbered list:
"Here are the judge models available in your region:
  1. [model A]
  2. [model B] ...
Which model would you like to use? Please type the exact model name from the above list."
EXTREMELY IMPORTANT: NEVER recommend or suggest any particular model based on the context you have. YOU ARE ALLOWED ONLY to display the list of models. DO NOT add your own recommendation or suggestion after displaying the list.
⏸ Wait for user to confirm.

Step 9: Collect remaining parameters


For this step, you need: AWS Region and S3 output path. For each value you don't already have, ask one at a time.
⏸ Wait for each answer before asking the next.

Step 10: Confirm configuration


Summarize everything and ask for approval:
"Here's the evaluation setup:
  • Task: [task]
  • Dataset: [path]
  • Custom metrics: [Yes — N metrics / No]
  • Built-in metrics: [list, or None]
  • Judge: [model]
  • Model Package ARN: [arn]
  • Region: [region]
  • S3 output: [path]
Your fine-tuned model will automatically be compared against its base model.
Does this look right?"
⏸ Wait for user approval.

Step 11: Bedrock Evaluations agreement


This step is mandatory. Do not skip it. Do not proceed without explicit user confirmation.
Before generating the notebook, present the following agreement language:
Important: Amazon Bedrock Evaluations Terms
This feature is powered by Amazon Bedrock Evaluations. Your use of this feature is subject to pricing of Amazon Bedrock Evaluations, the Service Terms applicable to Amazon Bedrock, and the terms that apply to your usage of third-party models. Amazon Bedrock Evaluations may securely transmit data across AWS Regions within your geography for processing. For more information, access Amazon Bedrock Evaluations documentation.
Do you acknowledge and agree to proceed?
Hard stop. Wait for the user to explicitly confirm. Acceptable responses include "yes", "I agree", "proceed", "ok", or similar affirmative statements. If the user asks questions about the terms, answer them, then re-ask for confirmation. Do NOT generate the notebook until the user has confirmed.

Step 12: Generate notebook


If a project directory already exists (from earlier in the workflow), use it. Otherwise, activate the directory-management skill to set one up.
Check for existing notebooks in `<project-name>/notebooks/`. Then ask:
"Would you like to append to an existing notebook, or create a new one: `<project-name>/notebooks/<project-name>_model-evaluation.ipynb`?"
⏸ Wait for user.
Before writing the notebook, read:
  • references/notebook_structure.md (cell order, placeholders, JSON formatting)
  • scripts/notebook_cells.py (all cell code templates)

Step 13: Provide run instructions


To run:
1. Cell 1 — configuration and SDK install
2. Cell 2 — start evaluation
3. Cell 3 — polls status automatically (~25-60 min)
4. Cell 4 — show base vs custom model comparison
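Cell 3's polling loop looks roughly like this sketch. It assumes a boto3 `bedrock` client and that the notebook exposes the evaluation job ARN; the status names are the Bedrock API's, which may differ from the "Executing" label shown in some UIs:

```python
import time

# Statuses after which the job can make no further progress (assumed set).
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def is_terminal(status):
    """True once the evaluation job has finished, failed, or been stopped."""
    return status in TERMINAL_STATUSES

def wait_for_evaluation(client, job_arn, poll_seconds=60):
    """Poll a Bedrock evaluation job until it reaches a terminal status."""
    while True:
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)  # long inference phases are normal here
```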

Notes


  • Not all models support serverless evaluation. If job fails with "DownstreamServiceUnavailable", the model doesn't have evaluation recipes.
  • Jobs that appear stuck in "Executing" are normal; inference takes 15-30+ minutes.
  • For faster iteration, use a small dataset (5-10 examples).
  • Known working models: DeepSeek R1 Distilled Qwen 32B
  • Expected duration: small model (<10B) 25-40 min, large model (>30B) 40-60 min, with base comparison 2x.

FAQ


Q: Can I use benchmarks or custom scorer evaluations? A: Not yet — this skill currently supports LLM-as-Judge evaluations only (built-in and custom metrics). Benchmark and custom scorer support will be added in a future version. In the meantime, you can set these up through the SageMaker console or refer to the SageMaker evaluation documentation.
Q: Can I combine custom and built-in metrics in the same evaluation? A: Yes. You can use up to 10 custom metrics alongside any number of built-in metrics in a single evaluation job.

Troubleshooting


Evaluation job fails with "access denied when attempting to assume role"


The Bedrock evaluation job needs to assume your IAM role, which requires `bedrock.amazonaws.com` in the role's trust policy. This is common when running from a local IDE with temporary or SSO credentials.
To check, inspect your current role's trust policy using the AWS MCP tool:
  1. Use the AWS MCP tool `get-caller-identity` (STS service) to get your current role ARN.
  2. Extract the role name from the ARN (the part after `role/` or `assumed-role/`).
  3. Use the AWS MCP tool `get-role` (IAM service) with the role name, and extract `Role.AssumeRolePolicyDocument` from the response.
Look for `bedrock.amazonaws.com` in `Principal.Service`. If it's missing, either add it to the trust policy or switch to a role that already trusts Bedrock (e.g., your SageMaker execution role).
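The three-step check can be scripted with boto3. `role_trusts_bedrock` is an illustrative helper; the live calls mirror the MCP tool calls above:

```python
def role_trusts_bedrock(trust_policy):
    """True if bedrock.amazonaws.com appears in any statement's Principal.Service."""
    for stmt in trust_policy.get("Statement", []):
        service = stmt.get("Principal", {}).get("Service", [])
        if isinstance(service, str):  # a single service may appear as a bare string
            service = [service]
        if "bedrock.amazonaws.com" in service:
            return True
    return False

def current_role_trusts_bedrock():
    """Resolve the caller's role and inspect its trust policy."""
    import boto3  # deferred so the pure helper above runs without AWS dependencies
    arn = boto3.client("sts").get_caller_identity()["Arn"]
    role_name = arn.split("/")[1]  # handles role/NAME and assumed-role/NAME/SESSION
    doc = boto3.client("iam").get_role(RoleName=role_name)["Role"]["AssumeRolePolicyDocument"]
    return role_trusts_bedrock(doc)
```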

Helping a user find their Model Package ARN


If the user doesn't know their model package ARN and can only provide partial info (dataset ARN, training job name, etc.), guide them through these steps:
  1. Ask for keywords from the model or training job name (e.g., "medication-simplification").
  2. Search model package groups via the AWS tool: `list-model-package-groups` with `name-contains <keyword>`.
  3. List packages in the group via the AWS tool: `list-model-packages` with the group name.
  4. Verify the match via the AWS tool: `describe-model-package` with the ARN. Check that the `S3Uri` in `InferenceSpecification.Containers` matches the expected training output path.
Always confirm the resolved ARN with the user before proceeding.
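Steps 2-3 can be sketched with boto3 (`find_candidate_arns` is an illustrative name; it returns only the newest package per matching group, leaving step 4's verification to the caller):

```python
def latest_package_arn(packages):
    """Pick the most recently created package from a list_model_packages summary list."""
    newest = max(packages, key=lambda p: p["CreationTime"])
    return newest["ModelPackageArn"]

def find_candidate_arns(keyword, region=None):
    """Search model package groups by keyword; return {group_name: latest_package_arn}."""
    import boto3  # deferred so the pure helper above runs without AWS dependencies
    sm = boto3.client("sagemaker", region_name=region)
    groups = sm.list_model_package_groups(NameContains=keyword)["ModelPackageGroupSummaryList"]
    candidates = {}
    for group in groups:
        name = group["ModelPackageGroupName"]
        pkgs = sm.list_model_packages(ModelPackageGroupName=name)["ModelPackageSummaryList"]
        if pkgs:
            candidates[name] = latest_package_arn(pkgs)
    return candidates
```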