# AWS Bedrock Evaluation Jobs

## Overview

Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
## Pre-computed Inference vs Live Inference

| Mode | How it works | Use when |
|---|---|---|
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |

Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.
## Pipeline

```
Design Scenarios → Collect Responses → Upload to S3 → Run Eval Job → Parse Results → Act on Findings
       |                  |                |              |               |               |
 scenarios.json     Your app's API    s3://bucket/   create-         s3 sync +      Fix prompt,
 (multi-turn)       → dataset JSONL   datasets/      evaluation-job  parse JSONL    retune metrics
```
## Agent Behavior: Gather Inputs and Show Cost Estimate

Before generating any configs, scripts, or artifacts, you MUST gather the following from the user:

- AWS Region — Which region to use (default: `us-east-1`). Affects model availability and pricing.
- Target model — The model their application uses (e.g., `amazon.nova-lite-v1:0`, `anthropic.claude-3-haiku`).
- Evaluator (judge) model — The model to score responses (e.g., `amazon.nova-pro-v1:0`). Should be at least as capable as the target.
- Application type — Brief description of what the app does. Used to design test scenarios and derive custom metrics.
- Number of test scenarios — How many they plan to test (recommend 13-20 for a first run).
- Estimated JSONL entries — Derived from scenarios x avg turns per scenario.
- Number of metrics — Total (built-in + custom). Recommend starting with 6 built-in + 3-5 custom.
- S3 bucket — Existing bucket name or confirm creation of a new one.
- IAM role — Existing role ARN or confirm creation of a new one.
### Cost Estimate

After gathering inputs, you MUST display a cost estimate before proceeding:
## Estimated Cost Summary
| Item | Details | Est. Cost |
|------|---------|-----------|
| Response collection | {N} prompts x ~{T} tokens x {target_model_price} | ${X.XX} |
| Evaluation job | {N} prompts x {M} metrics x ~1,700 tokens x {judge_model_price} | ${X.XX} |
| S3 storage | < 1 MB | < $0.01 |
| **Total per run** | | **~${X.XX}** |
Scaling: Each additional run costs ~${X.XX}. Adding 1 custom metric adds ~${Y.YY}/run.
Cost formulas:

- Response collection: `num_prompts x avg_input_tokens x input_price + num_prompts x avg_output_tokens x output_price`
- Evaluation job: `num_prompts x num_metrics x ~1,500 input_tokens x judge_input_price + num_prompts x num_metrics x ~200 output_tokens x judge_output_price`
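The judge-cost formula can be sketched as a small helper. The ~1,500 input / ~200 output token figures are the rough per-call averages from the formula above, not measured values, and actual billing may round differently:

```typescript
// Rough judge-side cost estimate for an eval job.
// Prices are per 1M tokens; token averages are assumptions from the formula above.
function estimateJudgeCost(
  numPrompts: number,
  numMetrics: number,
  judgeInputPricePer1M: number,
  judgeOutputPricePer1M: number,
  avgInputTokens = 1500,
  avgOutputTokens = 200
): number {
  const calls = numPrompts * numMetrics; // one judge call per prompt x metric
  const inputCost = (calls * avgInputTokens * judgeInputPricePer1M) / 1_000_000;
  const outputCost = (calls * avgOutputTokens * judgeOutputPricePer1M) / 1_000_000;
  return inputCost + outputCost;
}

// 30 prompts x 10 metrics judged by Nova Pro ($0.80 in / $3.20 out per 1M)
const est = estimateJudgeCost(30, 10, 0.8, 3.2);
console.log(est.toFixed(2)); // ~0.55
```

This lands close to the worked example later in this document; small differences come from rounding the token averages.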
Model pricing reference:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| amazon.nova-lite-v1:0 | $0.06 | $0.24 |
| amazon.nova-pro-v1:0 | $0.80 | $3.20 |
| anthropic.claude-3-haiku | $0.25 | $1.25 |
| anthropic.claude-3-sonnet | $3.00 | $15.00 |
## Prerequisites

```bash
# AWS CLI 2.33+ required (older versions silently drop customMetricConfig/precomputedInferenceSource fields)
aws --version

# Verify target model access
aws bedrock get-foundation-model --model-identifier "TARGET_MODEL_ID" --region REGION

# Verify evaluator model access
aws bedrock get-foundation-model --model-identifier "EVALUATOR_MODEL_ID" --region REGION
```
Good evaluator model choices: `amazon.nova-pro-v1:0` or `anthropic.claude-3-sonnet`. The evaluator should be at least as capable as your target model.
## Step 1: Design Test Scenarios
List the application's functional areas (e.g., greeting, booking-flow, error-handling, etc.). Each category should have 2-4 scenarios covering happy path and edge cases.
Scenario JSON format:

```json
[
  {
    "id": "greeting-known-user",
    "category": "greeting",
    "context": { "userId": "user-123" },
    "turns": ["hello"]
  },
  {
    "id": "multi-step-flow",
    "category": "core-flow",
    "context": { "userId": "user-456" },
    "turns": [
      "hello",
      "I need help with X",
      "yes, proceed with that",
      "thanks"
    ]
  }
]
```
The `context` field holds any session/user data your app needs. Each turn in the `turns` array is one user message; the collection step handles the multi-turn conversation loop.
Edge case coverage dimensions:
- Happy path: standard usage that should work perfectly
- Missing information: user omits required fields
- Unavailable resources: requested item doesn't exist
- Out-of-scope requests: user asks something the app shouldn't handle
- Error recovery: bad input, invalid data
- Tone stress tests: complaints, frustration
Recommended count: 13-20 scenarios producing 30-50 JSONL entries (multi-turn scenarios produce one entry per turn).
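A quick sanity check over the scenario file can confirm category coverage and predict the JSONL entry count before any responses are collected. This sketch assumes the scenario shape shown above and the 2-4-scenarios-per-category guideline from this section:

```typescript
interface Scenario {
  id: string;
  category: string;
  context: Record<string, unknown>;
  turns: string[];
}

// Each turn produces one JSONL entry, so total entries = sum of all turns.
function summarizeScenarios(scenarios: Scenario[]) {
  const perCategory: Record<string, number> = {};
  let totalEntries = 0;
  for (const s of scenarios) {
    perCategory[s.category] = (perCategory[s.category] ?? 0) + 1;
    totalEntries += s.turns.length;
  }
  // Guideline from this section: 2-4 scenarios per category.
  const thinCategories = Object.entries(perCategory)
    .filter(([, n]) => n < 2)
    .map(([c]) => c);
  return { perCategory, totalEntries, thinCategories };
}
```

Run it against `scenarios.json` before collection to catch categories with only a single scenario.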
## Step 2: Collect Responses
Collect responses from your application however it runs. The goal is to produce a JSONL dataset file where each line contains the prompt, the model's response, and metadata.
### Example pattern: Converse API with tool-calling loop (TypeScript)

This applies when your application uses Bedrock with tool calling:

```typescript
import {
  BedrockRuntimeClient,
  ConverseCommand,
  type Message,
  type SystemContentBlock,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

// executeTool is your application's own tool dispatcher.
declare function executeTool(name: string, input: unknown): Promise<unknown>;

async function converseLoop(
  messages: Message[],
  systemPrompt: SystemContentBlock[],
  tools: any[]
): Promise<string> {
  const MAX_TOOL_ROUNDS = 10;
  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const response = await client.send(
      new ConverseCommand({
        modelId: "TARGET_MODEL_ID",
        system: systemPrompt,
        messages,
        toolConfig: { tools },
        inferenceConfig: { maxTokens: 1024, topP: 0.9, temperature: 0.7 },
      })
    );
    const assistantContent = response.output?.message?.content as any[];
    if (!assistantContent) return "[No response from model]";
    messages.push({ role: "assistant", content: assistantContent });

    const toolUseBlocks = assistantContent.filter(
      (block: any) => block.toolUse != null
    );
    if (toolUseBlocks.length === 0) {
      // No tool calls left — return the assistant's final text.
      return assistantContent
        .filter((block: any) => block.text != null)
        .map((block: any) => block.text as string)
        .join("\n") || "[Empty response]";
    }

    // Execute each requested tool and feed the results back as a user turn.
    const toolResultBlocks: any[] = [];
    for (const block of toolUseBlocks) {
      const { toolUseId, name, input } = block.toolUse;
      const result = await executeTool(name, input);
      toolResultBlocks.push({
        toolResult: { toolUseId, content: [{ json: result }] },
      });
    }
    messages.push({ role: "user", content: toolResultBlocks } as Message);
  }
  return "[Max tool rounds exceeded]";
}
```
Multi-turn handling: Maintain the `messages` array across turns and build the dataset `prompt` field with conversation history:
```typescript
// Inside the per-scenario loop: `scenario`, `systemPrompt`, `tools`, and
// `entries` come from the surrounding collection harness.
const messages: Message[] = [];
const conversationHistory: { role: string; text: string }[] = [];

for (let i = 0; i < scenario.turns.length; i++) {
  const userTurn = scenario.turns[i];
  messages.push({ role: "user", content: [{ text: userTurn }] });
  const assistantText = await converseLoop(messages, systemPrompt, tools);
  conversationHistory.push({ role: "user", text: userTurn });

  // Build the dataset prompt from history up to and including the current
  // user turn; the assistant's reply being judged is recorded separately
  // as the response, so it must not also appear in the prompt.
  let prompt: string;
  if (i === 0) {
    prompt = userTurn;
  } else {
    prompt = conversationHistory
      .map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.text}`)
      .join("\n");
  }

  conversationHistory.push({ role: "assistant", text: assistantText });

  entries.push({
    prompt,
    category: scenario.category,
    referenceResponse: "",
    modelResponses: [
      { response: assistantText, modelIdentifier: "my-app-v1" },
    ],
  });
}
```
### Dataset JSONL Format

Each line must have this structure:

```json
{
  "prompt": "User question or multi-turn history",
  "referenceResponse": "",
  "modelResponses": [
    {
      "response": "The model's actual output text",
      "modelIdentifier": "my-app-v1"
    }
  ]
}
```
| Field | Required | Notes |
|---|---|---|
| `prompt` | Yes | User input. For multi-turn, concatenate: `User: ...\nAssistant: ...\nUser: ...` |
| `referenceResponse` | No | Expected/ideal response. Can be an empty string. Needed for `Builtin.Correctness` and `Builtin.Completeness` to work properly. Maps to the `{{ground_truth}}` template variable |
| `modelResponses` | Yes | Array with exactly one entry for pre-computed inference |
| `modelResponses[0].response` | Yes | The model's actual output text |
| `modelResponses[0].modelIdentifier` | Yes | Any string label. Must match `inferenceSourceIdentifier` in inference-config.json |

Constraints: One model response per prompt. One unique `modelIdentifier` per job. Max 1000 prompts per job.
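These constraints are cheap to check before uploading. A sketch of a pre-upload validator, assuming the entry shape shown above:

```typescript
interface DatasetEntry {
  prompt: string;
  referenceResponse: string;
  modelResponses: { response: string; modelIdentifier: string }[];
}

// Validates the pre-computed-inference constraints over raw JSONL lines.
function validateDataset(lines: string[]): string[] {
  const errors: string[] = [];
  const identifiers = new Set<string>();
  if (lines.length > 1000) errors.push("max 1000 prompts per job");
  lines.forEach((line, i) => {
    const entry: DatasetEntry = JSON.parse(line);
    if (entry.modelResponses.length !== 1) {
      errors.push(`line ${i + 1}: exactly one modelResponse required`);
    }
    entry.modelResponses.forEach((r) => identifiers.add(r.modelIdentifier));
  });
  if (identifiers.size > 1) {
    errors.push("all entries must share one modelIdentifier");
  }
  return errors;
}
```

Running this before `aws s3 cp` catches the identifier-mismatch and entry-count failures that otherwise only surface after the job starts.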
Write JSONL:

```typescript
import { writeFileSync } from "fs";

const jsonl = entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
writeFileSync("datasets/collected-responses.jsonl", jsonl, "utf-8");
```
## Step 3: Design Metrics

### Built-In Metrics

Bedrock provides 11 built-in metrics requiring no configuration beyond listing them by name:
| Metric Name | What It Measures |
|---|---|
| `Builtin.Correctness` | Is the factual content accurate? (works best with `referenceResponse`) |
| `Builtin.Completeness` | Does the response fully cover the request? (works best with `referenceResponse`) |
| `Builtin.Faithfulness` | Is the response faithful to the provided context/source? |
| `Builtin.Helpfulness` | Is the response useful, actionable, and cooperative? |
| `Builtin.Coherence` | Is the response logically structured and easy to follow? |
| `Builtin.Relevance` | Does the response address the actual question? |
| `Builtin.FollowingInstructions` | Does the response follow explicit instructions in the prompt? |
| `Builtin.ProfessionalStyleAndTone` | Is spelling, grammar, and tone appropriate? |
| `Builtin.Harmfulness` | Does the response contain harmful content? |
| `Builtin.Stereotyping` | Does the response contain stereotypes or bias? |
| `Builtin.Refusal` | Does the response appropriately refuse harmful requests? |

Score interpretation: `1` = best, `0` = worst, `null` = N/A (judge could not evaluate).

Note: `referenceResponse` is needed for `Builtin.Correctness` and `Builtin.Completeness` to produce meaningful scores, since the judge compares against a reference baseline.
### When to Use Custom Metrics

Use custom metrics to check domain-specific behaviors the built-in metrics don't cover. If you find yourself thinking "this scored well on Helpfulness but violated a critical business rule" — that's a custom metric.

Technique: Extract rules from your system prompt. Every rule in your system prompt is a candidate metric:

| System prompt says | Candidate metric |
|---|---|
| "Keep responses to 2-3 sentences max" | `response_brevity` |
| "Always greet returning users by name" | `personalized_greeting` |
| "Never proceed without user confirmation" | `confirmation_check` |
| "Ask for missing details, don't assume" | `missing_info_followup` |
### Custom Metric JSON Anatomy

```json
{
  "customMetricDefinition": {
    "metricName": "my_metric_name",
    "instructions": "You are evaluating ... \n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
```
| Field | Details |
|---|---|
| `metricName` | Snake_case identifier. Must appear in BOTH the `metricNames` array AND the `customMetrics` array |
| `instructions` | Full prompt sent to the judge. Must include the `{{prompt}}` and `{{prediction}}` template variables. Can also use `{{ground_truth}}` (maps to `referenceResponse`). Input variables must come last in the prompt. |
| `ratingScale` | Array of rating levels. Each has a `definition` (label, max 5 words / 100 chars) and a `value` with either `floatValue` or `stringValue` |

Official constraints:
- Max 10 custom metrics per job
- Instructions max 5000 characters
- Rating `definition` max 5 words / 100 characters
- Input variables (`{{prompt}}`, `{{prediction}}`, `{{ground_truth}}`) must come last in the instruction text
### Complete Custom Metric Example

A metric that checks whether the assistant follows a domain-specific rule, with N/A handling for irrelevant prompts:

```json
{
  "customMetricDefinition": {
    "metricName": "confirmation_check",
    "instructions": "You are evaluating an assistant application. A critical rule: the assistant must NEVER finalize a consequential action without first asking the user for explicit confirmation. Before executing, it must summarize details and ask something like 'Shall I go ahead?'.\n\nIf the conversation does not involve any consequential action, rate as 'Not Applicable'.\n\n- Not Applicable: No consequential action in this response\n- Poor: Proceeds with action without asking for confirmation\n- Good: Asks for confirmation before finalizing the action\n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "N/A", "value": { "floatValue": -1 } },
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
```
When the judge selects N/A (`floatValue: -1`), Bedrock records `"result": null`. Your parser must handle `null` — treat it as N/A and exclude it from averages.
### Rating Scale Design

- 3-4 levels for quality scales (Poor/Acceptable/Good/Excellent)
- 2 levels for binary checks (Poor/Good)
- Add an "N/A" level with `floatValue: -1` for conditional metrics that only apply to certain prompt types
- Rating values can use `floatValue` (numeric) or `stringValue` (text)
### Tips for Writing Metric Instructions
- Be explicit about what "good" and "bad" look like — include examples of phrases or behaviors
- For conditional metrics, describe the N/A condition clearly so the judge doesn't score 0 when it should skip
- Keep instructions under ~500 words to fit within context alongside prompt and response
- Test with a few examples before running a full eval job
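One cheap way to test instructions before running a full job is to render the template locally with a sample prompt/response pair and inspect (or manually send to the judge model) exactly what the judge would receive. This substitution helper assumes the `{{prompt}}`/`{{prediction}}`/`{{ground_truth}}` variable names used by custom metrics:

```typescript
// Substitutes custom-metric template variables into the instruction text,
// so you can inspect the exact prompt the judge would be given.
function renderInstructions(
  instructions: string,
  vars: { prompt: string; prediction: string; ground_truth?: string }
): string {
  return instructions
    .replace(/\{\{prompt\}\}/g, vars.prompt)
    .replace(/\{\{prediction\}\}/g, vars.prediction)
    .replace(/\{\{ground_truth\}\}/g, vars.ground_truth ?? "");
}
```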
## Step 4: AWS Infrastructure

### S3 Bucket

```bash
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="my-eval-${ACCOUNT_ID}-${REGION}"

# us-east-1 does not accept LocationConstraint
if [ "${REGION}" = "us-east-1" ]; then
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}"
else
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}" \
    --create-bucket-configuration LocationConstraint="${REGION}"
fi
```
Upload the dataset:

```bash
aws s3 cp datasets/collected-responses.jsonl \
  "s3://${BUCKET_NAME}/datasets/collected-responses.jsonl"
```
### IAM Role

Trust policy (must include the `aws:SourceAccount` condition — Bedrock rejects the role without it):
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "YOUR_ACCOUNT_ID"
        }
      }
    }
  ]
}
```
Permissions policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DatasetRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET",
        "arn:aws:s3:::YOUR_BUCKET/datasets/*"
      ]
    },
    {
      "Sid": "S3ResultsWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET/results/*"]
    },
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": [
        "arn:aws:bedrock:REGION::foundation-model/EVALUATOR_MODEL_ID"
      ]
    }
  ]
}
```
Replace `YOUR_BUCKET`, `REGION`, and `EVALUATOR_MODEL_ID` (and `YOUR_ACCOUNT_ID` in the trust policy) with actual values.
Create the role:

```bash
ROLE_NAME="BedrockEvalRole"
ROLE_ARN=$(aws iam create-role \
  --role-name "${ROLE_NAME}" \
  --assume-role-policy-document file://trust-policy.json \
  --description "Allows Bedrock to run evaluation jobs" \
  --query "Role.Arn" --output text)

aws iam put-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-name "BedrockEvalPolicy" \
  --policy-document file://permissions-policy.json
```
## Step 5: Configure and Run Eval Job

### eval-config.json

```json
{
  "automated": {
    "datasetMetricConfigs": [
      {
        "taskType": "General",
        "dataset": {
          "name": "my-eval-dataset",
          "datasetLocation": {
            "s3Uri": "s3://YOUR_BUCKET/datasets/collected-responses.jsonl"
          }
        },
        "metricNames": [
          "Builtin.Helpfulness",
          "Builtin.FollowingInstructions",
          "Builtin.ProfessionalStyleAndTone",
          "Builtin.Relevance",
          "Builtin.Completeness",
          "Builtin.Correctness",
          "my_custom_metric_1",
          "my_custom_metric_2"
        ]
      }
    ],
    "evaluatorModelConfig": {
      "bedrockEvaluatorModels": [
        { "modelIdentifier": "EVALUATOR_MODEL_ID" }
      ]
    },
    "customMetricConfig": {
      "customMetrics": [
        {
          "customMetricDefinition": {
            "metricName": "my_custom_metric_1",
            "instructions": "... {{prompt}} ... {{prediction}} ...",
            "ratingScale": [
              { "definition": "Poor", "value": { "floatValue": 0 } },
              { "definition": "Good", "value": { "floatValue": 1 } }
            ]
          }
        }
      ],
      "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
          { "modelIdentifier": "EVALUATOR_MODEL_ID" }
        ]
      }
    }
  }
}
```
Critical structure notes:

- `taskType` must be `"General"` (not "Generation" or any other value)
- Custom metric names must appear in both the `metricNames` array AND the `customMetrics` array
- `evaluatorModelConfig` appears twice: once at the top level (for built-in metrics) and once inside `customMetricConfig` (for custom metrics) — both must specify the same evaluator model
- `modelIdentifier` must be the exact model ID string, matching across all configs
### inference-config.json

For pre-computed inference, this tells Bedrock that responses are already collected:

```json
{
  "models": [
    {
      "precomputedInferenceSource": {
        "inferenceSourceIdentifier": "my-app-v1"
      }
    }
  ]
}
```
The `inferenceSourceIdentifier` must match the `modelIdentifier` in your JSONL dataset's `modelResponses[0]`.
### Running the Job

```bash
aws bedrock create-evaluation-job \
  --job-name "my-eval-$(date +%Y%m%d-%H%M)" \
  --role-arn "${ROLE_ARN}" \
  --evaluation-config file://eval-config.json \
  --inference-config file://inference-config.json \
  --output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
  --region us-east-1
```
CLI notes:

- Required params: `--job-name`, `--role-arn`, `--evaluation-config`, `--inference-config`, `--output-data-config`
- Optional: `--job-description`
- `--job-name` constraint: `[a-z0-9](-*[a-z0-9]){0,62}` — lowercase + hyphens only, max 63 chars. Must be unique (use timestamps).
- `--evaluation-config` and `--inference-config` are document types — must use `file://` or inline JSON, no shorthand syntax
- `--output-data-config` is a structure — supports both inline JSON and shorthand (`s3Uri=...`)
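The job-name constraint can be checked locally before calling the CLI. This sketch applies the pattern quoted above:

```typescript
// Bedrock job-name pattern: lowercase alphanumerics and hyphens,
// must start and end with an alphanumeric, max 63 characters.
const JOB_NAME_PATTERN = /^[a-z0-9](-*[a-z0-9]){0,62}$/;

function isValidJobName(name: string): boolean {
  return name.length <= 63 && JOB_NAME_PATTERN.test(name);
}
```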
### Monitoring

```bash
# List evaluation jobs (with optional filters)
aws bedrock list-evaluation-jobs --region us-east-1
aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1
aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1

# Get details for a specific job
aws bedrock get-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1

# Cancel a running job
aws bedrock stop-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1
```
Jobs typically take 5-15 minutes for 30-50 entry datasets. If a job fails, check `failureMessages` in the job details.
## Step 6: Parse Results

### S3 Output Directory Structure

Bedrock writes results to a deeply nested path:

```
s3://YOUR_BUCKET/results/
└── <job-name>/
    └── <job-name>/
        ├── amazon-bedrock-evaluations-permission-check   ← empty sentinel
        └── <random-id>/
            ├── custom_metrics/                           ← metric definitions (NOT results)
            └── models/
                └── <model-identifier>/
                    └── taskTypes/General/datasets/<dataset-name>/
                        └── <uuid>_output.jsonl           ← actual results
```

The job name is repeated twice. The random ID changes every run. Use `aws s3 sync` — do not construct paths manually.
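After syncing, the result file can be located with a recursive search rather than a hand-built path. A sketch that walks the synced directory for files ending in `_output.jsonl`:

```typescript
import * as fs from "fs";
import * as path from "path";

// Recursively finds result files ending in `_output.jsonl` under a synced
// results directory, regardless of the nested job-name/random-id layers.
function findOutputFiles(dir: string): string[] {
  const hits: string[] = [];
  for (const name of fs.readdirSync(dir)) {
    const full = path.join(dir, name);
    if (fs.statSync(full).isDirectory()) {
      hits.push(...findOutputFiles(full));
    } else if (name.endsWith("_output.jsonl")) {
      hits.push(full);
    }
  }
  return hits;
}
```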
### Download Results

```bash
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1
```
### Result JSONL Format

Each line:

```json
{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Builtin.Helpfulness",
        "result": 0.6667,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "The response provides useful information..."
          }
        ]
      },
      {
        "metricName": "confirmation_check",
        "result": null,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "This conversation does not involve any consequential action..."
          }
        ]
      }
    ]
  },
  "inputRecord": {
    "prompt": "hello",
    "referenceResponse": "",
    "modelResponses": [
      { "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
    ]
  }
}
```
- `result` is a number (score) or `null` (N/A)
- `evaluatorDetails[0].explanation` contains the judge's written reasoning
### Parsing and Aggregation

```typescript
interface PromptResult {
  prompt: string;
  category: string;
  modelResponse: string;
  scores: Record<string, {
    score: string;      // numeric score as a string, or "N/A"
    reasoning?: string; // judge's written explanation
    rawScore?: number;  // numeric value when present, for averaging
  }>;
}

// `entry` is one parsed line of the results JSONL.
declare const entry: any;

const scores: PromptResult["scores"] = {};
for (const s of entry.automatedEvaluationResult.scores) {
  scores[s.metricName] = {
    score: s.result === null ? "N/A" : String(s.result),
    reasoning: s.evaluatorDetails?.[0]?.explanation,
    rawScore: typeof s.result === "number" ? s.result : undefined,
  };
}
```
Aggregation approach:
- Overall averages per metric — exclude N/A entries
- Per-category breakdown — group by category field, compute averages within each
- Low-score alerts — flag entries below threshold (built-in < 0.5, custom <= 0)
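The aggregation rules above can be sketched directly, with metric results as plain `number | null` values as in the result JSONL:

```typescript
// Per-metric averages that exclude N/A (null) results entirely.
function averageByMetric(
  rows: { metricName: string; result: number | null }[]
): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const { metricName, result } of rows) {
    if (result === null) continue; // N/A is excluded, not counted as 0
    const bucket = (sums[metricName] ??= { total: 0, count: 0 });
    bucket.total += result;
    bucket.count += 1;
  }
  const averages: Record<string, number> = {};
  for (const [name, { total, count }] of Object.entries(sums)) {
    averages[name] = total / count;
  }
  return averages;
}
```

The same grouping, keyed by `category` instead of `metricName`, gives the per-category breakdown.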
Low-score alert format:

```
[Builtin.Relevance] score=0.50 | "hello..."
  Reason: The response does not directly address the greeting...
[confirmation_check] score=0.00 | "User: proceed with X..."
  Reason: The assistant executed the action without asking for confirmation...
```
## Step 7: Eval-Fix-Reeval Loop

### Common Fixes

| Finding | Fix |
|---|---|
| Low brevity scores | Add hard constraint: "Respond in no more than 3 sentences." |
| Low confirmation_check | Add: "Before executing, summarize details and ask for confirmation." |
| Low missing_info_followup | Add: "If any required field is missing, ask for it. Do not assume." |
| Low tone on negative outcomes | Add empathy instructions for bad-news scenarios |
| Low Completeness on simple prompts | Metric/data issue — add a `referenceResponse` or exclude simple prompts from Completeness |
### Metric Refinement
- High N/A rates (>60%) — metric too narrowly scoped. Split dataset or adjust scope.
- All-high scores — instructions too lenient. Add specific failure criteria.
- Inconsistent scoring — instructions ambiguous. Add concrete examples per rating level.
### Run Comparison

```
Run 1 (baseline):   response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes): response_brevity avg=0.85, custom_tone avg=0.90
```

Track scores over time. The pipeline's value comes from repeated measurement.
## Gotchas

- **`taskType` must be `"General"`** — not "Generation" or any other value. The job fails silently with other values.
- **Custom metric names in BOTH places** — must appear in the `metricNames` array AND the `customMetrics` array. Missing from `metricNames` = silently ignored. Missing from `customMetrics` = job fails.
- **`null` result means N/A, not 0** — when the judge determines a metric doesn't apply, Bedrock records `"result": null`:

```typescript
// WRONG — treats N/A as 0, dragging the average down
const wrongAvg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length;

// RIGHT — excludes N/A from the average
const numericScores = scores.filter((s): s is number => s !== null);
const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length;
```

- **`evaluatorModelConfig` appears twice** — once at the top level (built-in metrics), once inside `customMetricConfig` (custom metrics). Omitting either causes those metrics to fail.
- **`modelIdentifier` must match exactly** — the string in the JSONL `modelResponses[0].modelIdentifier` must be character-for-character identical to `inferenceSourceIdentifier` in inference-config.json. Mismatch = model mapping error.
- **AWS CLI 2.33+ required** — older versions silently drop `customMetricConfig` and `precomputedInferenceSource`. Job creation succeeds but the job fails. Always check `aws --version`.
- **Job names: lowercase + hyphens, max 63 chars** — pattern: `[a-z0-9](-*[a-z0-9]){0,62}`. Must be unique across all jobs. Use timestamps: `--job-name "my-eval-$(date +%Y%m%d-%H%M)"`.
- **S3 output is deeply nested** — `<prefix>/<job-name>/<job-name>/<random-id>/models/...`. Use `aws s3 sync` and search for `_output.jsonl`. Do not construct paths manually.
- **`referenceResponse` improves Correctness/Completeness** — empty string is valid, but providing reference responses gives the judge a baseline for comparison.
- **`<thinking>` tag leakage (model-specific)** — some models (e.g., Amazon Nova Lite) may leak `<thinking>` blocks into responses. If present, strip before writing JSONL:

```typescript
const clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim();
```

- **us-east-1 S3 bucket creation** — do NOT pass `--create-bucket-configuration LocationConstraint` for `us-east-1`. Other regions require it.
## Cost Estimation

Formula:

```
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price
```
Example: 30 prompts, 10 metrics, Nova Pro judge:
- Response collection (Nova Lite): ~$0.02
- Evaluation job (Nova Pro): ~$0.58
- Total per run: ~$0.61
Scaling: Cost is linear with prompts and metrics. 100 prompts x 10 metrics ≈ $5. Judge cost dominates at ~95%. Adding 1 custom metric adds ~$0.06/run (30 prompts, Nova Pro).