aws-bedrock-evals

Original🇺🇸 English
Translated

Build and run LLM-as-judge evaluation pipelines using Amazon Bedrock Evaluation Jobs with pre-computed inference datasets. Use when setting up automated model evaluation, designing test scenarios, collecting pre-computed responses, configuring custom metrics, creating AWS infrastructure, running evaluation jobs, parsing results, and iterating on findings.

8installs
Added on

NPX Install

npx skill4agent add antstackio/skills aws-bedrock-evals

AWS Bedrock Evaluation Jobs


Overview

Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
Pre-computed Inference vs Live Inference
ModeHow it worksUse when
Live InferenceBedrock generates responses during the eval jobSimple prompt-in/text-out, no tool calling
Pre-computed InferenceYou pre-collect responses and supply them in a JSONL datasetTool calling, multi-turn conversations, custom orchestration, models outside Bedrock
Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.
Pipeline
Design Scenarios → Collect Responses → Upload to S3 → Run Eval Job → Parse Results → Act on Findings
       |                  |                  |               |               |               |
  scenarios.json    Your app's API     s3://bucket/     create-         s3 sync +       Fix prompt,
  (multi-turn)      → dataset JSONL    datasets/        evaluation-job  parse JSONL      retune metrics

Agent Behavior: Gather Inputs and Show Cost Estimate

Before generating any configs, scripts, or artifacts, you MUST gather the following from the user:
  1. AWS Region — Which region to use (default:
    us-east-1
    ). Affects model availability and pricing.
  2. Target model — The model their application uses (e.g.,
    amazon.nova-lite-v1:0
    ,
    anthropic.claude-3-haiku
    ).
  3. Evaluator (judge) model — The model to score responses (e.g.,
    amazon.nova-pro-v1:0
    ). Should be at least as capable as the target.
  4. Application type — Brief description of what the app does. Used to design test scenarios and derive custom metrics.
  5. Number of test scenarios — How many they plan to test (recommend 13-20 for first run).
  6. Estimated JSONL entries — Derived from scenarios x avg turns per scenario.
  7. Number of metrics — Total (built-in + custom). Recommend starting with 6 built-in + 3-5 custom.
  8. S3 bucket — Existing bucket name or confirm creation of a new one.
  9. IAM role — Existing role ARN or confirm creation of a new one.

Cost Estimate

After gathering inputs, you MUST display a cost estimate before proceeding:
## Estimated Cost Summary

| Item | Details | Est. Cost |
|------|---------|-----------|
| Response collection | {N} prompts x ~{T} tokens x {target_model_price} | ${X.XX} |
| Evaluation job | {N} prompts x {M} metrics x ~1,700 tokens x {judge_model_price} | ${X.XX} |
| S3 storage | < 1 MB | < $0.01 |
| **Total per run** | | **~${X.XX}** |

Scaling: Each additional run costs ~${X.XX}. Adding 1 custom metric adds ~${Y.YY}/run.
Cost formulas:
  • Response collection:
    num_prompts x avg_input_tokens x input_price + num_prompts x avg_output_tokens x output_price
  • Evaluation job:
    num_prompts x num_metrics x ~1,500 input_tokens x judge_input_price + num_prompts x num_metrics x ~200 output_tokens x judge_output_price
Model pricing reference:
ModelInput (per 1M tokens)Output (per 1M tokens)
amazon.nova-lite-v1:0$0.06$0.24
amazon.nova-pro-v1:0$0.80$3.20
anthropic.claude-3-haiku$0.25$1.25
anthropic.claude-3-sonnet$3.00$15.00

Prerequisites

bash
# AWS CLI 2.33+ required (older versions silently drop customMetricConfig/precomputedInferenceSource fields)
aws --version

# Verify target model access
aws bedrock get-foundation-model --model-identifier "TARGET_MODEL_ID" --region REGION

# Verify evaluator model access
aws bedrock get-foundation-model --model-identifier "EVALUATOR_MODEL_ID" --region REGION
Good evaluator model choices:
amazon.nova-pro-v1:0
,
anthropic.claude-3-sonnet
,
anthropic.claude-3-haiku
. The evaluator should be at least as capable as your target model.

Step 1: Design Test Scenarios

List the application's functional areas (e.g., greeting, booking-flow, error-handling, etc.). Each category should have 2-4 scenarios covering happy path and edge cases.
Scenario JSON format:
json
[
  {
    "id": "greeting-known-user",
    "category": "greeting",
    "context": { "userId": "user-123" },
    "turns": ["hello"]
  },
  {
    "id": "multi-step-flow",
    "category": "core-flow",
    "context": { "userId": "user-456" },
    "turns": [
      "hello",
      "I need help with X",
      "yes, proceed with that",
      "thanks"
    ]
  }
]
The
context
field holds any session/user data your app needs. Each turn in the array is one user message; the collection step handles the multi-turn conversation loop.
Edge case coverage dimensions:
  • Happy path: standard usage that should work perfectly
  • Missing information: user omits required fields
  • Unavailable resources: requested item doesn't exist
  • Out-of-scope requests: user asks something the app shouldn't handle
  • Error recovery: bad input, invalid data
  • Tone stress tests: complaints, frustration
Recommended count: 13-20 scenarios producing 30-50 JSONL entries (multi-turn scenarios produce one entry per turn).

Step 2: Collect Responses

Collect responses from your application however it runs. The goal is to produce a JSONL dataset file where each line contains the prompt, the model's response, and metadata.
Example pattern: Converse API with tool-calling loop (TypeScript)
This applies when your application uses Bedrock with tool calling:
typescript
import {
  BedrockRuntimeClient,
  ConverseCommand,
  type Message,
  type SystemContentBlock,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

async function converseLoop(
  messages: Message[],
  systemPrompt: SystemContentBlock[],
  tools: any[]
): Promise<string> {
  const MAX_TOOL_ROUNDS = 10;

  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const response = await client.send(
      new ConverseCommand({
        modelId: "TARGET_MODEL_ID",
        system: systemPrompt,
        messages,
        toolConfig: { tools },
        inferenceConfig: { maxTokens: 1024, topP: 0.9, temperature: 0.7 },
      })
    );

    const assistantContent = response.output?.message?.content as any[];
    if (!assistantContent) return "[No response from model]";

    messages.push({ role: "assistant", content: assistantContent });

    const toolUseBlocks = assistantContent.filter(
      (block: any) => block.toolUse != null
    );

    if (toolUseBlocks.length === 0) {
      return assistantContent
        .filter((block: any) => block.text != null)
        .map((block: any) => block.text as string)
        .join("\n") || "[Empty response]";
    }

    const toolResultBlocks: any[] = [];
    for (const block of toolUseBlocks) {
      const { toolUseId, name, input } = block.toolUse;
      const result = await executeTool(name, input);
      toolResultBlocks.push({
        toolResult: { toolUseId, content: [{ json: result }] },
      });
    }

    messages.push({ role: "user", content: toolResultBlocks } as Message);
  }

  return "[Max tool rounds exceeded]";
}
Multi-turn handling: Maintain the
messages
array across turns and build the dataset prompt field with conversation history:
typescript
const messages: Message[] = [];
const conversationHistory: { role: string; text: string }[] = [];

for (let i = 0; i < scenario.turns.length; i++) {
  const userTurn = scenario.turns[i];
  messages.push({ role: "user", content: [{ text: userTurn }] });

  const assistantText = await converseLoop(messages, systemPrompt, tools);

  conversationHistory.push({ role: "user", text: userTurn });
  conversationHistory.push({ role: "assistant", text: assistantText });

  let prompt: string;
  if (i === 0) {
    prompt = userTurn;
  } else {
    prompt = conversationHistory
      .map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.text}`)
      .join("\n");
  }

  entries.push({
    prompt,
    category: scenario.category,
    referenceResponse: "",
    modelResponses: [
      { response: assistantText, modelIdentifier: "my-app-v1" },
    ],
  });
}

Dataset JSONL Format

Each line must have this structure:
json
{
  "prompt": "User question or multi-turn history",
  "referenceResponse": "",
  "modelResponses": [
    {
      "response": "The model's actual output text",
      "modelIdentifier": "my-app-v1"
    }
  ]
}
FieldRequiredNotes
prompt
YesUser input. For multi-turn, concatenate:
User: ...\nAssistant: ...\nUser: ...
referenceResponse
NoExpected/ideal response. Can be empty string. Needed for
Builtin.Correctness
and
Builtin.Completeness
to work properly. Maps to
{{ground_truth}}
template variable
modelResponses
YesArray with exactly one entry for pre-computed inference
modelResponses[0].response
YesThe model's actual output text
modelResponses[0].modelIdentifier
YesAny string label. Must match
inferenceSourceIdentifier
in inference-config.json
Constraints: One model response per prompt. One unique
modelIdentifier
per job. Max 1000 prompts per job.
Write JSONL:
typescript
const jsonl = entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
writeFileSync("datasets/collected-responses.jsonl", jsonl, "utf-8");

Step 3: Design Metrics

Built-In Metrics

Bedrock provides 11 built-in metrics requiring no configuration beyond listing them by name:
Metric NameWhat It Measures
Builtin.Correctness
Is the factual content accurate? (works best with
referenceResponse
)
Builtin.Completeness
Does the response fully cover the request? (works best with
referenceResponse
)
Builtin.Faithfulness
Is the response faithful to the provided context/source?
Builtin.Helpfulness
Is the response useful, actionable, and cooperative?
Builtin.Coherence
Is the response logically structured and easy to follow?
Builtin.Relevance
Does the response address the actual question?
Builtin.FollowingInstructions
Does the response follow explicit instructions in the prompt?
Builtin.ProfessionalStyleAndTone
Is spelling, grammar, and tone appropriate?
Builtin.Harmfulness
Does the response contain harmful content?
Builtin.Stereotyping
Does the response contain stereotypes or bias?
Builtin.Refusal
Does the response appropriately refuse harmful requests?
Score interpretation:
1.0
= best,
0.0
= worst,
null
= N/A (judge could not evaluate).
Note:
referenceResponse
is needed for
Builtin.Correctness
and
Builtin.Completeness
to produce meaningful scores, since the judge compares against a reference baseline.

When to Use Custom Metrics

Use custom metrics to check domain-specific behaviors the built-in metrics don't cover. If you find yourself thinking "this scored well on Helpfulness but violated a critical business rule" — that's a custom metric.
Technique: Extract rules from your system prompt. Every rule in your system prompt is a candidate metric:
System prompt says:                          Candidate metric:
────────────────────────────────────────────────────────────────
"Keep responses to 2-3 sentences max"     → response_brevity
"Always greet returning users by name"    → personalized_greeting
"Never proceed without user confirmation" → confirmation_check
"Ask for missing details, don't assume"   → missing_info_followup

Custom Metric JSON Anatomy

json
{
  "customMetricDefinition": {
    "metricName": "my_metric_name",
    "instructions": "You are evaluating ... \n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
FieldDetails
metricName
Snake_case identifier. Must appear in BOTH
customMetrics
array AND
metricNames
array
instructions
Full prompt sent to the judge. Must include
{{prompt}}
and
{{prediction}}
template variables. Can also use
{{ground_truth}}
(maps to
referenceResponse
). Input variables must come last in the prompt.
ratingScale
Array of rating levels. Each has a
definition
(label, max 5 words / 100 chars) and
value
with either
floatValue
or
stringValue
Official constraints:
  • Max 10 custom metrics per job
  • Instructions max 5000 characters
  • Rating
    definition
    max 5 words / 100 characters
  • Input variables (
    {{prompt}}
    ,
    {{prediction}}
    ,
    {{ground_truth}}
    ) must come last in the instruction text

Complete Custom Metric Example

A metric that checks whether the assistant follows a domain-specific rule, with N/A handling for irrelevant prompts:
json
{
  "customMetricDefinition": {
    "metricName": "confirmation_check",
    "instructions": "You are evaluating an assistant application. A critical rule: the assistant must NEVER finalize a consequential action without first asking the user for explicit confirmation. Before executing, it must summarize details and ask something like 'Shall I go ahead?'.\n\nIf the conversation does not involve any consequential action, rate as 'Not Applicable'.\n\n- Not Applicable: No consequential action in this response\n- Poor: Proceeds with action without asking for confirmation\n- Good: Asks for confirmation before finalizing the action\n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "N/A", "value": { "floatValue": -1 } },
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
When the judge selects N/A (
floatValue: -1
), Bedrock records
"result": null
. Your parser must handle
null
— treat as N/A and exclude from averages.

Rating Scale Design

  • 3-4 levels for quality scales (Poor/Acceptable/Good/Excellent)
  • 2 levels for binary checks (Poor/Good)
  • Add "N/A" level with
    -1
    for conditional metrics that only apply to certain prompt types
  • Rating values can use
    floatValue
    (numeric) or
    stringValue
    (text)

Tips for Writing Metric Instructions

  • Be explicit about what "good" and "bad" look like — include examples of phrases or behaviors
  • For conditional metrics, describe the N/A condition clearly so the judge doesn't score 0 when it should skip
  • Keep instructions under ~500 words to fit within context alongside prompt and response
  • Test with a few examples before running a full eval job

Step 4: AWS Infrastructure

S3 Bucket

bash
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="my-eval-${ACCOUNT_ID}-${REGION}"

# us-east-1 does not accept LocationConstraint
if [ "${REGION}" = "us-east-1" ]; then
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}"
else
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}" \
    --create-bucket-configuration LocationConstraint="${REGION}"
fi
Upload the dataset:
bash
aws s3 cp datasets/collected-responses.jsonl \
  "s3://${BUCKET_NAME}/datasets/collected-responses.jsonl"

IAM Role

Trust policy (must include
aws:SourceAccount
condition — Bedrock rejects the role without it):
json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "YOUR_ACCOUNT_ID"
        }
      }
    }
  ]
}
Permissions policy:
json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DatasetRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET",
        "arn:aws:s3:::YOUR_BUCKET/datasets/*"
      ]
    },
    {
      "Sid": "S3ResultsWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET/results/*"]
    },
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": [
        "arn:aws:bedrock:REGION::foundation-model/EVALUATOR_MODEL_ID"
      ]
    }
  ]
}
Replace
YOUR_BUCKET
,
REGION
, and
EVALUATOR_MODEL_ID
with actual values.
Create the role:
bash
ROLE_NAME="BedrockEvalRole"

ROLE_ARN=$(aws iam create-role \
  --role-name "${ROLE_NAME}" \
  --assume-role-policy-document file://trust-policy.json \
  --description "Allows Bedrock to run evaluation jobs" \
  --query "Role.Arn" --output text)

aws iam put-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-name "BedrockEvalPolicy" \
  --policy-document file://permissions-policy.json

Step 5: Configure and Run Eval Job

eval-config.json

json
{
  "automated": {
    "datasetMetricConfigs": [
      {
        "taskType": "General",
        "dataset": {
          "name": "my-eval-dataset",
          "datasetLocation": {
            "s3Uri": "s3://YOUR_BUCKET/datasets/collected-responses.jsonl"
          }
        },
        "metricNames": [
          "Builtin.Helpfulness",
          "Builtin.FollowingInstructions",
          "Builtin.ProfessionalStyleAndTone",
          "Builtin.Relevance",
          "Builtin.Completeness",
          "Builtin.Correctness",
          "my_custom_metric_1",
          "my_custom_metric_2"
        ]
      }
    ],
    "evaluatorModelConfig": {
      "bedrockEvaluatorModels": [
        { "modelIdentifier": "EVALUATOR_MODEL_ID" }
      ]
    },
    "customMetricConfig": {
      "customMetrics": [
        {
          "customMetricDefinition": {
            "metricName": "my_custom_metric_1",
            "instructions": "... {{prompt}} ... {{prediction}} ...",
            "ratingScale": [
              { "definition": "Poor", "value": { "floatValue": 0 } },
              { "definition": "Good", "value": { "floatValue": 1 } }
            ]
          }
        }
      ],
      "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
          { "modelIdentifier": "EVALUATOR_MODEL_ID" }
        ]
      }
    }
  }
}
Critical structure notes:
  1. taskType
    must be
    "General"
    (not "Generation" or any other value)
  2. Custom metric names must appear in both
    metricNames
    array AND
    customMetrics
    array
  3. evaluatorModelConfig
    appears twice: once at the top level (for built-in metrics) and once inside
    customMetricConfig
    (for custom metrics) — both must specify the same evaluator model
  4. modelIdentifier
    must be the exact model ID string matching across all configs

inference-config.json

For pre-computed inference, this tells Bedrock that responses are already collected:
json
{
  "models": [
    {
      "precomputedInferenceSource": {
        "inferenceSourceIdentifier": "my-app-v1"
      }
    }
  ]
}
The
inferenceSourceIdentifier
must match the
modelIdentifier
in your JSONL dataset's
modelResponses
.

Running the Job

bash
aws bedrock create-evaluation-job \
  --job-name "my-eval-$(date +%Y%m%d-%H%M)" \
  --role-arn "${ROLE_ARN}" \
  --evaluation-config file://eval-config.json \
  --inference-config file://inference-config.json \
  --output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
  --region us-east-1
CLI notes:
  • Required params:
    --job-name
    ,
    --role-arn
    ,
    --evaluation-config
    ,
    --inference-config
    ,
    --output-data-config
  • Optional:
    --application-type
    (e.g.,
    ModelEvaluation
    )
  • --job-name
    constraint:
    [a-z0-9](-*[a-z0-9]){0,62}
    — lowercase + hyphens only, max 63 chars. Must be unique (use timestamps).
  • --evaluation-config
    and
    --inference-config
    are document types — must use
    file://
    or inline JSON, no shorthand syntax
  • --output-data-config
    is a structure — supports both inline JSON and shorthand (
    s3Uri=string
    )

Monitoring

bash
# List evaluation jobs (with optional filters)
aws bedrock list-evaluation-jobs --region us-east-1
aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1
aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1

# Get details for a specific job
aws bedrock get-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1

# Cancel a running job
aws bedrock stop-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1
Job statuses:
InProgress
,
Completed
,
Failed
,
Stopping
,
Stopped
,
Deleting
Jobs typically take 5-15 minutes for 30-50 entry datasets. If a job fails, check
failureMessages
in the job details.

Step 6: Parse Results

S3 Output Directory Structure

Bedrock writes results to a deeply nested path:
s3://YOUR_BUCKET/results/
  └── <job-name>/
      └── <job-name>/
          ├── amazon-bedrock-evaluations-permission-check   ← empty sentinel
          └── <random-id>/
              ├── custom_metrics/                            ← metric definitions (NOT results)
              └── models/
                  └── <model-identifier>/
                      └── taskTypes/General/datasets/<dataset-name>/
                          └── <uuid>_output.jsonl            ← actual results
The job name is repeated twice. The random ID changes every run. Use
aws s3 sync
— do not construct paths manually.

Download Results

bash
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1

Result JSONL Format

Each line:
json
{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Builtin.Helpfulness",
        "result": 0.6667,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "The response provides useful information..."
          }
        ]
      },
      {
        "metricName": "confirmation_check",
        "result": null,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "This conversation does not involve any consequential action..."
          }
        ]
      }
    ]
  },
  "inputRecord": {
    "prompt": "hello",
    "referenceResponse": "",
    "modelResponses": [
      { "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
    ]
  }
}
  • result
    is a number (score) or
    null
    (N/A)
  • evaluatorDetails[0].explanation
    contains the judge's written reasoning

Parsing and Aggregation

typescript
interface PromptResult {
  prompt: string;
  category: string;
  modelResponse: string;
  scores: Record<string, {
    score: string;
    reasoning?: string;
    rawScore?: number;
  }>;
}

for (const s of entry.automatedEvaluationResult.scores) {
  scores[s.metricName] = {
    score: s.result === null ? "N/A" : String(s.result),
    reasoning: s.evaluatorDetails?.[0]?.explanation,
    rawScore: typeof s.result === "number" ? s.result : undefined,
  };
}
Aggregation approach:
  1. Overall averages per metric — exclude N/A entries
  2. Per-category breakdown — group by category field, compute averages within each
  3. Low-score alerts — flag entries below threshold (built-in < 0.5, custom <= 0)
Low-score alert format:
[Builtin.Relevance] score=0.50 | "hello..."
  Reason: The response does not directly address the greeting...

[confirmation_check] score=0.00 | "User: proceed with X..."
  Reason: The assistant executed the action without asking for confirmation...

Step 7: Eval-Fix-Reeval Loop

Common Fixes

FindingFix
Low brevity scoresAdd hard constraint: "Respond in no more than 3 sentences."
Low confirmation_checkAdd: "Before executing, summarize details and ask for confirmation."
Low missing_info_followupAdd: "If any required field is missing, ask for it. Do not assume."
Low tone on negative outcomesAdd empathy instructions for bad-news scenarios
Low Completeness on simple promptsMetric/data issue — add
referenceResponse
or filter from Completeness

Metric Refinement

  • High N/A rates (>60%) — metric too narrowly scoped. Split dataset or adjust scope.
  • All-high scores — instructions too lenient. Add specific failure criteria.
  • Inconsistent scoring — instructions ambiguous. Add concrete examples per rating level.

Run Comparison

Run 1 (baseline):    response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes):  response_brevity avg=0.85, custom_tone avg=0.90
Track scores over time. The pipeline's value comes from repeated measurement.

Gotchas

  1. taskType
    must be
    "General"
    — not "Generation" or any other value. The job fails silently with other values.
  2. Custom metric names in BOTH places — must appear in
    metricNames
    array AND
    customMetrics
    array. Missing from
    metricNames
    = silently ignored. Missing from
    customMetrics
    = job fails.
  3. null
    result means N/A, not 0
    — when the judge determines a metric doesn't apply, Bedrock records
    null
    :
    typescript
    // WRONG — treats N/A as 0
    const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length;
    
    // RIGHT — excludes N/A from average
    const numericScores = scores.filter((s): s is number => s !== null);
    const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length;
  4. evaluatorModelConfig
    appears twice
    — once at top level (built-in metrics), once inside
    customMetricConfig
    (custom metrics). Omitting either causes those metrics to fail.
  5. modelIdentifier
    must match exactly
    — the string in JSONL
    modelResponses
    must be character-for-character identical to
    inferenceSourceIdentifier
    in inference-config.json. Mismatch = model mapping error.
  6. AWS CLI 2.33+ required — older versions silently drop
    customMetricConfig
    and
    precomputedInferenceSource
    . Job creation succeeds but the job fails. Always check
    aws --version
    .
  7. Job names: lowercase + hyphens, max 63 chars — pattern:
    [a-z0-9](-*[a-z0-9]){0,62}
    . Must be unique across all jobs. Use timestamps:
    --job-name "my-eval-$(date +%Y%m%d-%H%M)"
    .
  8. S3 output is deeply nested
    <prefix>/<job-name>/<job-name>/<random-id>/models/...
    . Use
    aws s3 sync
    and search for
    _output.jsonl
    . Do not construct paths manually.
  9. referenceResponse
    improves Correctness/Completeness
    — empty string is valid, but providing reference responses gives the judge a baseline for comparison.
  10. <thinking>
    tag leakage (model-specific)
    — some models (e.g., Amazon Nova Lite) may leak
    <thinking>...</thinking>
    blocks into responses. If present, strip before writing JSONL:
    typescript
    const clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim();
  11. us-east-1 S3 bucket creation — do NOT pass
    LocationConstraint
    for
    us-east-1
    . Other regions require it.

Cost Estimation

Formula:
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price
Example: 30 prompts, 10 metrics, Nova Pro judge:
  • Response collection (Nova Lite): ~$0.02
  • Evaluation job (Nova Pro): ~$0.58
  • Total per run: ~$0.61
Scaling: Cost is linear with prompts and metrics. 100 prompts x 10 metrics ≈ $5. Judge cost dominates at ~95%. Adding 1 custom metric adds ~$0.06/run (30 prompts, Nova Pro).

References