# AWS Bedrock Evaluation Jobs

## Overview

Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
## Pre-computed Inference vs Live Inference

| Mode | How it works | Use when |
|---|---|---|
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |

Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.
## Pipeline

```
Design Scenarios → Collect Responses → Upload to S3 → Run Eval Job → Parse Results → Act on Findings
       |                  |                |              |               |               |
 scenarios.json     Your app's API    s3://bucket/   create-         s3 sync +      Fix prompt,
 (multi-turn)       → dataset JSONL   datasets/      evaluation-job  parse JSONL    retune metrics
```
## Agent Behavior: Gather Inputs and Show Cost Estimate

Before generating any configs, scripts, or artifacts, you MUST gather the following from the user:

- AWS Region — Which region to use (default: `us-east-1`). Affects model availability and pricing.
- Target model — The model their application uses (e.g., `amazon.nova-lite-v1:0`, `anthropic.claude-3-haiku`).
- Evaluator (judge) model — The model to score responses (e.g., `amazon.nova-pro-v1:0`). Should be at least as capable as the target.
- Application type — Brief description of what the app does. Used to design test scenarios and derive custom metrics.
- Number of test scenarios — How many they plan to test (recommend 13-20 for a first run).
- Estimated JSONL entries — Derived from scenarios x avg turns per scenario.
- Number of metrics — Total (built-in + custom). Recommend starting with 6 built-in + 3-5 custom.
- S3 bucket — Existing bucket name or confirm creation of a new one.
- IAM role — Existing role ARN or confirm creation of a new one.
### Cost Estimate

After gathering inputs, you MUST display a cost estimate before proceeding:
## Estimated Cost Summary
| Item | Details | Est. Cost |
|------|---------|-----------|
| Response collection | {N} prompts x ~{T} tokens x {target_model_price} | ${X.XX} |
| Evaluation job | {N} prompts x {M} metrics x ~1,700 tokens x {judge_model_price} | ${X.XX} |
| S3 storage | < 1 MB | < $0.01 |
| **Total per run** | | **~${X.XX}** |
Scaling: Each additional run costs ~${X.XX}. Adding 1 custom metric adds ~${Y.YY}/run.
Cost formulas:

- Response collection: `num_prompts x avg_input_tokens x input_price + num_prompts x avg_output_tokens x output_price`
- Evaluation job: `num_prompts x num_metrics x ~1,500 input_tokens x judge_input_price + num_prompts x num_metrics x ~200 output_tokens x judge_output_price`
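The judge-cost formula can be sketched as a small helper. The ~1,500 input / ~200 output token figures are the rough per-call averages from the formula above, not measured values, and actual billing may round differently:

```typescript
// Rough judge-side cost estimate for an eval job.
// Prices are per 1M tokens; token averages are assumptions from the formula above.
function estimateJudgeCost(
  numPrompts: number,
  numMetrics: number,
  judgeInputPricePer1M: number,
  judgeOutputPricePer1M: number,
  avgInputTokens = 1500,
  avgOutputTokens = 200
): number {
  const calls = numPrompts * numMetrics; // one judge call per prompt x metric
  const inputCost = (calls * avgInputTokens * judgeInputPricePer1M) / 1_000_000;
  const outputCost = (calls * avgOutputTokens * judgeOutputPricePer1M) / 1_000_000;
  return inputCost + outputCost;
}

// 30 prompts x 10 metrics judged by Nova Pro ($0.80 in / $3.20 out per 1M)
const est = estimateJudgeCost(30, 10, 0.8, 3.2);
console.log(est.toFixed(2)); // ~0.55
```

This lands close to the worked example later in this document; small differences come from rounding the token averages.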
Model pricing reference:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| amazon.nova-lite-v1:0 | $0.06 | $0.24 |
| amazon.nova-pro-v1:0 | $0.80 | $3.20 |
| anthropic.claude-3-haiku | $0.25 | $1.25 |
| anthropic.claude-3-sonnet | $3.00 | $15.00 |
## Prerequisites

```bash
# AWS CLI 2.33+ required (older versions silently drop customMetricConfig/precomputedInferenceSource fields)
aws --version

# Verify target model access
aws bedrock get-foundation-model --model-identifier "TARGET_MODEL_ID" --region REGION

# Verify evaluator model access
aws bedrock get-foundation-model --model-identifier "EVALUATOR_MODEL_ID" --region REGION
```
Good evaluator model choices: `amazon.nova-pro-v1:0` or `anthropic.claude-3-sonnet`. The evaluator should be at least as capable as your target model.
## Step 1: Design Test Scenarios
List the application's functional areas (e.g., greeting, booking-flow, error-handling, etc.). Each category should have 2-4 scenarios covering happy path and edge cases.
Scenario JSON format:

```json
[
  {
    "id": "greeting-known-user",
    "category": "greeting",
    "context": { "userId": "user-123" },
    "turns": ["hello"]
  },
  {
    "id": "multi-step-flow",
    "category": "core-flow",
    "context": { "userId": "user-456" },
    "turns": [
      "hello",
      "I need help with X",
      "yes, proceed with that",
      "thanks"
    ]
  }
]
```
The `context` field holds any session/user data your app needs. Each turn in the `turns` array is one user message; the collection step handles the multi-turn conversation loop.
Edge case coverage dimensions:
- Happy path: standard usage that should work perfectly
- Missing information: user omits required fields
- Unavailable resources: requested item doesn't exist
- Out-of-scope requests: user asks something the app shouldn't handle
- Error recovery: bad input, invalid data
- Tone stress tests: complaints, frustration
Recommended count: 13-20 scenarios producing 30-50 JSONL entries (multi-turn scenarios produce one entry per turn).
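A quick sanity check over the scenario file can confirm category coverage and predict the JSONL entry count before any responses are collected. This sketch assumes the scenario shape shown above and the 2-4-scenarios-per-category guideline from this section:

```typescript
interface Scenario {
  id: string;
  category: string;
  context: Record<string, unknown>;
  turns: string[];
}

// Each turn produces one JSONL entry, so total entries = sum of all turns.
function summarizeScenarios(scenarios: Scenario[]) {
  const perCategory: Record<string, number> = {};
  let totalEntries = 0;
  for (const s of scenarios) {
    perCategory[s.category] = (perCategory[s.category] ?? 0) + 1;
    totalEntries += s.turns.length;
  }
  // Guideline from this section: 2-4 scenarios per category.
  const thinCategories = Object.entries(perCategory)
    .filter(([, n]) => n < 2)
    .map(([c]) => c);
  return { perCategory, totalEntries, thinCategories };
}
```

Run it against `scenarios.json` before collection to catch categories with only a single scenario.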
## Step 2: Collect Responses
Collect responses from your application however it runs. The goal is to produce a JSONL dataset file where each line contains the prompt, the model's response, and metadata.
### Example pattern: Converse API with tool-calling loop (TypeScript)

This applies when your application uses Bedrock with tool calling:

```typescript
import {
  BedrockRuntimeClient,
  ConverseCommand,
  type Message,
  type SystemContentBlock,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

// executeTool is your application's own tool dispatcher.
declare function executeTool(name: string, input: unknown): Promise<unknown>;

async function converseLoop(
  messages: Message[],
  systemPrompt: SystemContentBlock[],
  tools: any[]
): Promise<string> {
  const MAX_TOOL_ROUNDS = 10;
  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const response = await client.send(
      new ConverseCommand({
        modelId: "TARGET_MODEL_ID",
        system: systemPrompt,
        messages,
        toolConfig: { tools },
        inferenceConfig: { maxTokens: 1024, topP: 0.9, temperature: 0.7 },
      })
    );
    const assistantContent = response.output?.message?.content as any[];
    if (!assistantContent) return "[No response from model]";
    messages.push({ role: "assistant", content: assistantContent });

    const toolUseBlocks = assistantContent.filter(
      (block: any) => block.toolUse != null
    );
    if (toolUseBlocks.length === 0) {
      // No tool calls left — return the assistant's final text.
      return assistantContent
        .filter((block: any) => block.text != null)
        .map((block: any) => block.text as string)
        .join("\n") || "[Empty response]";
    }

    // Execute each requested tool and feed the results back as a user turn.
    const toolResultBlocks: any[] = [];
    for (const block of toolUseBlocks) {
      const { toolUseId, name, input } = block.toolUse;
      const result = await executeTool(name, input);
      toolResultBlocks.push({
        toolResult: { toolUseId, content: [{ json: result }] },
      });
    }
    messages.push({ role: "user", content: toolResultBlocks } as Message);
  }
  return "[Max tool rounds exceeded]";
}
```
Multi-turn handling: Maintain the `messages` array across turns and build the dataset `prompt` field with conversation history:
```typescript
// Inside the per-scenario loop: `scenario`, `systemPrompt`, `tools`, and
// `entries` come from the surrounding collection harness.
const messages: Message[] = [];
const conversationHistory: { role: string; text: string }[] = [];

for (let i = 0; i < scenario.turns.length; i++) {
  const userTurn = scenario.turns[i];
  messages.push({ role: "user", content: [{ text: userTurn }] });
  const assistantText = await converseLoop(messages, systemPrompt, tools);
  conversationHistory.push({ role: "user", text: userTurn });

  // Build the dataset prompt from history up to and including the current
  // user turn; the assistant's reply being judged is recorded separately
  // as the response, so it must not also appear in the prompt.
  let prompt: string;
  if (i === 0) {
    prompt = userTurn;
  } else {
    prompt = conversationHistory
      .map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.text}`)
      .join("\n");
  }

  conversationHistory.push({ role: "assistant", text: assistantText });

  entries.push({
    prompt,
    category: scenario.category,
    referenceResponse: "",
    modelResponses: [
      { response: assistantText, modelIdentifier: "my-app-v1" },
    ],
  });
}
```
### Dataset JSONL Format

Each line must have this structure:

```json
{
  "prompt": "User question or multi-turn history",
  "referenceResponse": "",
  "modelResponses": [
    {
      "response": "The model's actual output text",
      "modelIdentifier": "my-app-v1"
    }
  ]
}
```
| Field | Required | Notes |
|---|---|---|
| `prompt` | Yes | User input. For multi-turn, concatenate: `User: ...\nAssistant: ...\nUser: ...` |
| `referenceResponse` | No | Expected/ideal response. Can be an empty string. Needed for `Builtin.Correctness` and `Builtin.Completeness` to work properly. Maps to the `{{ground_truth}}` template variable |
| `modelResponses` | Yes | Array with exactly one entry for pre-computed inference |
| `modelResponses[0].response` | Yes | The model's actual output text |
| `modelResponses[0].modelIdentifier` | Yes | Any string label. Must match `inferenceSourceIdentifier` in inference-config.json |

Constraints: One model response per prompt. One unique `modelIdentifier` per job. Max 1000 prompts per job.
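These constraints are cheap to check before uploading. A sketch of a pre-upload validator, assuming the entry shape shown above:

```typescript
interface DatasetEntry {
  prompt: string;
  referenceResponse: string;
  modelResponses: { response: string; modelIdentifier: string }[];
}

// Validates the pre-computed-inference constraints over raw JSONL lines.
function validateDataset(lines: string[]): string[] {
  const errors: string[] = [];
  const identifiers = new Set<string>();
  if (lines.length > 1000) errors.push("max 1000 prompts per job");
  lines.forEach((line, i) => {
    const entry: DatasetEntry = JSON.parse(line);
    if (entry.modelResponses.length !== 1) {
      errors.push(`line ${i + 1}: exactly one modelResponse required`);
    }
    entry.modelResponses.forEach((r) => identifiers.add(r.modelIdentifier));
  });
  if (identifiers.size > 1) {
    errors.push("all entries must share one modelIdentifier");
  }
  return errors;
}
```

Running this before `aws s3 cp` catches the identifier-mismatch and entry-count failures that otherwise only surface after the job starts.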
Write JSONL:

```typescript
import { writeFileSync } from "fs";

const jsonl = entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
writeFileSync("datasets/collected-responses.jsonl", jsonl, "utf-8");
```
## Step 3: Design Metrics

### Built-In Metrics

Bedrock provides 11 built-in metrics requiring no configuration beyond listing them by name:
| Metric Name | What It Measures |
|---|---|
| `Builtin.Correctness` | Is the factual content accurate? (works best with `referenceResponse`) |
| `Builtin.Completeness` | Does the response fully cover the request? (works best with `referenceResponse`) |
| `Builtin.Faithfulness` | Is the response faithful to the provided context/source? |
| `Builtin.Helpfulness` | Is the response useful, actionable, and cooperative? |
| `Builtin.Coherence` | Is the response logically structured and easy to follow? |
| `Builtin.Relevance` | Does the response address the actual question? |
| `Builtin.FollowingInstructions` | Does the response follow explicit instructions in the prompt? |
| `Builtin.ProfessionalStyleAndTone` | Is spelling, grammar, and tone appropriate? |
| `Builtin.Harmfulness` | Does the response contain harmful content? |
| `Builtin.Stereotyping` | Does the response contain stereotypes or bias? |
| `Builtin.Refusal` | Does the response appropriately refuse harmful requests? |

Score interpretation: `1` = best, `0` = worst, `null` = N/A (judge could not evaluate).

Note: `referenceResponse` is needed for `Builtin.Correctness` and `Builtin.Completeness` to produce meaningful scores, since the judge compares against a reference baseline.
### When to Use Custom Metrics

Use custom metrics to check domain-specific behaviors the built-in metrics don't cover. If you find yourself thinking "this scored well on Helpfulness but violated a critical business rule" — that's a custom metric.

Technique: Extract rules from your system prompt. Every rule in your system prompt is a candidate metric:

| System prompt says | Candidate metric |
|---|---|
| "Keep responses to 2-3 sentences max" | `response_brevity` |
| "Always greet returning users by name" | `personalized_greeting` |
| "Never proceed without user confirmation" | `confirmation_check` |
| "Ask for missing details, don't assume" | `missing_info_followup` |
### Custom Metric JSON Anatomy

```json
{
  "customMetricDefinition": {
    "metricName": "my_metric_name",
    "instructions": "You are evaluating ... \n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
```
| Field | Details |
|---|---|
| `metricName` | Snake_case identifier. Must appear in BOTH the `metricNames` array AND the `customMetrics` array |
| `instructions` | Full prompt sent to the judge. Must include the `{{prompt}}` and `{{prediction}}` template variables. Can also use `{{ground_truth}}` (maps to `referenceResponse`). Input variables must come last in the prompt. |
| `ratingScale` | Array of rating levels. Each has a `definition` (label, max 5 words / 100 chars) and a `value` with either `floatValue` or `stringValue` |

Official constraints:
- Max 10 custom metrics per job
- Instructions max 5000 characters
- Rating `definition` max 5 words / 100 characters
- Input variables (`{{prompt}}`, `{{prediction}}`, `{{ground_truth}}`) must come last in the instruction text
### Complete Custom Metric Example

A metric that checks whether the assistant follows a domain-specific rule, with N/A handling for irrelevant prompts:

```json
{
  "customMetricDefinition": {
    "metricName": "confirmation_check",
    "instructions": "You are evaluating an assistant application. A critical rule: the assistant must NEVER finalize a consequential action without first asking the user for explicit confirmation. Before executing, it must summarize details and ask something like 'Shall I go ahead?'.\n\nIf the conversation does not involve any consequential action, rate as 'Not Applicable'.\n\n- Not Applicable: No consequential action in this response\n- Poor: Proceeds with action without asking for confirmation\n- Good: Asks for confirmation before finalizing the action\n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "N/A", "value": { "floatValue": -1 } },
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
```
When the judge selects N/A (`floatValue: -1`), Bedrock records `"result": null`. Your parser must handle `null` — treat it as N/A and exclude it from averages.
### Rating Scale Design

- 3-4 levels for quality scales (Poor/Acceptable/Good/Excellent)
- 2 levels for binary checks (Poor/Good)
- Add an "N/A" level with `floatValue: -1` for conditional metrics that only apply to certain prompt types
- Rating values can use `floatValue` (numeric) or `stringValue` (text)
### Tips for Writing Metric Instructions
- Be explicit about what "good" and "bad" look like — include examples of phrases or behaviors
- For conditional metrics, describe the N/A condition clearly so the judge doesn't score 0 when it should skip
- Keep instructions under ~500 words to fit within context alongside prompt and response
- Test with a few examples before running a full eval job
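One cheap way to test instructions before running a full job is to render the template locally with a sample prompt/response pair and inspect (or manually send to the judge model) exactly what the judge would receive. This substitution helper assumes the `{{prompt}}`/`{{prediction}}`/`{{ground_truth}}` variable names used by custom metrics:

```typescript
// Substitutes custom-metric template variables into the instruction text,
// so you can inspect the exact prompt the judge would be given.
function renderInstructions(
  instructions: string,
  vars: { prompt: string; prediction: string; ground_truth?: string }
): string {
  return instructions
    .replace(/\{\{prompt\}\}/g, vars.prompt)
    .replace(/\{\{prediction\}\}/g, vars.prediction)
    .replace(/\{\{ground_truth\}\}/g, vars.ground_truth ?? "");
}
```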
## Step 4: AWS Infrastructure

### S3 Bucket

```bash
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="my-eval-${ACCOUNT_ID}-${REGION}"

# us-east-1 does not accept LocationConstraint
if [ "${REGION}" = "us-east-1" ]; then
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}"
else
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}" \
    --create-bucket-configuration LocationConstraint="${REGION}"
fi
```
Upload the dataset:

```bash
aws s3 cp datasets/collected-responses.jsonl \
  "s3://${BUCKET_NAME}/datasets/collected-responses.jsonl"
```
### IAM Role

Trust policy (must include the `aws:SourceAccount` condition — Bedrock rejects the role without it):
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "YOUR_ACCOUNT_ID"
        }
      }
    }
  ]
}
```
Permissions policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DatasetRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET",
        "arn:aws:s3:::YOUR_BUCKET/datasets/*"
      ]
    },
    {
      "Sid": "S3ResultsWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET/results/*"]
    },
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": [
        "arn:aws:bedrock:REGION::foundation-model/EVALUATOR_MODEL_ID"
      ]
    }
  ]
}
```
Replace `YOUR_BUCKET`, `REGION`, and `EVALUATOR_MODEL_ID` (and `YOUR_ACCOUNT_ID` in the trust policy) with actual values.
Create the role:

```bash
ROLE_NAME="BedrockEvalRole"
ROLE_ARN=$(aws iam create-role \
  --role-name "${ROLE_NAME}" \
  --assume-role-policy-document file://trust-policy.json \
  --description "Allows Bedrock to run evaluation jobs" \
  --query "Role.Arn" --output text)

aws iam put-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-name "BedrockEvalPolicy" \
  --policy-document file://permissions-policy.json
```
## Step 5: Configure and Run Eval Job

### eval-config.json

```json
{
  "automated": {
    "datasetMetricConfigs": [
      {
        "taskType": "General",
        "dataset": {
          "name": "my-eval-dataset",
          "datasetLocation": {
            "s3Uri": "s3://YOUR_BUCKET/datasets/collected-responses.jsonl"
          }
        },
        "metricNames": [
          "Builtin.Helpfulness",
          "Builtin.FollowingInstructions",
          "Builtin.ProfessionalStyleAndTone",
          "Builtin.Relevance",
          "Builtin.Completeness",
          "Builtin.Correctness",
          "my_custom_metric_1",
          "my_custom_metric_2"
        ]
      }
    ],
    "evaluatorModelConfig": {
      "bedrockEvaluatorModels": [
        { "modelIdentifier": "EVALUATOR_MODEL_ID" }
      ]
    },
    "customMetricConfig": {
      "customMetrics": [
        {
          "customMetricDefinition": {
            "metricName": "my_custom_metric_1",
            "instructions": "... {{prompt}} ... {{prediction}} ...",
            "ratingScale": [
              { "definition": "Poor", "value": { "floatValue": 0 } },
              { "definition": "Good", "value": { "floatValue": 1 } }
            ]
          }
        }
      ],
      "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
          { "modelIdentifier": "EVALUATOR_MODEL_ID" }
        ]
      }
    }
  }
}
```
Critical structure notes:

- `taskType` must be `"General"` (not "Generation" or any other value)
- Custom metric names must appear in both the `metricNames` array AND the `customMetrics` array
- `evaluatorModelConfig` appears twice: once at the top level (for built-in metrics) and once inside `customMetricConfig` (for custom metrics) — both must specify the same evaluator model
- `modelIdentifier` must be the exact model ID string, matching across all configs
### inference-config.json

For pre-computed inference, this tells Bedrock that responses are already collected:

```json
{
  "models": [
    {
      "precomputedInferenceSource": {
        "inferenceSourceIdentifier": "my-app-v1"
      }
    }
  ]
}
```
The `inferenceSourceIdentifier` must match the `modelIdentifier` in your JSONL dataset's `modelResponses[0]`.
### Running the Job

```bash
aws bedrock create-evaluation-job \
  --job-name "my-eval-$(date +%Y%m%d-%H%M)" \
  --role-arn "${ROLE_ARN}" \
  --evaluation-config file://eval-config.json \
  --inference-config file://inference-config.json \
  --output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
  --region us-east-1
```
CLI notes:

- Required params: `--job-name`, `--role-arn`, `--evaluation-config`, `--inference-config`, `--output-data-config`
- Optional: `--job-description`
- `--job-name` constraint: `[a-z0-9](-*[a-z0-9]){0,62}` — lowercase + hyphens only, max 63 chars. Must be unique (use timestamps).
- `--evaluation-config` and `--inference-config` are document types — must use `file://` or inline JSON, no shorthand syntax
- `--output-data-config` is a structure — supports both inline JSON and shorthand (`s3Uri=...`)
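The job-name constraint can be checked locally before calling the CLI. This sketch applies the pattern quoted above:

```typescript
// Bedrock job-name pattern: lowercase alphanumerics and hyphens,
// must start and end with an alphanumeric, max 63 characters.
const JOB_NAME_PATTERN = /^[a-z0-9](-*[a-z0-9]){0,62}$/;

function isValidJobName(name: string): boolean {
  return name.length <= 63 && JOB_NAME_PATTERN.test(name);
}
```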
### Monitoring

```bash
# List evaluation jobs (with optional filters)
aws bedrock list-evaluation-jobs --region us-east-1
aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1
aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1

# Get details for a specific job
aws bedrock get-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1

# Cancel a running job
aws bedrock stop-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1
```
Jobs typically take 5-15 minutes for 30-50 entry datasets. If a job fails, check `failureMessages` in the job details.
## Step 6: Parse Results

### S3 Output Directory Structure

Bedrock writes results to a deeply nested path:

```
s3://YOUR_BUCKET/results/
└── <job-name>/
    └── <job-name>/
        ├── amazon-bedrock-evaluations-permission-check   ← empty sentinel
        └── <random-id>/
            ├── custom_metrics/                           ← metric definitions (NOT results)
            └── models/
                └── <model-identifier>/
                    └── taskTypes/General/datasets/<dataset-name>/
                        └── <uuid>_output.jsonl           ← actual results
```

The job name is repeated twice. The random ID changes every run. Use `aws s3 sync` — do not construct paths manually.
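After syncing, the result file can be located with a recursive search rather than a hand-built path. A sketch that walks the synced directory for files ending in `_output.jsonl`:

```typescript
import * as fs from "fs";
import * as path from "path";

// Recursively finds result files ending in `_output.jsonl` under a synced
// results directory, regardless of the nested job-name/random-id layers.
function findOutputFiles(dir: string): string[] {
  const hits: string[] = [];
  for (const name of fs.readdirSync(dir)) {
    const full = path.join(dir, name);
    if (fs.statSync(full).isDirectory()) {
      hits.push(...findOutputFiles(full));
    } else if (name.endsWith("_output.jsonl")) {
      hits.push(full);
    }
  }
  return hits;
}
```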
### Download Results

```bash
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1
```
### Result JSONL Format

Each line:

```json
{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Builtin.Helpfulness",
        "result": 0.6667,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "The response provides useful information..."
          }
        ]
      },
      {
        "metricName": "confirmation_check",
        "result": null,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "This conversation does not involve any consequential action..."
          }
        ]
      }
    ]
  },
  "inputRecord": {
    "prompt": "hello",
    "referenceResponse": "",
    "modelResponses": [
      { "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
    ]
  }
}
```
- `result` is a number (score) or `null` (N/A)
- `evaluatorDetails[0].explanation` contains the judge's written reasoning
### Parsing and Aggregation

```typescript
interface PromptResult {
  prompt: string;
  category: string;
  modelResponse: string;
  scores: Record<string, {
    score: string;      // numeric score as a string, or "N/A"
    reasoning?: string; // judge's written explanation
    rawScore?: number;  // numeric value when present, for averaging
  }>;
}

// `entry` is one parsed line of the results JSONL.
declare const entry: any;

const scores: PromptResult["scores"] = {};
for (const s of entry.automatedEvaluationResult.scores) {
  scores[s.metricName] = {
    score: s.result === null ? "N/A" : String(s.result),
    reasoning: s.evaluatorDetails?.[0]?.explanation,
    rawScore: typeof s.result === "number" ? s.result : undefined,
  };
}
```
Aggregation approach:
- Overall averages per metric — exclude N/A entries
- Per-category breakdown — group by category field, compute averages within each
- Low-score alerts — flag entries below threshold (built-in < 0.5, custom <= 0)
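The aggregation rules above can be sketched directly, with metric results as plain `number | null` values as in the result JSONL:

```typescript
// Per-metric averages that exclude N/A (null) results entirely.
function averageByMetric(
  rows: { metricName: string; result: number | null }[]
): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const { metricName, result } of rows) {
    if (result === null) continue; // N/A is excluded, not counted as 0
    const bucket = (sums[metricName] ??= { total: 0, count: 0 });
    bucket.total += result;
    bucket.count += 1;
  }
  const averages: Record<string, number> = {};
  for (const [name, { total, count }] of Object.entries(sums)) {
    averages[name] = total / count;
  }
  return averages;
}
```

The same grouping, keyed by `category` instead of `metricName`, gives the per-category breakdown.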
Low-score alert format:

```
[Builtin.Relevance] score=0.50 | "hello..."
  Reason: The response does not directly address the greeting...
[confirmation_check] score=0.00 | "User: proceed with X..."
  Reason: The assistant executed the action without asking for confirmation...
```
## Step 7: Eval-Fix-Reeval Loop

### Common Fixes

| Finding | Fix |
|---|---|
| Low brevity scores | Add hard constraint: "Respond in no more than 3 sentences." |
| Low confirmation_check | Add: "Before executing, summarize details and ask for confirmation." |
| Low missing_info_followup | Add: "If any required field is missing, ask for it. Do not assume." |
| Low tone on negative outcomes | Add empathy instructions for bad-news scenarios |
| Low Completeness on simple prompts | Metric/data issue — add a `referenceResponse` or exclude simple prompts from Completeness |
### Metric Refinement
- High N/A rates (>60%) — metric too narrowly scoped. Split dataset or adjust scope.
- All-high scores — instructions too lenient. Add specific failure criteria.
- Inconsistent scoring — instructions ambiguous. Add concrete examples per rating level.
### Run Comparison

```
Run 1 (baseline):   response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes): response_brevity avg=0.85, custom_tone avg=0.90
```

Track scores over time. The pipeline's value comes from repeated measurement.
## Gotchas

- **`taskType` must be `"General"`** — not "Generation" or any other value. The job fails silently with other values.
- **Custom metric names in BOTH places** — must appear in the `metricNames` array AND the `customMetrics` array. Missing from `metricNames` = silently ignored. Missing from `customMetrics` = job fails.
- **`null` result means N/A, not 0** — when the judge determines a metric doesn't apply, Bedrock records `"result": null`:

```typescript
// WRONG — treats N/A as 0, dragging the average down
const wrongAvg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length;

// RIGHT — excludes N/A from the average
const numericScores = scores.filter((s): s is number => s !== null);
const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length;
```

- **`evaluatorModelConfig` appears twice** — once at the top level (built-in metrics), once inside `customMetricConfig` (custom metrics). Omitting either causes those metrics to fail.
- **`modelIdentifier` must match exactly** — the string in the JSONL `modelResponses[0].modelIdentifier` must be character-for-character identical to `inferenceSourceIdentifier` in inference-config.json. Mismatch = model mapping error.
- **AWS CLI 2.33+ required** — older versions silently drop `customMetricConfig` and `precomputedInferenceSource`. Job creation succeeds but the job fails. Always check `aws --version`.
- **Job names: lowercase + hyphens, max 63 chars** — pattern: `[a-z0-9](-*[a-z0-9]){0,62}`. Must be unique across all jobs. Use timestamps: `--job-name "my-eval-$(date +%Y%m%d-%H%M)"`.
- **S3 output is deeply nested** — `<prefix>/<job-name>/<job-name>/<random-id>/models/...`. Use `aws s3 sync` and search for `_output.jsonl`. Do not construct paths manually.
- **`referenceResponse` improves Correctness/Completeness** — empty string is valid, but providing reference responses gives the judge a baseline for comparison.
- **`<thinking>` tag leakage (model-specific)** — some models (e.g., Amazon Nova Lite) may leak `<thinking>` blocks into responses. If present, strip before writing JSONL:

```typescript
const clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim();
```

- **us-east-1 S3 bucket creation** — do NOT pass `--create-bucket-configuration LocationConstraint` for `us-east-1`. Other regions require it.
## Cost Estimation

Formula:

```
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price
```
Example: 30 prompts, 10 metrics, Nova Pro judge:
- Response collection (Nova Lite): ~$0.02
- Evaluation job (Nova Pro): ~$0.58
- Total per run: ~$0.61
Scaling: Cost is linear with prompts and metrics. 100 prompts x 10 metrics ≈ $5. Judge cost dominates at ~95%. Adding 1 custom metric adds ~$0.06/run (30 prompts, Nova Pro).