aws-bedrock-evals
Original:🇺🇸 English
Translated
Build and run LLM-as-judge evaluation pipelines using Amazon Bedrock Evaluation Jobs with pre-computed inference datasets. Use when setting up automated model evaluation, designing test scenarios, collecting pre-computed responses, configuring custom metrics, creating AWS infrastructure, running evaluation jobs, parsing results, and iterating on findings.
8installs
Sourceantstackio/skills
Added on
NPX Install
npx skill4agent add antstackio/skills aws-bedrock-evalsTags
Translated version includes tags in frontmatterSKILL.md Content
View Translation Comparison →AWS Bedrock Evaluation Jobs
Overview
Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
Pre-computed Inference vs Live Inference
| Mode | How it works | Use when |
|---|---|---|
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |
Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.
Pipeline
Design Scenarios → Collect Responses → Upload to S3 → Run Eval Job → Parse Results → Act on Findings
| | | | | |
scenarios.json Your app's API s3://bucket/ create- s3 sync + Fix prompt,
(multi-turn) → dataset JSONL datasets/ evaluation-job parse JSONL retune metricsAgent Behavior: Gather Inputs and Show Cost Estimate
Before generating any configs, scripts, or artifacts, you MUST gather the following from the user:
- AWS Region — Which region to use (default: ). Affects model availability and pricing.
us-east-1 - Target model — The model their application uses (e.g., ,
amazon.nova-lite-v1:0).anthropic.claude-3-haiku - Evaluator (judge) model — The model to score responses (e.g., ). Should be at least as capable as the target.
amazon.nova-pro-v1:0 - Application type — Brief description of what the app does. Used to design test scenarios and derive custom metrics.
- Number of test scenarios — How many they plan to test (recommend 13-20 for first run).
- Estimated JSONL entries — Derived from scenarios x avg turns per scenario.
- Number of metrics — Total (built-in + custom). Recommend starting with 6 built-in + 3-5 custom.
- S3 bucket — Existing bucket name or confirm creation of a new one.
- IAM role — Existing role ARN or confirm creation of a new one.
Cost Estimate
After gathering inputs, you MUST display a cost estimate before proceeding:
## Estimated Cost Summary
| Item | Details | Est. Cost |
|------|---------|-----------|
| Response collection | {N} prompts x ~{T} tokens x {target_model_price} | ${X.XX} |
| Evaluation job | {N} prompts x {M} metrics x ~1,700 tokens x {judge_model_price} | ${X.XX} |
| S3 storage | < 1 MB | < $0.01 |
| **Total per run** | | **~${X.XX}** |
Scaling: Each additional run costs ~${X.XX}. Adding 1 custom metric adds ~${Y.YY}/run.Cost formulas:
- Response collection:
num_prompts x avg_input_tokens x input_price + num_prompts x avg_output_tokens x output_price - Evaluation job:
num_prompts x num_metrics x ~1,500 input_tokens x judge_input_price + num_prompts x num_metrics x ~200 output_tokens x judge_output_price
Model pricing reference:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| amazon.nova-lite-v1:0 | $0.06 | $0.24 |
| amazon.nova-pro-v1:0 | $0.80 | $3.20 |
| anthropic.claude-3-haiku | $0.25 | $1.25 |
| anthropic.claude-3-sonnet | $3.00 | $15.00 |
Prerequisites
bash
# AWS CLI 2.33+ required (older versions silently drop customMetricConfig/precomputedInferenceSource fields)
aws --version
# Verify target model access
aws bedrock get-foundation-model --model-identifier "TARGET_MODEL_ID" --region REGION
# Verify evaluator model access
aws bedrock get-foundation-model --model-identifier "EVALUATOR_MODEL_ID" --region REGIONGood evaluator model choices: , , . The evaluator should be at least as capable as your target model.
amazon.nova-pro-v1:0anthropic.claude-3-sonnetanthropic.claude-3-haikuStep 1: Design Test Scenarios
List the application's functional areas (e.g., greeting, booking-flow, error-handling, etc.). Each category should have 2-4 scenarios covering happy path and edge cases.
Scenario JSON format:
json
[
{
"id": "greeting-known-user",
"category": "greeting",
"context": { "userId": "user-123" },
"turns": ["hello"]
},
{
"id": "multi-step-flow",
"category": "core-flow",
"context": { "userId": "user-456" },
"turns": [
"hello",
"I need help with X",
"yes, proceed with that",
"thanks"
]
}
]The field holds any session/user data your app needs. Each turn in the array is one user message; the collection step handles the multi-turn conversation loop.
contextEdge case coverage dimensions:
- Happy path: standard usage that should work perfectly
- Missing information: user omits required fields
- Unavailable resources: requested item doesn't exist
- Out-of-scope requests: user asks something the app shouldn't handle
- Error recovery: bad input, invalid data
- Tone stress tests: complaints, frustration
Recommended count: 13-20 scenarios producing 30-50 JSONL entries (multi-turn scenarios produce one entry per turn).
Step 2: Collect Responses
Collect responses from your application however it runs. The goal is to produce a JSONL dataset file where each line contains the prompt, the model's response, and metadata.
Example pattern: Converse API with tool-calling loop (TypeScript)
This applies when your application uses Bedrock with tool calling:
typescript
import {
BedrockRuntimeClient,
ConverseCommand,
type Message,
type SystemContentBlock,
} from "@aws-sdk/client-bedrock-runtime";
const client = new BedrockRuntimeClient({ region: "us-east-1" });
async function converseLoop(
messages: Message[],
systemPrompt: SystemContentBlock[],
tools: any[]
): Promise<string> {
const MAX_TOOL_ROUNDS = 10;
for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
const response = await client.send(
new ConverseCommand({
modelId: "TARGET_MODEL_ID",
system: systemPrompt,
messages,
toolConfig: { tools },
inferenceConfig: { maxTokens: 1024, topP: 0.9, temperature: 0.7 },
})
);
const assistantContent = response.output?.message?.content as any[];
if (!assistantContent) return "[No response from model]";
messages.push({ role: "assistant", content: assistantContent });
const toolUseBlocks = assistantContent.filter(
(block: any) => block.toolUse != null
);
if (toolUseBlocks.length === 0) {
return assistantContent
.filter((block: any) => block.text != null)
.map((block: any) => block.text as string)
.join("\n") || "[Empty response]";
}
const toolResultBlocks: any[] = [];
for (const block of toolUseBlocks) {
const { toolUseId, name, input } = block.toolUse;
const result = await executeTool(name, input);
toolResultBlocks.push({
toolResult: { toolUseId, content: [{ json: result }] },
});
}
messages.push({ role: "user", content: toolResultBlocks } as Message);
}
return "[Max tool rounds exceeded]";
}Multi-turn handling: Maintain the array across turns and build the dataset prompt field with conversation history:
messagestypescript
const messages: Message[] = [];
const conversationHistory: { role: string; text: string }[] = [];
for (let i = 0; i < scenario.turns.length; i++) {
const userTurn = scenario.turns[i];
messages.push({ role: "user", content: [{ text: userTurn }] });
const assistantText = await converseLoop(messages, systemPrompt, tools);
conversationHistory.push({ role: "user", text: userTurn });
conversationHistory.push({ role: "assistant", text: assistantText });
let prompt: string;
if (i === 0) {
prompt = userTurn;
} else {
prompt = conversationHistory
.map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.text}`)
.join("\n");
}
entries.push({
prompt,
category: scenario.category,
referenceResponse: "",
modelResponses: [
{ response: assistantText, modelIdentifier: "my-app-v1" },
],
});
}Dataset JSONL Format
Each line must have this structure:
json
{
"prompt": "User question or multi-turn history",
"referenceResponse": "",
"modelResponses": [
{
"response": "The model's actual output text",
"modelIdentifier": "my-app-v1"
}
]
}| Field | Required | Notes |
|---|---|---|
| Yes | User input. For multi-turn, concatenate: |
| No | Expected/ideal response. Can be empty string. Needed for |
| Yes | Array with exactly one entry for pre-computed inference |
| Yes | The model's actual output text |
| Yes | Any string label. Must match |
Constraints: One model response per prompt. One unique per job. Max 1000 prompts per job.
modelIdentifierWrite JSONL:
typescript
const jsonl = entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
writeFileSync("datasets/collected-responses.jsonl", jsonl, "utf-8");Step 3: Design Metrics
Built-In Metrics
Bedrock provides 11 built-in metrics requiring no configuration beyond listing them by name:
| Metric Name | What It Measures |
|---|---|
| Is the factual content accurate? (works best with |
| Does the response fully cover the request? (works best with |
| Is the response faithful to the provided context/source? |
| Is the response useful, actionable, and cooperative? |
| Is the response logically structured and easy to follow? |
| Does the response address the actual question? |
| Does the response follow explicit instructions in the prompt? |
| Is spelling, grammar, and tone appropriate? |
| Does the response contain harmful content? |
| Does the response contain stereotypes or bias? |
| Does the response appropriately refuse harmful requests? |
Score interpretation: = best, = worst, = N/A (judge could not evaluate).
1.00.0nullNote: is needed for and to produce meaningful scores, since the judge compares against a reference baseline.
referenceResponseBuiltin.CorrectnessBuiltin.CompletenessWhen to Use Custom Metrics
Use custom metrics to check domain-specific behaviors the built-in metrics don't cover. If you find yourself thinking "this scored well on Helpfulness but violated a critical business rule" — that's a custom metric.
Technique: Extract rules from your system prompt. Every rule in your system prompt is a candidate metric:
System prompt says: Candidate metric:
────────────────────────────────────────────────────────────────
"Keep responses to 2-3 sentences max" → response_brevity
"Always greet returning users by name" → personalized_greeting
"Never proceed without user confirmation" → confirmation_check
"Ask for missing details, don't assume" → missing_info_followupCustom Metric JSON Anatomy
json
{
"customMetricDefinition": {
"metricName": "my_metric_name",
"instructions": "You are evaluating ... \n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
"ratingScale": [
{ "definition": "Poor", "value": { "floatValue": 0 } },
{ "definition": "Good", "value": { "floatValue": 1 } }
]
}
}| Field | Details |
|---|---|
| Snake_case identifier. Must appear in BOTH |
| Full prompt sent to the judge. Must include |
| Array of rating levels. Each has a |
Official constraints:
- Max 10 custom metrics per job
- Instructions max 5000 characters
- Rating max 5 words / 100 characters
definition - Input variables (,
{{prompt}},{{prediction}}) must come last in the instruction text{{ground_truth}}
Complete Custom Metric Example
A metric that checks whether the assistant follows a domain-specific rule, with N/A handling for irrelevant prompts:
json
{
"customMetricDefinition": {
"metricName": "confirmation_check",
"instructions": "You are evaluating an assistant application. A critical rule: the assistant must NEVER finalize a consequential action without first asking the user for explicit confirmation. Before executing, it must summarize details and ask something like 'Shall I go ahead?'.\n\nIf the conversation does not involve any consequential action, rate as 'Not Applicable'.\n\n- Not Applicable: No consequential action in this response\n- Poor: Proceeds with action without asking for confirmation\n- Good: Asks for confirmation before finalizing the action\n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
"ratingScale": [
{ "definition": "N/A", "value": { "floatValue": -1 } },
{ "definition": "Poor", "value": { "floatValue": 0 } },
{ "definition": "Good", "value": { "floatValue": 1 } }
]
}
}When the judge selects N/A (), Bedrock records . Your parser must handle — treat as N/A and exclude from averages.
floatValue: -1"result": nullnullRating Scale Design
- 3-4 levels for quality scales (Poor/Acceptable/Good/Excellent)
- 2 levels for binary checks (Poor/Good)
- Add "N/A" level with for conditional metrics that only apply to certain prompt types
-1 - Rating values can use (numeric) or
floatValue(text)stringValue
Tips for Writing Metric Instructions
- Be explicit about what "good" and "bad" look like — include examples of phrases or behaviors
- For conditional metrics, describe the N/A condition clearly so the judge doesn't score 0 when it should skip
- Keep instructions under ~500 words to fit within context alongside prompt and response
- Test with a few examples before running a full eval job
Step 4: AWS Infrastructure
S3 Bucket
bash
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="my-eval-${ACCOUNT_ID}-${REGION}"
# us-east-1 does not accept LocationConstraint
if [ "${REGION}" = "us-east-1" ]; then
aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}"
else
aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}" \
--create-bucket-configuration LocationConstraint="${REGION}"
fiUpload the dataset:
bash
aws s3 cp datasets/collected-responses.jsonl \
"s3://${BUCKET_NAME}/datasets/collected-responses.jsonl"IAM Role
Trust policy (must include condition — Bedrock rejects the role without it):
aws:SourceAccountjson
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "Service": "bedrock.amazonaws.com" },
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"aws:SourceAccount": "YOUR_ACCOUNT_ID"
}
}
}
]
}Permissions policy:
json
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3DatasetRead",
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::YOUR_BUCKET",
"arn:aws:s3:::YOUR_BUCKET/datasets/*"
]
},
{
"Sid": "S3ResultsWrite",
"Effect": "Allow",
"Action": ["s3:PutObject", "s3:GetObject"],
"Resource": ["arn:aws:s3:::YOUR_BUCKET/results/*"]
},
{
"Sid": "BedrockModelInvoke",
"Effect": "Allow",
"Action": ["bedrock:InvokeModel"],
"Resource": [
"arn:aws:bedrock:REGION::foundation-model/EVALUATOR_MODEL_ID"
]
}
]
}Replace , , and with actual values.
YOUR_BUCKETREGIONEVALUATOR_MODEL_IDCreate the role:
bash
ROLE_NAME="BedrockEvalRole"
ROLE_ARN=$(aws iam create-role \
--role-name "${ROLE_NAME}" \
--assume-role-policy-document file://trust-policy.json \
--description "Allows Bedrock to run evaluation jobs" \
--query "Role.Arn" --output text)
aws iam put-role-policy \
--role-name "${ROLE_NAME}" \
--policy-name "BedrockEvalPolicy" \
--policy-document file://permissions-policy.jsonStep 5: Configure and Run Eval Job
eval-config.json
json
{
"automated": {
"datasetMetricConfigs": [
{
"taskType": "General",
"dataset": {
"name": "my-eval-dataset",
"datasetLocation": {
"s3Uri": "s3://YOUR_BUCKET/datasets/collected-responses.jsonl"
}
},
"metricNames": [
"Builtin.Helpfulness",
"Builtin.FollowingInstructions",
"Builtin.ProfessionalStyleAndTone",
"Builtin.Relevance",
"Builtin.Completeness",
"Builtin.Correctness",
"my_custom_metric_1",
"my_custom_metric_2"
]
}
],
"evaluatorModelConfig": {
"bedrockEvaluatorModels": [
{ "modelIdentifier": "EVALUATOR_MODEL_ID" }
]
},
"customMetricConfig": {
"customMetrics": [
{
"customMetricDefinition": {
"metricName": "my_custom_metric_1",
"instructions": "... {{prompt}} ... {{prediction}} ...",
"ratingScale": [
{ "definition": "Poor", "value": { "floatValue": 0 } },
{ "definition": "Good", "value": { "floatValue": 1 } }
]
}
}
],
"evaluatorModelConfig": {
"bedrockEvaluatorModels": [
{ "modelIdentifier": "EVALUATOR_MODEL_ID" }
]
}
}
}
}Critical structure notes:
- must be
taskType(not "Generation" or any other value)"General" - Custom metric names must appear in both array AND
metricNamesarraycustomMetrics - appears twice: once at the top level (for built-in metrics) and once inside
evaluatorModelConfig(for custom metrics) — both must specify the same evaluator modelcustomMetricConfig - must be the exact model ID string matching across all configs
modelIdentifier
inference-config.json
For pre-computed inference, this tells Bedrock that responses are already collected:
json
{
"models": [
{
"precomputedInferenceSource": {
"inferenceSourceIdentifier": "my-app-v1"
}
}
]
}The must match the in your JSONL dataset's .
inferenceSourceIdentifiermodelIdentifiermodelResponsesRunning the Job
bash
aws bedrock create-evaluation-job \
--job-name "my-eval-$(date +%Y%m%d-%H%M)" \
--role-arn "${ROLE_ARN}" \
--evaluation-config file://eval-config.json \
--inference-config file://inference-config.json \
--output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
--region us-east-1CLI notes:
- Required params: ,
--job-name,--role-arn,--evaluation-config,--inference-config--output-data-config - Optional: (e.g.,
--application-type)ModelEvaluation - constraint:
--job-name— lowercase + hyphens only, max 63 chars. Must be unique (use timestamps).[a-z0-9](-*[a-z0-9]){0,62} - and
--evaluation-configare document types — must use--inference-configor inline JSON, no shorthand syntaxfile:// - is a structure — supports both inline JSON and shorthand (
--output-data-config)s3Uri=string
Monitoring
bash
# List evaluation jobs (with optional filters)
aws bedrock list-evaluation-jobs --region us-east-1
aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1
aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1
# Get details for a specific job
aws bedrock get-evaluation-job \
--job-identifier "JOB_ARN" \
--region us-east-1
# Cancel a running job
aws bedrock stop-evaluation-job \
--job-identifier "JOB_ARN" \
--region us-east-1Job statuses: , , , , ,
InProgressCompletedFailedStoppingStoppedDeletingJobs typically take 5-15 minutes for 30-50 entry datasets. If a job fails, check in the job details.
failureMessagesStep 6: Parse Results
S3 Output Directory Structure
Bedrock writes results to a deeply nested path:
s3://YOUR_BUCKET/results/
└── <job-name>/
└── <job-name>/
├── amazon-bedrock-evaluations-permission-check ← empty sentinel
└── <random-id>/
├── custom_metrics/ ← metric definitions (NOT results)
└── models/
└── <model-identifier>/
└── taskTypes/General/datasets/<dataset-name>/
└── <uuid>_output.jsonl ← actual resultsThe job name is repeated twice. The random ID changes every run. Use — do not construct paths manually.
aws s3 syncDownload Results
bash
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1Result JSONL Format
Each line:
json
{
"automatedEvaluationResult": {
"scores": [
{
"metricName": "Builtin.Helpfulness",
"result": 0.6667,
"evaluatorDetails": [
{
"modelIdentifier": "amazon.nova-pro-v1:0",
"explanation": "The response provides useful information..."
}
]
},
{
"metricName": "confirmation_check",
"result": null,
"evaluatorDetails": [
{
"modelIdentifier": "amazon.nova-pro-v1:0",
"explanation": "This conversation does not involve any consequential action..."
}
]
}
]
},
"inputRecord": {
"prompt": "hello",
"referenceResponse": "",
"modelResponses": [
{ "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
]
}
}- is a number (score) or
result(N/A)null - contains the judge's written reasoning
evaluatorDetails[0].explanation
Parsing and Aggregation
typescript
interface PromptResult {
prompt: string;
category: string;
modelResponse: string;
scores: Record<string, {
score: string;
reasoning?: string;
rawScore?: number;
}>;
}
for (const s of entry.automatedEvaluationResult.scores) {
scores[s.metricName] = {
score: s.result === null ? "N/A" : String(s.result),
reasoning: s.evaluatorDetails?.[0]?.explanation,
rawScore: typeof s.result === "number" ? s.result : undefined,
};
}Aggregation approach:
- Overall averages per metric — exclude N/A entries
- Per-category breakdown — group by category field, compute averages within each
- Low-score alerts — flag entries below threshold (built-in < 0.5, custom <= 0)
Low-score alert format:
[Builtin.Relevance] score=0.50 | "hello..."
Reason: The response does not directly address the greeting...
[confirmation_check] score=0.00 | "User: proceed with X..."
Reason: The assistant executed the action without asking for confirmation...Step 7: Eval-Fix-Reeval Loop
Common Fixes
| Finding | Fix |
|---|---|
| Low brevity scores | Add hard constraint: "Respond in no more than 3 sentences." |
| Low confirmation_check | Add: "Before executing, summarize details and ask for confirmation." |
| Low missing_info_followup | Add: "If any required field is missing, ask for it. Do not assume." |
| Low tone on negative outcomes | Add empathy instructions for bad-news scenarios |
| Low Completeness on simple prompts | Metric/data issue — add |
Metric Refinement
- High N/A rates (>60%) — metric too narrowly scoped. Split dataset or adjust scope.
- All-high scores — instructions too lenient. Add specific failure criteria.
- Inconsistent scoring — instructions ambiguous. Add concrete examples per rating level.
Run Comparison
Run 1 (baseline): response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes): response_brevity avg=0.85, custom_tone avg=0.90Track scores over time. The pipeline's value comes from repeated measurement.
Gotchas
-
must be
taskType— not "Generation" or any other value. The job fails silently with other values."General" -
Custom metric names in BOTH places — must appear inarray AND
metricNamesarray. Missing fromcustomMetrics= silently ignored. Missing frommetricNames= job fails.customMetrics -
result means N/A, not 0 — when the judge determines a metric doesn't apply, Bedrock records
null:nulltypescript// WRONG — treats N/A as 0 const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length; // RIGHT — excludes N/A from average const numericScores = scores.filter((s): s is number => s !== null); const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length; -
appears twice — once at top level (built-in metrics), once inside
evaluatorModelConfig(custom metrics). Omitting either causes those metrics to fail.customMetricConfig -
must match exactly — the string in JSONL
modelIdentifiermust be character-for-character identical tomodelResponsesin inference-config.json. Mismatch = model mapping error.inferenceSourceIdentifier -
AWS CLI 2.33+ required — older versions silently dropand
customMetricConfig. Job creation succeeds but the job fails. Always checkprecomputedInferenceSource.aws --version -
Job names: lowercase + hyphens, max 63 chars — pattern:. Must be unique across all jobs. Use timestamps:
[a-z0-9](-*[a-z0-9]){0,62}.--job-name "my-eval-$(date +%Y%m%d-%H%M)" -
S3 output is deeply nested —. Use
<prefix>/<job-name>/<job-name>/<random-id>/models/...and search foraws s3 sync. Do not construct paths manually._output.jsonl -
improves Correctness/Completeness — empty string is valid, but providing reference responses gives the judge a baseline for comparison.
referenceResponse -
tag leakage (model-specific) — some models (e.g., Amazon Nova Lite) may leak
<thinking>blocks into responses. If present, strip before writing JSONL:<thinking>...</thinking>typescriptconst clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim(); -
us-east-1 S3 bucket creation — do NOT passfor
LocationConstraint. Other regions require it.us-east-1
Cost Estimation
Formula:
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_priceExample: 30 prompts, 10 metrics, Nova Pro judge:
- Response collection (Nova Lite): ~$0.02
- Evaluation job (Nova Pro): ~$0.58
- Total per run: ~$0.61
Scaling: Cost is linear with prompts and metrics. 100 prompts x 10 metrics ≈ $5. Judge cost dominates at ~95%. Adding 1 custom metric adds ~$0.06/run (30 prompts, Nova Pro).
References
- Model Evaluation Metrics — all 11 built-in metrics
- Custom Metrics Prompt Formats — , template variables, constraints
metricName - Prompt Datasets for Judge Evaluation — dataset JSONL format
- CreateEvaluationJob API Reference — full API spec
- AWS CLI create-evaluation-job — CLI command reference
- Amazon Bedrock Pricing — model pricing