<!-- TIER: 1 | ENTRY POINT -->
<!-- This is the starting document - read this FIRST -->
<!-- Pattern: Follows sf-testing for agentic test-fix loops -->
<!-- v2.0.0: Dual-track workflow with multi-turn API testing as primary -->
sf-ai-agentforce-testing: Agentforce Test Execution & Coverage Analysis
Expert testing engineer specializing in Agentforce agent testing via dual-track workflow: multi-turn Agent Runtime API testing (primary) and CLI Testing Center (secondary). Execute multi-turn conversations, analyze topic/action/context coverage, and automatically fix issues via sf-ai-agentscript.
Core Responsibilities
- Multi-Turn API Testing (PRIMARY): Execute multi-turn conversations via Agent Runtime API
- CLI Test Execution (SECONDARY): Run single-utterance tests via sf agent test run
- Test Spec / Scenario Generation: Create YAML test specifications and multi-turn scenarios
- Coverage Analysis: Track topic, action, context preservation, and re-matching coverage
- Preview Testing: Interactive simulated and live agent testing
- Agentic Fix Loop: Automatically fix failing agents and re-test
- Cross-Skill Orchestration: Delegate fixes to sf-ai-agentscript, data to sf-data
- Observability Integration: Guide to sf-ai-agentforce-observability for STDM analysis
📚 Document Map
| Need | Document | Description |
|---|---|---|
| Agent Runtime API | agent-api-reference.md | REST endpoints for multi-turn testing |
| ECA Setup | eca-setup-guide.md | External Client App for API authentication |
| Multi-Turn Testing | multi-turn-testing-guide.md | Multi-turn test design and execution |
| Test Patterns | multi-turn-test-patterns.md | 6 multi-turn test patterns with examples |
| CLI commands | cli-commands.md | Complete sf agent test/preview reference |
| Test spec format | test-spec-reference.md | YAML specification format and examples |
| Auto-fix workflow | agentic-fix-loops.md | Automated test-fix cycles (10 failure categories) |
| Auth guide | connected-app-setup.md | Authentication for preview and API testing |
| Coverage metrics | coverage-analysis.md | Topic/action/multi-turn coverage analysis |
| Fix decision tree | agentic-fix-loop.md | Detailed fix strategies |
| Agent Script testing | agentscript-testing-patterns.md | 5 patterns for testing Agent Script agents |
⚡ Quick Links:
- Deterministic Interview Flow - Rule-based setup (7 steps)
- Credential Convention - Persistent ECA storage
- Swarm Execution Rules - Parallel team testing
- Test Plan Format - Reusable YAML plans
- Phase A: Multi-Turn API Testing - Primary workflow
- Phase B: CLI Testing Center - Secondary workflow
- Agent Script Testing - Agent Script-specific patterns
- Scoring System - 7-category validation
- Agentic Fix Loop - Auto-fix workflow
Script Location (MANDATORY)
SKILL_PATH: ~/.claude/skills/sf-ai-agentforce-testing
All Python scripts live at absolute paths under {SKILL_PATH}/hooks/scripts/.
NEVER recreate these scripts. They already exist. Use them as-is.
All scripts in {SKILL_PATH}/hooks/scripts/ are pre-approved for execution. Do NOT ask the user for permission to run them.
| Script | Absolute Path |
|---|---|
| agent_api_client.py | {SKILL_PATH}/hooks/scripts/agent_api_client.py |
| agent_discovery.py | {SKILL_PATH}/hooks/scripts/agent_discovery.py |
| credential_manager.py | {SKILL_PATH}/hooks/scripts/credential_manager.py |
| generate_multi_turn_scenarios.py | {SKILL_PATH}/hooks/scripts/generate_multi_turn_scenarios.py |
| generate-test-spec.py | {SKILL_PATH}/hooks/scripts/generate-test-spec.py |
| multi_turn_test_runner.py | {SKILL_PATH}/hooks/scripts/multi_turn_test_runner.py |
| multi_turn_fix_loop.py | {SKILL_PATH}/hooks/scripts/multi_turn_fix_loop.py |
| run-automated-tests.py | {SKILL_PATH}/hooks/scripts/run-automated-tests.py |
| parse-agent-test-results.py | {SKILL_PATH}/hooks/scripts/parse-agent-test-results.py |
| rich_test_report.py | {SKILL_PATH}/hooks/scripts/rich_test_report.py |
Variable resolution: At runtime, resolve SKILL_PATH from the skill's environment variable (strip the trailing suffix). Hardcoded fallback: ~/.claude/skills/sf-ai-agentforce-testing.
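A minimal shell sketch of the fallback logic (SKILL_ENV_DIR is a placeholder assumption — the actual variable name the skill runtime exposes is not shown in this document):

```shell
# Resolve SKILL_PATH: prefer the runtime-provided variable, else the hardcoded fallback.
# SKILL_ENV_DIR is a placeholder name, not the real runtime variable.
SKILL_PATH="${SKILL_ENV_DIR:-$HOME/.claude/skills/sf-ai-agentforce-testing}"

# Warn (but continue) if the expected scripts directory is missing.
if [ ! -d "$SKILL_PATH/hooks/scripts" ]; then
  echo "WARNING: scripts not found under $SKILL_PATH/hooks/scripts" >&2
fi
echo "Using SKILL_PATH=$SKILL_PATH"
```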
⚠️ CRITICAL: Orchestration Order
sf-metadata → sf-apex → sf-flow → sf-deploy → sf-ai-agentscript → sf-deploy → sf-ai-agentforce-testing (you are here)
Why testing is LAST:
- Agent must be published before running automated tests
- Agent must be activated for preview mode and API access
- All dependencies (Flows, Apex) must be deployed first
- Test data (via sf-data) should exist before testing actions
⚠️ MANDATORY Delegation:
- Fixes: ALWAYS use Skill(skill="sf-ai-agentscript") for agent script fixes
- Test Data: Use Skill(skill="sf-data") for action test data
- OAuth Setup (multi-turn API testing only): Use Skill(skill="sf-connected-apps") for ECA — NOT needed for preview or CLI tests
- Observability: Use Skill(skill="sf-ai-agentforce-observability") for STDM analysis of test sessions
Architecture: Dual-Track Testing Workflow
```
Deterministic Interview (I-1 → I-7)
│ Agent Name → Org Alias → Metadata → Credentials → Scenarios → Partition → Confirm
│ (skip if test-plan-{agent}.yaml provided)
│
▼
Phase 0: Prerequisites & Agent Discovery
│
├──► Phase A: Multi-Turn API Testing (PRIMARY — requires ECA)
│      A1: ECA Credential Setup (via credential_manager.py)
│      A2: Agent Discovery & Metadata Retrieval
│      A3: Test Scenario Planning (generate_multi_turn_scenarios.py --categorized)
│      A4: Multi-Turn Execution (Agent Runtime API)
│          ├─ Sequential: single multi_turn_test_runner.py process
│          └─ Swarm: TeamCreate → N workers (--worker-id N)
│      A5: Results & Scoring (rich Unicode output)
│
└──► Phase B: CLI Testing Center (SECONDARY)
       B1: Test Spec Creation
       B2: Test Execution (sf agent test run)
       B3: Results Analysis
│
Phase C: Agentic Fix Loop (shared)
Phase D: Coverage Improvement (shared)
Phase E: Observability Integration (STDM analysis)
```
When to use which track:
| Condition | Use |
|---|---|
| Agent Testing Center NOT available | Phase A only |
| Need multi-turn conversation testing | Phase A |
| Need topic re-matching validation | Phase A |
| Need context preservation testing | Phase A |
| Agent Testing Center IS available + single-utterance tests | Phase B |
| CI/CD pipeline integration | Phase A (Python scripts) or Phase B (sf CLI) |
| Quick smoke test | Phase B |
| Quick manual validation (no ECA setup) | Preview mode (no Phase A/B needed) |
| No ECA available | Preview or Phase B (CLI tests) |
Phase 0: Prerequisites & Agent Discovery
Step 1: Gather User Information
Use AskUserQuestion to gather:
```yaml
AskUserQuestion:
  questions:
    - question: "Which agent do you want to test?"
      header: "Agent"
      options:
        - label: "Let me discover agents in the org"
          description: "Query BotDefinition to find available agents"
        - label: "I know the agent name"
          description: "Provide agent name/API name directly"
    - question: "What is your target org alias?"
      header: "Org"
      options:
        - label: "vivint-DevInt"
          description: "Development integration org"
        - label: "Other"
          description: "Specify a different org alias"
    - question: "What type of testing do you need?"
      header: "Test Type"
      options:
        - label: "Multi-turn API testing (Recommended)"
          description: "Full conversation testing via Agent Runtime API — tests topic switching, context retention, escalation cascades"
        - label: "CLI single-utterance testing"
          description: "Traditional sf agent test run — requires Agent Testing Center feature"
        - label: "Both"
          description: "Run both multi-turn and CLI tests for comprehensive coverage"
```
Step 2: Agent Discovery
```bash
# Auto-discover active agents in the org
sf data query --use-tooling-api \
  --query "SELECT Id, DeveloperName, MasterLabel FROM BotDefinition WHERE IsActive=true" \
  --result-format json --target-org [alias]
```
Step 3: Agent Metadata Retrieval
```bash
# Retrieve agent configuration (topics, actions, instructions)
sf project retrieve start \
  --metadata "GenAiPlannerBundle:[AgentDeveloperName]" \
  --output-dir retrieve-temp --target-org [alias]
```
Claude reads the GenAiPlannerBundle to understand:
- All topics and their classificationDescription values
- All actions and their configurations
- System instructions and guardrails
- Escalation paths
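The bullet points above can be sketched as a parsing step. This is an illustrative sketch only — the element names in the sample XML are assumptions, not the real GenAiPlannerBundle schema; inspect your retrieved bundle for the actual structure:

```python
import xml.etree.ElementTree as ET

# Illustrative bundle fragment — the real GenAiPlannerBundle XML differs.
SAMPLE = """
<GenAiPlannerBundle>
  <topics>
    <name>order_status</name>
    <classificationDescription>Handles order status inquiries</classificationDescription>
  </topics>
  <topics>
    <name>billing</name>
    <classificationDescription>Handles billing questions</classificationDescription>
  </topics>
</GenAiPlannerBundle>
"""

def extract_topics(xml_text):
    """Map topic name -> classificationDescription (assumed element names)."""
    root = ET.fromstring(xml_text)
    topics = {}
    for t in root.findall("topics"):
        name = t.findtext("name")
        if name:
            topics[name] = t.findtext("classificationDescription")
    return topics

print(extract_topics(SAMPLE))
```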
Step 4: Check Agent Testing Center Availability
```bash
# This determines if Phase B is available
sf agent test list --target-org [alias]
# If error: "INVALID_TYPE: Cannot use: AiEvaluationDefinition"
#   → Agent Testing Center NOT enabled → Phase A only
# If success: → Both Phase A and Phase B available
```
Step 5: Prerequisites Checklist
| Check | Command | Why |
|---|---|---|
| Agent exists | sf data query --use-tooling-api --query "SELECT Id FROM BotDefinition WHERE DeveloperName='X'" | Can't test non-existent agent |
| Agent published | sf agent validate authoring-bundle --api-name X | Must be published to test |
| Agent activated | Check activation status | Required for API access |
| Dependencies deployed | Flows and Apex in org | Actions will fail without them |
| ECA configured (Phase A only) | Token request test | Multi-turn API testing only. NOT needed for preview or CLI tests |
| Agent Testing Center (Phase B) | sf agent test list --target-org [alias] | Required for CLI testing |
Deterministic Multi-Turn Interview Flow
When the testing skill is invoked, follow these interview steps in order. Each step has deterministic rules with fallbacks. The goal: gather all inputs needed to execute multi-turn tests without ambiguity.
Skip the interview if the user provides a test-plan-{agent}.yaml file — load it directly and jump to Swarm Execution Rules.
| Step | Rule | Fallback |
|---|---|---|
| I-0: Skill Path | Resolve SKILL_PATH from the skill's environment variable (strip the trailing suffix). If unset → hardcoded ~/.claude/skills/sf-ai-agentforce-testing. Verify directory exists. All subsequent script references use {SKILL_PATH}/hooks/scripts/. | Hardcoded path |
| I-1: Agent Name | User provided → use it. Else run python3 {SKILL_PATH}/hooks/scripts/agent_discovery.py local --project-dir . from the project root. Multiple agents → present numbered list via AskUserQuestion. None found → ask user. | AskUserQuestion |
| I-2: Org Alias | User provided → use it. Else read the project's default org config. Else ask user. Note: org aliases are case-sensitive (e.g., vivint-DevInt ≠ Vivint-DevInt). | AskUserQuestion |
| I-3: Metadata | ALWAYS run python3 {SKILL_PATH}/hooks/scripts/agent_discovery.py live --target-org {org} --agent-name {agent}. Extract topics, actions, type, agent_id. This step is mandatory — never skip. | Required (fail if no agent found) |
| I-4: Credentials | Skip if test type is CLI-only or Preview-only — standard org auth suffices (no ECA needed). For multi-turn API testing: run python3 {SKILL_PATH}/hooks/scripts/credential_manager.py discover --org-alias {org}. Found ECA → validate. Valid → use. Invalid → ask user for new credentials → save → re-validate. No ECAs found → ask user → offer to save via credential_manager.py save. | AskUserQuestion for credentials (multi-turn API only) |
| I-4b: Session Variables | ALWAYS ask. Extract known context variables from agent metadata (entries where mappingType=ContextVariable in the GenAiPlannerBundle). WARN if an authentication topic exists — the agent likely requires $Context.RoutableId and $Context.CaseId to authenticate the customer. Present discovered variables and ask user for values. | AskUserQuestion |
| I-5: Scenarios | Pipe discovery metadata to python3 {SKILL_PATH}/hooks/scripts/generate_multi_turn_scenarios.py --metadata - --output {dir} --categorized --cross-topic. Present summary: N scenarios across M categories. | Required |
| I-6: Partition | Ask user how to split work across workers. | AskUserQuestion (see below) |
| I-7: Confirm | Present test plan summary. Save as test-plan-{agent_name}.yaml using the template. User confirms to proceed. | AskUserQuestion |
I-4b: Session Variables
Context variables are MANDATORY for agents that use authentication flows (e.g., an identity-verification topic). Without them, the agent's authentication flow fails and the session ends on Turn 1.
Extract context variables from agent metadata:
- Run python3 {SKILL_PATH}/hooks/scripts/agent_discovery.py local --project-dir {project} and look for mappingType=ContextVariable in the GenAiPlannerBundle output.
- Common variables: $Context.RoutableId (MessagingSession ID), $Context.CaseId (Case record ID).
```yaml
AskUserQuestion:
  question: "The agent requires context variables for testing. Which values should we use?"
  header: "Variables"
  options:
    - label: "Use test record IDs (Recommended)"
      description: "I'll provide real MessagingSession and Case IDs from the org for testing"
    - label: "Skip variables"
      description: "Run without context variables — WARNING: authentication topics will likely fail"
    - label: "Auto-discover from org"
      description: "Query the org for recent MessagingSession and Case records to use as test values"
  multiSelect: false
```
⚠️ WARNING: If the agent has an identity-verification topic that runs a verification flow, you MUST provide $Context.RoutableId and $Context.CaseId. Without them, the verification flow fails → agent escalates → session ends on Turn 1.
I-6: Partition Strategy
DEFAULT RULE: If total generated scenarios > 4, default to "2 workers by category". If ≤ 4, default to "Sequential". ALWAYS default — only change if the user explicitly requests otherwise.
```yaml
AskUserQuestion:
  question: "How should test scenarios be distributed across workers?"
  header: "Partition"
  options:
    - label: "2 workers by category (Recommended)"
      description: "Group test patterns into 2 balanced buckets — best balance of parallelism and readability. DEFAULT when > 4 scenarios."
    - label: "Sequential"
      description: "Run all scenarios in a single process — no team needed, simpler but slower. DEFAULT when ≤ 4 scenarios."
  multiSelect: false
```
I-7: Confirmation Summary Format
Present this to the user before execution:
```
📋 TEST PLAN SUMMARY
════════════════════════════════════════════════════════════════
Agent:       {agent_name} ({agent_id})
Org:         {org_alias}
Credentials: ~/.sfagent/{org_alias}/{eca_name}/credentials.env ✅
Scenarios:   {total_count} across {category_count} categories
Partition:   {strategy} with {worker_count} worker(s)
Variables:   {var_count} session variable(s)

📂 Scenario Breakdown:
  topic_routing:        {n} scenarios
  context_preservation: {n} scenarios
  escalation_flows:     {n} scenarios
  guardrail_testing:    {n} scenarios
  action_chain:         {n} scenarios
  error_recovery:       {n} scenarios
  cross_topic_switch:   {n} scenarios

💾 Saved: test-plan-{agent_name}.yaml
════════════════════════════════════════════════════════════════
Proceed? [Confirm / Edit / Cancel]
```
⚡ MANDATORY: Phase A4 Execution Protocol
This protocol is NON-NEGOTIABLE. After I-7 confirmation, you MUST follow EXACTLY these steps based on the partition strategy. DO NOT improvise, skip steps, or run sequentially when the plan says swarm.
Path A: Sequential Execution (worker_count == 1)
Run a single multi_turn_test_runner.py process. No team needed.
```bash
set -a; source ~/.sfagent/{org_alias}/{eca_name}/credentials.env; set +a
python3 {SKILL_PATH}/hooks/scripts/multi_turn_test_runner.py \
  --scenarios {scenario_file} \
  --agent-id {agent_id} \
  --var '$Context.RoutableId={routable_id}' \
  --var '$Context.CaseId={case_id}' \
  --output {working_dir}/results.json \
  --report-file {working_dir}/report.ansi \
  --verbose
```
Path B: Swarm Execution (worker_count == 2) — MANDATORY CHECKLIST
YOU MUST EXECUTE EVERY STEP BELOW IN ORDER. DO NOT SKIP ANY STEP.
☐ Step 1: Split scenarios into 2 partitions
Group the generated category YAML files into 2 balanced buckets by total scenario count. Write {working_dir}/scenarios-part1.yaml and {working_dir}/scenarios-part2.yaml. Each partition file must be valid YAML with the same top-level key containing its subset of scenarios.
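The balancing logic of Step 1 can be sketched as a greedy split. This is a sketch of the bucketing only (the actual partition files are YAML written per the step above); the category names and counts are illustrative:

```python
# Greedy balanced split: assign each category to the currently lighter bucket,
# largest categories first, so totals stay close.
def split_categories(category_counts, workers=2):
    buckets = [dict() for _ in range(workers)]
    totals = [0] * workers
    for cat, n in sorted(category_counts.items(), key=lambda kv: -kv[1]):
        i = totals.index(min(totals))  # lighter bucket gets the next category
        buckets[i][cat] = n
        totals[i] += n
    return buckets

counts = {"topic_routing": 4, "context_preservation": 4,
          "escalation_flows": 4, "guardrail_testing": 2}
part1, part2 = split_categories(counts)
print(part1, part2)
```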
☐ Step 2: Create team
TeamCreate(team_name="sf-test-{agent_name}")
☐ Step 3: Create 2 tasks (one per partition)
TaskCreate(subject="Run partition 1", description="Execute scenarios-part1.yaml")
TaskCreate(subject="Run partition 2", description="Execute scenarios-part2.yaml")
☐ Step 4: Spawn 2 workers IN PARALLEL (single message with 2 Task tool calls)
Use the Worker Agent Prompt Template below. CRITICAL: Both Task calls MUST be in the SAME message.
Task(subagent_type="general-purpose", team_name="sf-test-{agent_name}", name="worker-1", prompt=WORKER_PROMPT_1)
Task(subagent_type="general-purpose", team_name="sf-test-{agent_name}", name="worker-2", prompt=WORKER_PROMPT_2)
☐ Step 5: Wait for both workers to report (they SendMessage when done)
Do NOT proceed until both workers have sent their results via SendMessage.
☐ Step 6: Aggregate results
```bash
python3 {SKILL_PATH}/hooks/scripts/rich_test_report.py \
  --results {working_dir}/worker-1-results.json {working_dir}/worker-2-results.json
```
☐ Step 7: Present unified report to the user
☐ Step 8: Offer fix loop if any failures detected
☐ Step 9: Shutdown workers
SendMessage(type="shutdown_request", recipient="worker-1")
SendMessage(type="shutdown_request", recipient="worker-2")
☐ Step 10: Clean up
Credential Convention (~/.sfagent/)
Persistent ECA credential storage managed by hooks/scripts/credential_manager.py.
Directory Structure
```
~/.sfagent/
├── .gitignore              ("*" — auto-created, prevents accidental commits)
├── {Org-Alias}/            (org alias — case-sensitive, e.g. Vivint-DevInt)
│   └── {ECA-Name}/         (ECA app name — use `discover` to find actual name)
│       └── credentials.env
└── Other-Org/
    └── My_ECA/
        └── credentials.env
```
File Format
```env
# credentials.env — managed by credential_manager.py
# 'export' prefix allows direct `source credentials.env` in shell
export SF_MY_DOMAIN=yourdomain.my.salesforce.com
export SF_CONSUMER_KEY=3MVG9...
export SF_CONSUMER_SECRET=ABC123...
```
Security Rules
| Rule | Implementation |
|---|---|
| Directory permissions | 700 (owner only) |
| File permissions | 600 (owner only) |
| Git protection | .gitignore containing "*" auto-created in ~/.sfagent/ |
| Secret display | NEVER show full secrets — mask showing only the first 3 and last 3 characters |
| Credential passing | Export as env vars for subprocesses, never write to temp files |
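The masking rule above can be sketched as follows. This is an illustrative helper, not the actual credential_manager.py implementation (whose exact mask format may differ):

```python
def mask_secret(secret):
    """Show only the first 3 and last 3 characters of a secret."""
    if len(secret) <= 6:
        return "*" * len(secret)  # too short to reveal anything safely
    return f"{secret[:3]}...{secret[-3:]}"

print(mask_secret("3MVG9abcdefSECRET123"))  # → 3MV...123
```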
CLI Reference
```bash
# Discover orgs and ECAs
python3 {SKILL_PATH}/hooks/scripts/credential_manager.py discover
python3 {SKILL_PATH}/hooks/scripts/credential_manager.py discover --org-alias Vivint-DevInt

# Load credentials (secrets masked in output)
python3 {SKILL_PATH}/hooks/scripts/credential_manager.py load --org-alias {org} --eca-name {eca}

# Save new credentials
python3 {SKILL_PATH}/hooks/scripts/credential_manager.py save \
  --org-alias {org} --eca-name {eca} \
  --domain yourdomain.my.salesforce.com \
  --consumer-key 3MVG9... --consumer-secret ABC123...

# Validate OAuth flow
python3 {SKILL_PATH}/hooks/scripts/credential_manager.py validate --org-alias {org} --eca-name {eca}

# Source credentials for shell use (set -a auto-exports all vars)
set -a; source ~/.sfagent/{org}/{eca}/credentials.env; set +a
```
Swarm Execution Rules (Native Claude Code Teams)
When worker_count > 1 in the test plan, use Claude Code's native team orchestration for parallel test execution. When worker_count == 1, run sequentially without creating a team.
Team Lead Rules (Claude Code)
RULE: Create team via TeamCreate("sf-test-{agent_name}")
RULE: Create one TaskCreate per partition (category or count split)
RULE: Spawn one Task(subagent_type="general-purpose") per worker
RULE: Each worker gets credentials as env vars in its prompt (NEVER in files)
RULE: Wait for all workers to report via SendMessage
RULE: After all workers complete, run rich_test_report.py to render unified results
RULE: Present unified beautiful report aggregating all worker results
RULE: Offer fix loop if any failures detected
RULE: Shutdown all workers via SendMessage(type="shutdown_request")
RULE: Clean up via TeamDelete when done
RULE: NEVER spawn more than 2 workers.
RULE: When categories > 2, group into 2 balanced buckets.
RULE: Queue remaining work to existing workers after they complete first batch.
Worker Agent Prompt Template
Each worker receives this prompt (team lead fills in the variables):
```
You are a multi-turn test worker for Agentforce agent testing.

YOUR TASK:
1. Claim your task via TaskUpdate(status="in_progress", owner=your_name)
2. Load credentials and run the test:
   set -a; source ~/.sfagent/{org_alias}/{eca_name}/credentials.env; set +a
   python3 {skill_path}/hooks/scripts/multi_turn_test_runner.py \
     --scenarios {scenario_file} \
     --agent-id {agent_id} \
     --var '$Context.RoutableId={routable_id}' \
     --var '$Context.CaseId={case_id}' \
     --output {working_dir}/worker-{N}-results.json \
     --report-file {working_dir}/worker-{N}-report.ansi \
     --worker-id {N} --verbose
3. IMPORTANT — RENDER RICH TUI REPORT IN YOUR PANE:
   After the test runner completes, render the results visually so they appear
   in your conversation pane (the tmux panel the user can see):
   python3 -c "
   import sys, json
   sys.path.insert(0, '{skill_path}/hooks/scripts')
   from multi_turn_test_runner import format_results_rich
   with open('{working_dir}/worker-{N}-results.json') as f:
       results = json.load(f)
   print(format_results_rich(results, worker_id={N}, scenario_file='{scenario_file}'))
   "
   Then copy-paste that output into your conversation as a text message so it
   renders in your Claude Code pane for the user to see.
4. Analyze: which scenarios passed, which failed, and WHY
5. SendMessage to team lead with:
   - Pass/fail summary (counts + percentages)
   - For each failure: scenario name, turn number, what went wrong, suggested fix
   - Total execution time
   - Any patterns noticed (e.g., "all context_preservation tests failed — may be a systemic issue")
6. Mark your task as completed via TaskUpdate

IMPORTANT:
- If a test fails with an auth error (exit code 2), report it immediately — do NOT retry
- If a test fails with scenario failures (exit code 1), analyze and report all failures
- You CAN communicate with other workers if you discover related issues
- The --report-file flag writes a persistent ANSI report file viewable with `cat` or `bat`
```
Partition Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| By category | One worker per test pattern (topic_routing, context, etc.) | Most runs — natural isolation |
| By count | Split N scenarios evenly across W workers | Large scenario counts |
| Sequential | Single process, no team | Quick runs, debugging |
Team Lead Aggregation
After all workers report, the team lead:
- Aggregates all worker result JSON files via rich_test_report.py:
```bash
python3 {SKILL_PATH}/hooks/scripts/rich_test_report.py \
  --results /tmp/sf-test-{session}/worker-*-results.json
```
- Deduplicates any shared failure patterns across workers
- Presents the unified Rich report (colored Panels, Tables, Tree) to the user
- Calculates aggregate scoring across the 7 categories
- Offers fix loop: if failures exist, asks the user whether to auto-fix via Skill(skill="sf-ai-agentscript")
- Shuts down all workers and deletes the team
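The aggregation step can be sketched as below. The result schema shown ({"scenarios": [{"name": ..., "passed": ...}]}) is an assumption for illustration — the real worker-N-results.json layout produced by multi_turn_test_runner.py may differ:

```python
import json

def aggregate(result_json_strings):
    """Combine worker result payloads into one pass/fail summary.

    Assumes each payload looks like {"scenarios": [{"name": str, "passed": bool}]}.
    """
    total = passed = 0
    failures = []
    for content in result_json_strings:
        data = json.loads(content)
        for s in data.get("scenarios", []):
            total += 1
            if s.get("passed"):
                passed += 1
            else:
                failures.append(s.get("name"))
    return {"total": total, "passed": passed, "failures": failures}

w1 = json.dumps({"scenarios": [{"name": "topic_switch", "passed": True}]})
w2 = json.dumps({"scenarios": [{"name": "escalation", "passed": False}]})
print(aggregate([w1, w2]))
```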
Test Plan File Format
Test plans (test-plan-{agent}.yaml) capture the full interview output for reuse. See templates/test-plan-template.yaml for the complete schema.
Key Sections
| Section | Purpose |
|---|---|
| Agent info | Agent name, ID, org alias, timestamps |
| Credentials | Path to credentials.env |
| Metadata | Topics, actions, type — populated by agent_discovery.py |
| Scenarios | List of YAML scenario files + pattern filters |
| Partition | Strategy (category/count/sequential) + worker count |
| Variables | Context variables injected into every session |
| Options | Timeout, retry, verbose, rich output settings |
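A sketch of what a saved plan might look like. The key names and values below are illustrative assumptions — templates/test-plan-template.yaml is the authoritative schema:

```yaml
# test-plan-Customer_Support_Agent.yaml — illustrative sketch only
agent:
  name: Customer_Support_Agent
  org_alias: vivint-DevInt
credentials:
  env_file: ~/.sfagent/vivint-DevInt/Agent_API_Testing/credentials.env
scenarios:
  - scenarios-topic-routing.yaml
  - scenarios-context-preservation.yaml
partition:
  strategy: category
  worker_count: 2
variables:
  "$Context.RoutableId": "{routable_id}"
  "$Context.CaseId": "{case_id}"
options:
  verbose: true
```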
Re-Running from a Saved Plan
When a user provides a test plan file, skip the interview entirely:
1. Load test-plan-{agent}.yaml
2. Validate credentials: credential_manager.py validate --org-alias {org} --eca-name {eca}
3. If invalid → ask user to update credentials only (skip other interview steps)
4. Load scenario files from plan
5. Apply partition strategy from plan
6. Execute (team or sequential based on worker_count)
This enables rapid re-runs after fixing agent issues — the user just says "re-run" and the skill picks up the saved plan.
Phase A: Multi-Turn API Testing (PRIMARY)
⚠️ NEVER use raw curl for OAuth token validation. Domains containing -- (e.g., my-org--devint.sandbox.my.salesforce.com) cause shell expansion failures with curl's argument parsing. Use credential_manager.py validate instead.
A1: ECA Credential Setup
Why ECA? Multi-turn API testing uses the Agent Runtime API, which requires OAuth Client Credentials. If you only need interactive testing, use preview mode instead — no ECA needed, just standard org auth (sf CLI v2.121.7+). See connected-app-setup.md.
```yaml
AskUserQuestion:
  question: "Do you have an External Client App (ECA) with Client Credentials flow configured?"
  header: "ECA Setup"
  options:
    - label: "Yes, I have credentials"
      description: "I have Consumer Key, Secret, and My Domain URL ready"
    - label: "No, I need to create one"
      description: "Delegate to sf-connected-apps skill to create ECA"
```
If YES: Collect credentials (kept in conversation context only, NEVER written to files):
- Consumer Key
- Consumer Secret
- My Domain URL (e.g., your-domain.my.salesforce.com)
If NO: Delegate to sf-connected-apps:
Skill(skill="sf-connected-apps", args="Create External Client App with Client Credentials flow for Agent Runtime API testing. Scopes: api, chatbot_api, sfap_api, refresh_token, offline_access. Name: Agent_API_Testing")
Verify credentials work:
```bash
# Validate OAuth credentials via credential_manager.py (handles token request internally)
python3 {SKILL_PATH}/hooks/scripts/credential_manager.py \
  validate --org-alias {org} --eca-name {eca}
```
See ECA Setup Guide for complete instructions.
A2: Agent Discovery & Metadata Retrieval
```bash
# Get agent ID for API calls
AGENT_ID=$(sf data query --use-tooling-api \
  --query "SELECT Id, DeveloperName, MasterLabel FROM BotDefinition WHERE DeveloperName='[AgentName]' AND IsActive=true LIMIT 1" \
  --result-format json --target-org [alias] | jq -r '.result.records[0].Id')

# Retrieve full agent configuration
sf project retrieve start \
  --metadata "GenAiPlannerBundle:[AgentName]" \
  --output-dir retrieve-temp --target-org [alias]
```
Claude reads the GenAiPlannerBundle to understand:
- Topics: Names, classificationDescriptions, instructions
- Actions: Types (flow, apex), triggers, inputs/outputs
- System Instructions: Global rules and guardrails
- Escalation Paths: When and how the agent escalates
This metadata drives automatic test scenario generation in A3.
A3: Test Scenario Planning
```yaml
AskUserQuestion:
  question: "What testing do you need?"
  header: "Scenarios"
  options:
    - label: "Comprehensive coverage (Recommended)"
      description: "All 6 test patterns: topic routing, context preservation, escalation, guardrails, action chaining, variable injection"
    - label: "Topic routing accuracy"
      description: "Test that utterances route to correct topics, including mid-conversation topic switches"
    - label: "Context preservation"
      description: "Test that the agent retains information across turns"
    - label: "Specific bug reproduction"
      description: "Reproduce a known issue with targeted multi-turn scenario"
  multiSelect: true
```
Claude uses the agent metadata from A2 to auto-generate multi-turn scenarios tailored to the specific agent:
- Generates topic switching scenarios based on actual topic names
- Creates context preservation tests using actual action inputs/outputs
- Builds escalation tests based on actual escalation configuration
- Creates guardrail tests based on system instructions
Available templates (see templates/):
| Template | Pattern | Scenarios |
|---|---|---|
| multi-turn-topic-routing.yaml | Topic switching | 4 |
| multi-turn-context-preservation.yaml | Context retention | 4 |
| multi-turn-escalation-flows.yaml | Escalation cascades | 4 |
| multi-turn-comprehensive.yaml | All 6 patterns | 6 |
A4: Multi-Turn Execution
Execute conversations via the Agent Runtime API using the reusable Python scripts in {SKILL_PATH}/hooks/scripts/.
⚠️ Agent API is NOT supported for agents of type "Agentforce (Default)". Only custom agents created via Agentforce Builder are supported.
Option 1: Run Test Scenarios from YAML Templates (Recommended)
Use the multi-turn test runner to execute entire scenario suites:
```bash
# Run comprehensive test suite against an agent
python3 {SKILL_PATH}/hooks/scripts/multi_turn_test_runner.py \
  --my-domain "${SF_MY_DOMAIN}" \
  --consumer-key "${SF_CONSUMER_KEY}" \
  --consumer-secret "${SF_CONSUMER_SECRET}" \
  --agent-id "${AGENT_ID}" \
  --scenarios templates/multi-turn-comprehensive.yaml \
  --verbose

# Run specific scenario within a suite
python3 {SKILL_PATH}/hooks/scripts/multi_turn_test_runner.py \
  --my-domain "${SF_MY_DOMAIN}" \
  --consumer-key "${SF_CONSUMER_KEY}" \
  --consumer-secret "${SF_CONSUMER_SECRET}" \
  --agent-id "${AGENT_ID}" \
  --scenarios templates/multi-turn-topic-routing.yaml \
  --scenario-filter topic_switch_natural \
  --verbose

# With context variables and JSON output for fix loop
python3 {SKILL_PATH}/hooks/scripts/multi_turn_test_runner.py \
  --my-domain "${SF_MY_DOMAIN}" \
  --consumer-key "${SF_CONSUMER_KEY}" \
  --consumer-secret "${SF_CONSUMER_SECRET}" \
  --agent-id "${AGENT_ID}" \
  --scenarios templates/multi-turn-comprehensive.yaml \
  --var '$Context.AccountId=001XXXXXXXXXXXX' \
  --var '$Context.EndUserLanguage=en_US' \
  --output results.json \
  --verbose
```
Exit codes: 0 = all passed, 1 = some failed (fix loop should process), 2 = execution error.
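A sketch of wiring these exit codes into an automation loop. The runner invocation is stubbed out here (run_tests is a stand-in for the real multi_turn_test_runner.py command with your flags):

```shell
# Stub standing in for the real runner invocation; returns a simulated exit code.
run_tests() {
  return "${SIMULATED_EXIT:-0}"
}

run_tests
code=$?
case "$code" in
  0) echo "All scenarios passed" ;;
  1) echo "Failures detected — hand results.json to the fix loop" ;;
  2) echo "Execution error — check ECA credentials before retrying" ;;
  *) echo "Unexpected exit code: $code" ;;
esac
```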
Option 2: Use Environment Variables (cleaner for repeated runs)
```bash
export SF_MY_DOMAIN="your-domain.my.salesforce.com"
export SF_CONSUMER_KEY="your_key"
export SF_CONSUMER_SECRET="your_secret"
export SF_AGENT_ID="0XxRM0000004ABC"

# Now run without credential flags
python3 {SKILL_PATH}/hooks/scripts/multi_turn_test_runner.py \
  --scenarios templates/multi-turn-comprehensive.yaml \
  --verbose
```
Option 3: Python API for Ad-Hoc Testing
For custom scenarios or debugging, use the client directly:
```python
from hooks.scripts.agent_api_client import AgentAPIClient

client = AgentAPIClient(
    my_domain="your-domain.my.salesforce.com",
    consumer_key="...",
    consumer_secret="..."
)

# Context manager auto-ends session
with client.session(agent_id="0XxRM000...") as session:
    r1 = session.send("I need to cancel my appointment")
    print(f"Turn 1: {r1.agent_text}")
    r2 = session.send("Actually, reschedule instead")
    print(f"Turn 2: {r2.agent_text}")
    r3 = session.send("What was my original request?")
    print(f"Turn 3: {r3.agent_text}")
    # Check context preservation
    if "cancel" in r3.agent_text.lower():
        print("✅ Context preserved")

# With initial variables
variables = [
    {"name": "$Context.AccountId", "type": "Id", "value": "001XXXXXXXXXXXX"},
    {"name": "$Context.EndUserLanguage", "type": "Text", "value": "en_US"},
]
with client.session(agent_id="0Xx...", variables=variables) as session:
    r1 = session.send("What orders do I have?")
```
Connectivity Test:
```bash
# Verify ECA credentials and API connectivity
python3 {SKILL_PATH}/hooks/scripts/agent_api_client.py
# Reads SF_MY_DOMAIN, SF_CONSUMER_KEY, SF_CONSUMER_SECRET from env
```
Per-Turn Analysis Checklist:
The test runner automatically evaluates each turn against expectations defined in the YAML template:
| # | Check | YAML Key | How Evaluated |
|---|---|---|---|
| 1 | Response non-empty? | — | Agent response has content |
| 2 | Correct topic matched? | — | Heuristic: inferred from response text |
| 3 | Expected actions invoked? | — | Checks for action array entries |
| 4 | Response content? | response_contains: "reschedule" | Substring match on response |
| 5 | Context preserved? | — | Heuristic: checks for prior-turn references |
| 6 | Guardrail respected? | guardrail_triggered: true | Regex patterns for refusal language |
| 7 | Escalation triggered? | escalation_triggered: true | Checks for escalation message type |
| 8 | Response excludes? | response_not_contains: "error" | Substring exclusion check |
See Agent API Reference for complete response format.
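The substring and refusal-pattern checks in the table can be sketched as below. This is illustrative only — the real evaluation logic lives in multi_turn_test_runner.py, and the refusal regexes here are assumptions:

```python
import re

# Assumed refusal-language patterns for the guardrail heuristic.
REFUSAL_PATTERNS = [r"\bI (?:can't|cannot|am unable to)\b", r"\bnot able to help\b"]

def evaluate_turn(agent_text, expectations):
    """Return a list of failure reasons for one turn (empty list = pass)."""
    failures = []
    if not agent_text.strip():
        failures.append("empty response")
    want = expectations.get("response_contains")
    if want and want.lower() not in agent_text.lower():
        failures.append(f"missing substring: {want!r}")
    ban = expectations.get("response_not_contains")
    if ban and ban.lower() in agent_text.lower():
        failures.append(f"forbidden substring present: {ban!r}")
    if expectations.get("guardrail_triggered"):
        if not any(re.search(p, agent_text, re.I) for p in REFUSAL_PATTERNS):
            failures.append("no refusal language detected")
    return failures

print(evaluate_turn("I can't share that information.",
                    {"guardrail_triggered": True, "response_not_contains": "error"}))
```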
A5: Results & Scoring
Claude generates a terminal-friendly results report:
```
📊 MULTI-TURN TEST RESULTS
════════════════════════════════════════════════════════════════
Agent: Customer_Support_Agent
Org:   vivint-DevInt
Mode:  Agent Runtime API (multi-turn)

SCENARIO RESULTS
───────────────────────────────────────────────────────────────
✅ topic_switch_natural        3/3 turns passed
✅ context_user_identity       3/3 turns passed
❌ escalation_frustration      2/3 turns passed (Turn 3: no escalation)
✅ guardrail_mid_conversation  3/3 turns passed
✅ action_chain_identify       3/3 turns passed
⚠️ variable_injection          2/3 turns passed (Turn 3: re-asked for account)

SUMMARY
───────────────────────────────────────────────────────────────
Scenarios: 6 total | 4 passed | 1 failed | 1 partial
Turns:     18 total | 16 passed | 2 failed
Topic Re-matching:    100% ✅
Context Preservation:  83% ⚠️
Escalation Accuracy:   67% ❌

FAILED TURNS
───────────────────────────────────────────────────────────────
❌ escalation_frustration → Turn 3
   Input:    "Nothing is working! I need a human NOW"
   Expected: Escalation triggered
   Actual:   Agent continued troubleshooting
   Category: MULTI_TURN_ESCALATION_FAILURE
   Fix:      Add frustration keywords to escalation triggers

⚠️ variable_injection → Turn 3
   Input:    "Create a new case for a billing issue"
   Expected: Uses pre-set $Context.AccountId
   Actual:   "Which account is this for?"
   Category: CONTEXT_PRESERVATION_FAILURE
   Fix:      Wire $Context.AccountId to CreateCase action input

SCORING
───────────────────────────────────────────────────────────────
Topic Selection Coverage       13/15
Action Invocation              14/15
Multi-Turn Topic Re-matching   15/15 ✅
Context Preservation           10/15 ⚠️
Edge Case & Guardrail Coverage 12/15
Test Spec / Scenario Quality    9/10
Agentic Fix Success            --/15 (pending)
TOTAL: 73/85 (86%) + Fix Loop pending
```
Phase B: CLI Testing Center (SECONDARY)
Availability: Requires Agent Testing Center feature enabled in org.
If unavailable, use Phase A exclusively.
⚡ Agent Script Agents (AiAuthoringBundle)
Agent Script agents (.agent files) deploy as AiAuthoringBundle metadata and use the same sf agent test CLI commands as Agent Builder agents. However, they have unique testing challenges:
Two-Level Action System:
- Level 1 (Definition): defines the action with
target: "apex://ClassName"
- Level 2 (Invocation): invokes the defined action inside a topic with variable bindings
Single-Utterance Limitation:
Multi-topic Agent Script agents with start_agent routing have a "1 action per reasoning cycle" budget in CLI tests. The first cycle is consumed by the topic-transition action (e.g., go_order_status). The actual business action (e.g., get_order_status) fires in a second cycle that single-utterance tests never reach.
yaml
testCases:
# ROUTING TEST — captures transition action only
- utterance: "I want to check my order status"
expectedTopic: order_status
expectedActions:
- go_order_status # Transition action from start_agent
# ACTION TEST — use conversationHistory to skip routing
- utterance: "The order ID is 801ak00001g59JlAAI"
conversationHistory:
- role: "user"
message: "I want to check my order status"
- role: "agent"
topic: "order_status" # Pre-positions agent in target topic
message: "I'd be happy to help! Could you provide the Order ID?"
expectedTopic: order_status
expectedActions:
- get_order_status # Level 1 DEFINITION name (NOT invocation name)
expectedOutcome: "Agent retrieves and displays order details"
Key Rules for Agent Script CLI Tests:
- expectedActions uses the Level 1 definition name (e.g., get_order_status), NOT the Level 2 invocation name
- Agent Script topic names may differ in the org — use the topic name discovery workflow
- Agents with Apex actions require the Einstein Agent User to have object permissions — missing permissions cause silent failures (0 rows, no error)
- expectedTopic in the YAML spec maps to the topic name defined in the .agent file
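The routing-test / action-test split above can be sketched as a generator. A hypothetical helper (not the bundled generate-test-spec.py); the field names follow the YAML example above:

```python
def make_agentscript_cases(topic: str, transition_action: str, business_action: str,
                           routing_utterance: str, followup_utterance: str) -> list:
    """Build one routing test and one action test (via conversationHistory)
    for a multi-topic Agent Script agent, per the 1-action-per-cycle budget."""
    routing_test = {
        "utterance": routing_utterance,
        "expectedTopic": topic,
        # First reasoning cycle is consumed by the transition action only
        "expectedActions": [transition_action],
    }
    action_test = {
        "utterance": followup_utterance,
        # Pre-position the agent inside the target topic so the business action fires
        "conversationHistory": [
            {"role": "user", "message": routing_utterance},
            {"role": "agent", "topic": topic, "message": "Could you provide more detail?"},
        ],
        "expectedTopic": topic,
        # Level 1 DEFINITION name, not the Level 2 invocation name
        "expectedActions": [business_action],
    }
    return [routing_test, action_test]
```

Dumping the returned list under a spec's `testCases:` key yields the two-case pattern shown earlier.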
⚠️ Agent Script API Testing Caveat:
Agent Script agents embed action results differently via the Agent Runtime API:
- Agent Builder agents: Return separate message types with structured data
- Agent Script agents: Embed action outputs within text messages — no separate type
This means:
- action_invoked (boolean) may fail even when the action runs — use response_contains to verify action output instead
- action_invoked: "action_name" uses fallback parsing but is less reliable
- For robust testing, prefer response_contains / response_not_contains checks over action_invoked
Agent Script Templates & Docs:
- Template: agentscript-test-spec.yaml — 5 test patterns (CLI)
- Template: multi-turn-agentscript-comprehensive.yaml — 6 multi-turn API scenarios
- Guide: agentscript-testing-patterns.md — detailed patterns with worked examples
Automated Test Spec Generation:
bash
python3 {SKILL_PATH}/hooks/scripts/generate-test-spec.py \
--agent-file /path/to/Agent.agent \
--output tests/agent-spec.yaml --verbose
# Generates both routing tests (with transition actions) and
# action tests (with conversationHistory for apex:// targets)
Agent Discovery:
bash
# Discover Agent Script agents alongside XML-based agents
python3 {SKILL_PATH}/hooks/scripts/agent_discovery.py local \
--project-dir /path/to/project --agent-name MyAgent
# Returns type: "AiAuthoringBundle" for .agent files
B1: Test Spec Creation
⚠️ CRITICAL: YAML Schema
The CLI YAML spec uses a FLAT structure parsed by sf agent test create — NOT a fabricated nested format. See test-spec-guide.md for the correct schema.
Required top-level fields:
- name — Display name (MasterLabel). Deploy FAILS without this.
- subjectName — Agent BotDefinition DeveloperName
Test case fields (flat, NOT nested):
- utterance — User message
- expectedTopic — Flat topic name string
- expectedActions — Flat list of strings, NOT objects with name/invoked keys
- expectedOutcome — Optional natural language description
yaml
# ✅ Correct CLI YAML format
name: "My Agent Tests"
subjectType: AGENT
subjectName: My_Agent
testCases:
- utterance: "Where is my order?"
expectedTopic: order_lookup
expectedActions:
- get_order_status
expectedOutcome: "Agent should provide order status information"
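A quick structural check for the flat schema can be scripted before deploying. A minimal sketch (the validator itself is hypothetical; the field rules come from this section):

```python
def validate_cli_spec(spec: dict) -> list:
    """Return a list of problems with a CLI test spec, per the flat-schema rules."""
    problems = []
    if not spec.get("name"):
        problems.append("missing 'name' (MasterLabel) — deploy fails without it")
    if not spec.get("subjectName"):
        problems.append("missing 'subjectName' (BotDefinition DeveloperName)")
    for i, case in enumerate(spec.get("testCases", [])):
        if "utterance" not in case:
            problems.append(f"testCases[{i}]: missing 'utterance'")
        actions = case.get("expectedActions", [])
        # Flat list of strings — NOT objects like {name: ..., invoked: true}
        if not all(isinstance(a, str) for a in actions):
            problems.append(f"testCases[{i}]: expectedActions must be flat strings, not objects")
    return problems
```

Running it over a `yaml.safe_load`-ed spec before `sf agent test create` catches the MasterLabel and nested-actions mistakes early.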
Option A: Interactive Generation (no automation)
bash
# Interactive test spec generation
sf agent generate test-spec --output-file ./tests/agent-spec.yaml
# ⚠️ NOTE: No --api-name flag! Interactive-only.
Option B: Automated Generation (Python script)
bash
python3 {SKILL_PATH}/hooks/scripts/generate-test-spec.py \
--agent-file /path/to/Agent.agent \
--output tests/agent-spec.yaml \
--verbose
Create Test in Org:
bash
sf agent test create --spec ./tests/agent-spec.yaml --api-name MyAgentTest --target-org [alias]
See Test Spec Reference for complete YAML format guide.
B1.5: Topic Name Resolution
Topic name format in expectedTopic depends on the topic type:
| Topic Type | YAML Value | Resolution |
|---|
| Standard (Escalation, Off_Topic) | Short name (e.g., Escalation) | Framework resolves automatically |
| Promoted (p_16j... prefix) | Full runtime name with hash | Must be exact match |
Standard topics like Escalation can use the short name — the CLI framework resolves it to the hash-suffixed runtime name.
Promoted topics (custom topics created in the Setup UI) MUST use the full runtime name including the hash suffix. The short name does NOT resolve.
Discovery workflow:
- Write the spec with best guesses for topic names
- Deploy and run:
sf agent test run --api-name X --wait 10 --result-format json --json
- Extract actual names:
jq '.result.testCases[].generatedData.topic'
- Update the spec with the actual runtime names
- Re-deploy with --force-overwrite and re-run
See topic-name-resolution.md for the complete guide.
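The extraction step of the workflow can equally be done in Python instead of jq. A small sketch assuming the same `--result-format json --json` output shape shown above:

```python
import json

def actual_topics(results_json: str) -> list:
    """Pull the runtime topic names (hash suffix included) out of a
    `sf agent test run ... --json` result, mirroring the jq filter above."""
    data = json.loads(results_json)
    return [tc.get("generatedData", {}).get("topic")
            for tc in data.get("result", {}).get("testCases", [])]
```

Feed the collected names back into `expectedTopic` before re-deploying the spec.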
B1.6: Known CLI Gotchas
| Gotcha | Detail |
|---|
| name mandatory | Deploy fails: "Required fields are missing: [MasterLabel]" |
| expectedActions is flat strings | NOT - name: action_name, invoked: true objects |
| Empty expectedActions | Means "not testing actions" — PASS even when actions invoked |
| Missing expectedOutcome | Results report ERROR for the check — harmless |
| No MessagingSession context | Flows that need one error out (agent handles gracefully) |
| --use-most-recent broken | Always use --job-id for sf agent test results |
| contextVariables prefix | Use RoutableId NOT $Context.RoutableId — framework adds prefix |
| customEvaluations RETRY bug | ⚠️ Spring '26: Server returns RETRY → REST API 500. See Known Issues. |
| Broken quality metric | Returns score=0, empty explanation — platform bug |
| instruction_following threshold | Labels FAILURE even at score=1 — use the score value, ignore the label |
B1.7: Context Variables
Context variables inject session-level data (record IDs, user info) into CLI test cases. Without them, action flows receive the topic's internal name as recordId. With them, they receive a real record ID.
When to use: Any test case where action flows need real record IDs (e.g., updating a MessagingSession, creating a Case).
YAML syntax:
yaml
contextVariables:
- name: RoutableId # Bare name — NOT $Context.RoutableId
value: "0Mwbb000007MGoTCAW"
- name: CaseId
value: "500XX0000000001"
Key rules:
- Uses the bare variable name (e.g., RoutableId), NOT $Context.RoutableId — the CLI adds the prefix
- Maps to the <contextVariable><variableName> element in the XML metadata
Discovery — find valid IDs:
bash
sf data query --query "SELECT Id FROM MessagingSession WHERE Status='Active' LIMIT 1" --target-org [alias]
sf data query --query "SELECT Id FROM Case ORDER BY CreatedDate DESC LIMIT 1" --target-org [alias]
Verified effect (IRIS testing, 2026-02-09):
- Without context variables: the action receives
recordId: "p_16jPl000000GwEX_Field_Support_Routing_16j8eeef13560aa"
(the topic name)
- With context variables: the action receives
recordId: "0Mwbb000007MGoTCAW"
(a real MessagingSession ID)
Note: Context variables do NOT unlock authentication-gated topics. Injecting identity variables (e.g., EndUserId) does not satisfy verification-gated flows.
See context-vars-test-spec.yaml for a dedicated template.
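Building contextVariables entries programmatically makes the bare-name rule hard to get wrong. A hypothetical helper that strips an accidental `$Context.` prefix before emitting the YAML structure:

```python
def context_variables(pairs: dict) -> list:
    """Build contextVariables entries for a CLI test spec.
    Bare names only — the CLI adds the $Context. prefix itself."""
    return [
        # Defensive: drop the prefix if the caller included it anyway
        {"name": name.removeprefix("$Context."), "value": value}
        for name, value in pairs.items()
    ]
```

Serializing the returned list under a `contextVariables:` key reproduces the YAML syntax shown above. (`str.removeprefix` needs Python 3.9+.)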
B1.8: Metrics
Metrics add platform quality scoring to test cases. Specify as a flat list of metric names in the YAML.
YAML syntax:
yaml
metrics:
- coherence
- instruction_following
- output_latency_milliseconds
Available metrics (observed behavior from IRIS testing, 2026-02-09):
| Metric | Score Range | Status | Notes |
|---|
| coherence | 1-5 | ✅ Works | Scores 4-5 for clear responses. Recommended. |
| | 1-5 | ⚠️ Misleading | Penalizes triage/routing agents for "not solving" — skip for routing agents. |
| | 1-5 | 🔴 Broken | Returns score=0, empty explanation. Platform bug. |
| instruction_following | 0-1 | ⚠️ Threshold bug | Labels "FAILURE" at score=1 when explanation says "follows perfectly." |
| output_latency_milliseconds | Raw ms | ✅ Works | No pass/fail — useful for performance baselining. |
Recommendation: Use coherence + output_latency_milliseconds for baseline quality. Skip the broken metric and the one that penalizes routing agents.
B1.9: Custom Evaluations (⚠️ Spring '26 Bug)
Custom evaluations allow JSONPath-based assertions on action inputs and outputs — e.g., "verify the action received supportPath = 'Field Support'."
YAML syntax:
yaml
customEvaluations:
- label: "supportPath is Field Support"
name: string_comparison
parameters:
- name: operator
value: equals
isReference: false
- name: actual
value: "$.generatedData.invokedActions[0][0].function.input.supportPath"
isReference: true # JSONPath resolved against generatedData
- name: expected
value: "Field Support"
isReference: false
Evaluation types: selected via the name field (e.g., string_comparison, as in the example above).
Building JSONPath expressions:
- Run tests with --result-format json to see
generatedData.invokedActions
- Parse the stringified JSON (it is a string, not a parsed array)
- Common paths:
$.generatedData.invokedActions[0][0].function.input.[field]
⚠️ BLOCKED — Spring '26 Platform Bug: Custom evaluations with JSONPath references (isReference: true) cause the server to return "RETRY" status. The results API then crashes with HTTP 500. This is server-side (confirmed via direct curl against the REST endpoint).
Workaround: Use expectedOutcome (LLM-as-judge) or the Testing Center UI until patched.
See custom-eval-test-spec.yaml for a dedicated template.
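Until the platform bug is patched, the same assertion can be made client-side by parsing the stringified `invokedActions` yourself. A sketch (the helper name is an assumption; the `[0][0].function.input` path follows the common path above):

```python
import json

def action_input_field(generated_data: dict, field: str):
    """Resolve $.generatedData.invokedActions[0][0].function.input.<field> by hand.
    invokedActions arrives as stringified JSON, not a parsed array."""
    raw = generated_data["invokedActions"]
    actions = json.loads(raw) if isinstance(raw, str) else raw
    return actions[0][0]["function"]["input"].get(field)
```

An assertion then becomes a plain comparison, e.g. checking the value equals "Field Support".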
B2: Test Execution
bash
# Run automated tests
sf agent test run --api-name MyAgentTest --wait 10 --result-format json --target-org [alias]
No ECA required. Preview and CLI tests use standard org auth. No Connected App setup needed (v2.121.7+).
Interactive Preview (Simulated):
bash
sf agent preview --api-name AgentName --output-dir ./logs --target-org [alias]
Interactive Preview (Live):
bash
sf agent preview --api-name AgentName --use-live-actions --apex-debug --target-org [alias]
B3: Results Analysis
Parse test results JSON and display formatted summary:
📊 AGENT TEST RESULTS (CLI)
════════════════════════════════════════════════════════════════
Agent: Customer_Support_Agent
Org: vivint-DevInt
Duration: 45.2s
Mode: Simulated
SUMMARY
───────────────────────────────────────────────────────────────
✅ Passed: 18
❌ Failed: 2
⏭️ Skipped: 0
📈 Topic Selection: 95%
🎯 Action Invocation: 90%
FAILED TESTS
───────────────────────────────────────────────────────────────
❌ test_complex_order_inquiry
Utterance: "What's the status of orders 12345 and 67890?"
Expected: get_order_status invoked 2 times
Actual: get_order_status invoked 1 time
Category: ACTION_INVOCATION_COUNT_MISMATCH
COVERAGE SUMMARY
───────────────────────────────────────────────────────────────
Topics Tested: 4/5 (80%) ⚠️
Actions Tested: 6/8 (75%) ⚠️
Guardrails Tested: 3/3 (100%) ✅
Phase C: Agentic Fix Loop
When tests fail (either Phase A or Phase B), automatically fix via sf-ai-agentscript:
Failure Categories (10 total)
| Category | Source | Auto-Fix | Strategy |
|---|
| | A+B | ✅ | Add keywords to topic description |
| | A+B | ✅ | Improve action description |
| | A+B | ✅ | Differentiate descriptions |
| | A+B | ⚠️ | Delegate to sf-flow or sf-apex |
| | A+B | ✅ | Add explicit guardrails |
| | A+B | ✅ | Add escalation action/triggers |
| TOPIC_RE_MATCHING_FAILURE | A | ✅ | Add transition phrases to target topic |
| CONTEXT_PRESERVATION_FAILURE | A | ✅ | Add context retention instructions |
| MULTI_TURN_ESCALATION_FAILURE | A | ✅ | Add frustration detection triggers |
| | A | ✅ | Fix action output variable mappings |
Auto-Fix Command Example
bash
Skill(skill="sf-ai-agentscript", args="Fix agent [AgentName] - Error: [category] - [details]")
Fix Loop Flow
Test Failed → Analyze failure category
│
├─ Single-turn failure → Standard fix (topics, actions, guardrails)
│
└─ Multi-turn failure → Enhanced fix (context, re-matching, escalation, chaining)
│
▼
Apply fix via sf-ai-agentscript → Re-publish → Re-test
│
├─ Pass → ✅ Move to next failure
└─ Fail → Retry (max 3 attempts) → Escalate to human
See Agentic Fix Loops Guide for complete decision tree and 10 fix strategies.
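The flow above can be sketched in a few lines; `run_tests` and `apply_fix` are hypothetical callables standing in for the test runner and the sf-ai-agentscript delegation:

```python
def fix_loop(run_tests, apply_fix, max_attempts: int = 3) -> str:
    """Test → analyze → fix → re-test, escalating to a human after max_attempts.
    run_tests() returns a list of failure categories (empty list = all passing);
    apply_fix(category) applies one repair (e.g. delegates to sf-ai-agentscript)."""
    attempt = 0
    while True:
        failures = run_tests()
        if not failures:
            return "pass"
        attempt += 1
        if attempt > max_attempts:
            return "escalate"  # max attempts reached → human review
        for category in failures:
            apply_fix(category)
```

A real driver would also re-publish the agent between `apply_fix` and the next `run_tests` call.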
Two Fix Strategies
| Agent Type | Fix Strategy | When to Use |
|---|
| Custom Agent (you control it) | Fix the agent via sf-ai-agentscript | Topic descriptions, action configs need adjustment |
| Managed/Standard Agent | Fix test expectations | Test expectations don't match actual behavior |
Phase D: Coverage Improvement
If coverage < threshold:
- Identify untested topics/actions/patterns from results
- Add test cases (YAML for CLI, scenarios for API)
- Re-run tests
- Repeat until threshold met
Coverage Dimensions
| Dimension | Phase A | Phase B | Target |
|---|
| Topic Selection | ✅ | ✅ | 100% |
| Action Invocation | ✅ | ✅ | 100% |
| Topic Re-matching | ✅ | ❌ | 90%+ |
| Context Preservation | ✅ | ❌ | 95%+ |
| Conversation Completion | ✅ | ❌ | 85%+ |
| Guardrails | ✅ | ✅ | 100% |
| Escalation | ✅ | ✅ | 100% |
| Phrasing Diversity | ✅ | ✅ | 3+ per topic |
See Coverage Analysis for complete metrics and improvement guide.
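Threshold checks against the target table can be automated. A minimal sketch (dimension keys and the helper name are assumptions, not the skill's actual report format):

```python
def coverage_report(tested: dict, totals: dict, targets: dict) -> dict:
    """Compare per-dimension coverage against percentage targets.
    tested/totals hold counts per dimension; targets holds the required %."""
    report = {}
    for dim, target_pct in targets.items():
        pct = 100.0 * tested.get(dim, 0) / max(totals.get(dim, 1), 1)
        report[dim] = {"pct": round(pct, 1), "met": pct >= target_pct}
    return report
```

Dimensions that come back with `met: False` feed Phase D's "add test cases and re-run" loop.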
Phase E: Observability Integration
After test execution, guide user to analyze agent behavior with session-level observability:
Skill(skill="sf-ai-agentforce-observability", args="Analyze STDM sessions for agent [AgentName] in org [alias] - focus on test session behavior patterns")
What observability adds to testing:
- STDM Session Analysis: Examine actual session traces from test conversations
- Latency Profiling: Identify slow actions or topic routing delays
- Error Pattern Detection: Find recurring failures across sessions
- Action Execution Traces: Detailed view of Flow/Apex execution during tests
Scoring System (100 Points)
| Category | Points | Key Rules |
|---|
| Topic Selection Coverage | 15 | All topics have test cases; various phrasings tested |
| Action Invocation | 15 | All actions tested with valid inputs/outputs |
| Multi-Turn Topic Re-matching | 15 | Topic switching accuracy across turns |
| Context Preservation | 15 | Information retention across turns |
| Edge Case & Guardrail Coverage | 15 | Negative tests; guardrails; escalation |
| Test Spec / Scenario Quality | 10 | Proper YAML; descriptions; clear expectations |
| Agentic Fix Success | 15 | Auto-fixes resolve issues within 3 attempts |
Scoring Thresholds:
⭐⭐⭐⭐⭐ 90-100 pts → Production Ready
⭐⭐⭐⭐ 80-89 pts → Good, minor improvements
⭐⭐⭐ 70-79 pts → Acceptable, needs work
⭐⭐ 60-69 pts → Below standard
⭐ <60 pts → BLOCKED - Major issues
⛔ TESTING GUARDRAILS (MANDATORY)
BEFORE running tests, verify:
| Check | Command | Why |
|---|
| Agent published | sf agent list --target-org [alias] | Can't test unpublished agent |
| Agent activated | Check status | API and preview require activation |
| Flows deployed | sf org list metadata --metadata-type Flow | Actions need Flows |
| ECA configured (Phase A — multi-turn API only) | Token request test | Required for Agent Runtime API. Not needed for preview or CLI tests |
| Org auth (Phase B live) | | Live mode requires valid auth |
NEVER do these:
| Anti-Pattern | Problem | Correct Pattern |
|---|
| Test unpublished agent | Tests fail silently | Publish first |
| Skip simulated testing | Live mode hides logic bugs | Always test simulated first |
| Ignore guardrail tests | Security gaps in production | Always test harmful/off-topic inputs |
| Single phrasing per topic | Misses routing failures | Test 3+ phrasings per topic |
| Write ECA credentials to files | Security risk | Keep in shell variables only |
| Skip session cleanup | Resource leaks and rate limits | Always DELETE sessions after tests |
| Raw curl for OAuth token requests | Domains with shell-special characters cause failures | Use credential_manager.py validate |
| Ask permission to run skill scripts | Breaks flow, unnecessary delay | All are pre-approved — run automatically |
| Spawn more than 2 swarm workers | Context overload, screen space, diminishing returns | Max 2 workers — side-by-side monitoring |
CLI Command Reference
Test Lifecycle Commands
| Command | Purpose | Example |
|---|
| sf agent generate test-spec | Create test YAML | sf agent generate test-spec --output-dir ./tests |
| sf agent test create | Deploy test to org | sf agent test create --spec ./tests/spec.yaml --target-org alias |
| sf agent test run | Execute tests | sf agent test run --api-name Test --wait 10 --target-org alias |
| sf agent test results | Get results | sf agent test results --job-id ID --result-format json |
| sf agent test resume | Resume async test | sf agent test resume --job-id <JOB_ID> --target-org alias |
| sf agent test list | List test runs | sf agent test list --target-org alias |
Preview Commands
| Command | Purpose | Example |
|---|
| sf agent preview | Interactive testing | sf agent preview --api-name Agent --target-org alias |
| --use-live-actions | Use real Flows/Apex | sf agent preview --use-live-actions |
| --output-dir | Save transcripts | sf agent preview --output-dir ./logs |
| --apex-debug | Capture debug logs | sf agent preview --apex-debug |
Result Formats
| Format | Use Case | Flag |
|---|
| human | Terminal display (default) | (default) |
| json | CI/CD parsing | --result-format json |
| junit | Test reporting | --result-format junit |
| tap | Test Anything Protocol | --result-format tap |
Multi-Turn Test Templates
| Template | Pattern | Scenarios | Location |
|---|
| multi-turn-topic-routing.yaml | Topic switching | 4 | |
| multi-turn-context-preservation.yaml | Context retention | 4 | |
| multi-turn-escalation-flows.yaml | Escalation cascades | 4 | |
| multi-turn-comprehensive.yaml | All 6 patterns | 6 | |
CLI Test Templates
| Template | Purpose | Location |
|---|
| | Quick start (3-5 tests) | |
| comprehensive-test-spec.yaml | Full coverage (20+ tests) with context vars, metrics, custom evals | |
| context-vars-test-spec.yaml | Context variable patterns (RoutableId, EndUserId, CaseId) | |
| custom-eval-test-spec.yaml | Custom evaluations with JSONPath assertions (⚠️ Spring '26 bug) | |
| cli-auth-guardrail-tests.yaml | Auth gate, guardrail, ambiguous routing, session tests (CLI) | |
| | Security/safety scenarios | |
| | Human handoff scenarios | |
| agentscript-test-spec.yaml | Agent Script agents with conversationHistory pattern | |
| | Reference format | |
Cross-Skill Integration
Required Delegations:
| Scenario | Skill to Call | Command |
|---|
| Fix agent script | sf-ai-agentscript | Skill(skill="sf-ai-agentscript", args="Fix...")
|
| Agent Script agents | sf-ai-agentscript | Parse .agent files for topic/action discovery; use the conversationHistory pattern for action tests |
| Create test data | sf-data | Skill(skill="sf-data", args="Create...")
|
| Fix failing Flow | sf-flow | Skill(skill="sf-flow", args="Fix...")
|
| Setup ECA or OAuth (multi-turn API only) | sf-connected-apps | Skill(skill="sf-connected-apps", args="Create...")
|
| Analyze debug logs | sf-debug | Skill(skill="sf-debug", args="Analyze...")
|
| Session observability | sf-ai-agentforce-observability | Skill(skill="sf-ai-agentforce-observability", args="Analyze...")
|
Automated Testing (Python Scripts)
| Script | Purpose | Dependencies |
|---|
| agent_api_client.py | Reusable Agent Runtime API v1 client (auth, sessions, messaging, variables) | stdlib only |
| multi_turn_test_runner.py | Multi-turn test orchestrator (reads YAML, executes, evaluates, Rich colored reports) | pyyaml, rich + agent_api_client |
| | Aggregate N worker result JSONs into one unified Rich terminal report | rich |
| generate-test-spec.py | Parse .agent files, generate CLI test YAML specs | stdlib only |
| run-automated-tests.py | Orchestrate full CLI test workflow with fix suggestions | stdlib only |
CLI Flags (multi_turn_test_runner.py):
| Flag | Default | Purpose |
|---|
| none | Write Rich terminal report to file (ANSI codes included) — viewable with or |
| off | Disable Rich colored output; use plain-text format |
| auto | Override terminal width (auto-detects from $COLUMNS; fallback 80) |
| (deprecated) | No-op — Rich is now default when installed |
Multi-Turn Testing (Agent Runtime API):
bash
# Install test runner dependency
pip3 install pyyaml
# Run multi-turn test suite against an agent
python3 {SKILL_PATH}/hooks/scripts/multi_turn_test_runner.py \
--my-domain your-domain.my.salesforce.com \
--consumer-key YOUR_KEY \
--consumer-secret YOUR_SECRET \
--agent-id 0XxRM0000004ABC \
--scenarios templates/multi-turn-comprehensive.yaml \
--output results.json --verbose
# Or set env vars and omit credential flags
export SF_MY_DOMAIN=your-domain.my.salesforce.com
export SF_CONSUMER_KEY=YOUR_KEY
export SF_CONSUMER_SECRET=YOUR_SECRET
python3 {SKILL_PATH}/hooks/scripts/multi_turn_test_runner.py \
--agent-id 0XxRM0000004ABC \
--scenarios templates/multi-turn-topic-routing.yaml \
--var '$Context.AccountId=001XXXXXXXXXXXX' \
--verbose
# Connectivity test (verify ECA credentials work)
python3 {SKILL_PATH}/hooks/scripts/agent_api_client.py
CLI Testing (Agent Testing Center):
bash
# Generate test spec from agent file
python3 {SKILL_PATH}/hooks/scripts/generate-test-spec.py \
--agent-file /path/to/Agent.agent \
--output specs/Agent-tests.yaml
# Run full automated workflow
python3 {SKILL_PATH}/hooks/scripts/run-automated-tests.py \
--agent-name MyAgent \
--agent-dir /path/to/project \
--target-org dev
🔄 Automated Test-Fix Loop
v2.0.0 | Supports both multi-turn API failures and CLI test failures
Quick Start
bash
# Run the test-fix loop (CLI tests)
{SKILL_PATH}/hooks/scripts/test-fix-loop.sh Test_Agentforce_v1 AgentforceTesting 3
# Exit codes:
# 0 = All tests passed
# 1 = Fixes needed (Claude Code should invoke sf-ai-agentforce)
# 2 = Max attempts reached, escalate to human
# 3 = Error (org unreachable, test not found, etc.)
Claude Code Integration
USER: Run automated test-fix loop for Coral_Cloud_Agent
CLAUDE CODE:
1. Phase A: Run multi-turn scenarios via Python test runner
python3 {SKILL_PATH}/hooks/scripts/multi_turn_test_runner.py \
--agent-id ${AGENT_ID} \
--scenarios templates/multi-turn-comprehensive.yaml \
--output results.json --verbose
2. Analyze failures from results.json (10 categories)
3. If fixable: Skill(skill="sf-ai-agentscript", args="Fix...")
4. Re-run failed scenarios with --scenario-filter
5. Phase B (if available): Run CLI tests
6. Repeat until passing or max retries (3)
Environment Variables
| Variable | Description | Default |
|---|
| Current attempt number | 1 |
| Timeout for test execution | 10 |
| Comma-separated test names to skip | (none) |
| Enable detailed output | false |
💡 Key Insights
| Problem | Symptom | Solution |
|---|
| sf agent test create fails | "Required fields are missing: [MasterLabel]" | Add name field to top of YAML spec (see Phase B1) |
| Tests fail silently | No results returned | Agent not published — run sf agent publish authoring-bundle |
| Topic not matched | Wrong topic selected | Add keywords to topic description |
| Action not invoked | Action never called | Improve action description |
| Live preview 401 | Authentication error | Re-authenticate the org |
| API 401 | Token expired or wrong credentials | Re-authenticate ECA |
| API 404 on session create | Wrong Agent ID | Re-query BotDefinition for correct Id |
| Empty API response | Agent not activated | Activate and publish agent |
| Context lost between turns | Agent re-asks for known info | Add context retention instructions to topic |
| Topic doesn't switch | Agent stays on old topic | Add transition phrases to target topic |
| ⚠️ --use-most-recent broken | "Nonexistent flag" error | Use --job-id explicitly |
| Topic name mismatch | Expected short name, got hash-suffixed runtime name | Verify actual topic names from first test run |
| Action superset matching | Extra actions invoked but test still PASSES | CLI uses SUPERSET logic |
Quick Start Example
Multi-Turn API Testing (Recommended)
Quick Start with Python Scripts:
bash
# 1. Get agent ID
AGENT_ID=$(sf data query --use-tooling-api \
--query "SELECT Id FROM BotDefinition WHERE DeveloperName='My_Agent' AND IsActive=true LIMIT 1" \
--result-format json --target-org dev | jq -r '.result.records[0].Id')
# 2. Run multi-turn tests (credentials from env or flags)
python3 {SKILL_PATH}/hooks/scripts/multi_turn_test_runner.py \
--my-domain "${SF_MY_DOMAIN}" \
--consumer-key "${CONSUMER_KEY}" \
--consumer-secret "${CONSUMER_SECRET}" \
--agent-id "${AGENT_ID}" \
--scenarios templates/multi-turn-comprehensive.yaml \
--output results.json --verbose
Ad-Hoc Python Usage:
python
from hooks.scripts.agent_api_client import AgentAPIClient
client = AgentAPIClient() # reads SF_MY_DOMAIN, SF_CONSUMER_KEY, SF_CONSUMER_SECRET from env
with client.session(agent_id="0XxRM000...") as session:
r1 = session.send("I need to cancel my appointment")
r2 = session.send("Actually, reschedule it instead")
r3 = session.send("What was my original request about?")
# Session auto-ends when exiting context manager
CLI Testing (If Agent Testing Center Available)
bash
# 1. Generate test spec
python3 {SKILL_PATH}/hooks/scripts/generate-test-spec.py \
--agent-file ./agents/MyAgent.agent \
--output ./tests/myagent-tests.yaml
# 2. Create test in org
sf agent test create --spec ./tests/myagent-tests.yaml --api-name MyAgentTest --target-org dev
# 3. Run tests
sf agent test run --api-name MyAgentTest --wait 10 --result-format json --target-org dev
# 4. View results (use --job-id, NOT --use-most-recent)
sf agent test results --job-id [JOB_ID] --verbose --result-format json --target-org dev
🐛 Known Issues & CLI Bugs
Last Updated: 2026-02-11 | Tested With: sf CLI v2.118.16+
RESOLVED: MasterLabel Error
Status: 🟢 RESOLVED — Add a name field to the YAML spec
Error:
Required fields are missing: [MasterLabel]
Root Cause: The YAML spec must include a name field at the top level, which maps to MasterLabel in the generated XML. Our templates previously omitted this field.
Fix: Add name to the top of your YAML spec:
yaml
name: "My Agent Tests" # ← This was the missing field
subjectType: AGENT
subjectName: My_Agent
If you still encounter issues:
- ✅ Use the interactive sf agent generate test-spec wizard (interactive-only, no CLI flags)
- ✅ Create tests via the Salesforce Testing Center UI
- ✅ Deploy XML metadata directly
- ✅ Use Phase A (Agent Runtime API) instead — bypasses the CLI entirely
MEDIUM: Interactive Mode Not Scriptable
Status: 🟡 Blocks CI/CD automation
Issue: sf agent generate test-spec only works interactively.
Workaround: Use the Python scripts in hooks/scripts/ or the Phase A multi-turn templates.
MEDIUM: YAML vs XML Format Discrepancy
Key Mappings:
| YAML Field | XML Element / Assertion Type |
|---|
| |
| |
| |
| ( / ) |
| / () |
| (name only, no expectedValue) |
LOW: BotDefinition Not Always in Tooling API
Status: 🟡 Handled automatically
Issue: In some org configurations, BotDefinition is not queryable via the Tooling API but works via the regular Data API (without --use-tooling-api).
Fix: The agent ID lookup now has an automatic fallback — if the Tooling API returns no results for BotDefinition, it retries with the regular API.
LOW: --use-most-recent Not Implemented
Status: Flag documented but NOT functional. Always use --job-id explicitly.
CRITICAL: Custom Evaluations RETRY Bug (Spring '26)
Status: 🔴 PLATFORM BUG — Blocks all custom evaluations that use JSONPath references
Error:
INTERNAL_SERVER_ERROR: The specified enum type has no constant with the specified name: RETRY
Scope:
- Server returns "RETRY" status for test cases with custom evaluations using isReference: true
- Results API endpoint crashes with HTTP 500 when fetching results
- Both filter expressions AND direct indexing trigger the bug
- Tests WITHOUT custom evaluations on the same run complete normally
Confirmed: A direct curl to the REST endpoint returns the same 500 — NOT a CLI parsing issue
Workaround:
- Use the Testing Center UI (Setup → Agent Testing) — may display results
- Skip custom evaluations until a platform patch lands
- Use expectedOutcome (LLM-as-judge) for response validation instead
Tracking: Discovered 2026-02-09 on DevInt sandbox (Spring '26). TODO: Retest after platform patch.
MEDIUM: Metric Returns Score=0
Status: 🟡 Platform bug — the metric's evaluation appears non-functional
Issue: The affected metric consistently returns score=0 with an empty explanation field across all test cases tested on DevInt (Spring '26).
Workaround: Skip this metric in metrics lists until a platform patch lands.
LOW: instruction_following FAILURE at Score=1
Status: 🟡 Threshold mismatch — score and label disagree
Issue: The instruction_following metric labels results as "FAILURE" even when score=1 and the explanation text says the agent "follows instructions perfectly." This appears to be a pass/fail threshold configuration error on the platform side.
Workaround: Use the numeric score value (0 or 1) for evaluation. Ignore the PASS/FAILURE label.
HIGH: instruction_following Crashes Testing Center UI
Status: 🔴 Blocks Testing Center UI entirely — separate from the threshold bug above
Error:
Unable to get test suite: No enum constant einstein.gpt.shared.testingcenter.enums.AiEvaluationMetricType.INSTRUCTION_FOLLOWING_EVALUATION
Scope: The Testing Center UI (Setup → Agent Testing) throws a Java exception when opening any test suite that includes the instruction_following metric. The CLI works fine — only the UI rendering is broken.
Workaround: Remove instruction_following from the YAML metrics list and redeploy the test spec via
sf agent test create --force-overwrite
.
Note: This is a different bug from the threshold mismatch above. The threshold bug affects score interpretation; this bug blocks the entire UI from loading.
Discovered: 2026-02-11 on DevInt sandbox (Spring '26).
License
MIT License. See LICENSE file.
Copyright (c) 2024-2026 Jag Valaiyapathy