Midscene YAML Generator

Typical Workflow

User Requirement → [Generator] Generate YAML
                → [Generator] Auto dry-run validation
                → Validation failed? → [Generator] Auto-fix
                → [Runner] Execute
                → Execution failed? → [Runner] Analyze + Fix YAML → Re-execute
                → Success → Display report summary

Trigger Conditions

Use this when users describe a browser automation requirement (in natural language) and need to generate a Midscene YAML file.

Common trigger phrases:

"Generate a YAML to..."
"Help me write an automation script..."
"Create a Midscene test case..."
"I want to automate the XXX operation..."
"Convert this requirement to YAML..."
"Write a Midscene config file..."

English trigger phrases:

"Generate a YAML for..."
"Write an automation script to..."
"Create a test case for..."
"Automate the login flow"
"Convert this requirement to YAML"
"Write a Midscene config file for..."

Workflow

Step 1: Analyze Requirement Complexity

Determine the required mode based on the user's description:

Choose Native Mode — When the requirement only involves:

Open web pages / Launch applications
Basic interactions like clicking, hovering, inputting, scrolling, keyboard operations
AI automatic planning and execution (`ai`)
Data extraction (`aiQuery`)
Validation assertions (`aiAssert`)
Wait conditions (`aiWaitFor`)
Tool operations (`sleep`, `javascript`, `recordToReport`)
Platform-specific operations (`runAdbShell`, `runWdaRequest`, `launch`)

Choose Extended Mode — When the requirement involves any of the following:

Conditional judgment ("If...then...")
Loop operations ("Repeat", "Traverse", "Pagination")
Variables and dynamic data ("Define variables", "Parameterization")
External API calls ("Call API")
Error handling and retries ("If failed...", "Retry")
Parallel tasks ("Do...simultaneously")
Data transformation processing ("Filter", "Sort", "Map")
Import and reuse sub-flows ("Reuse", "Import")

Rule of Thumb: Start with Native mode and switch to Extended mode once you find you need `if`, `for`, or variables.
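
As a rough illustration of where that switch happens, the sketch below expresses an "if a cookie banner is visible, dismiss it" requirement, which needs a branch and therefore Extended mode. It uses only the `engine`, `features`, and `logic` keys named in this document. The URL, task name, and instruction wording are placeholders, and the exact nesting of `logic` inside `flow` is an assumption based on the mapping table, not confirmed behavior.

```yaml
# Hypothetical example: the URL and all descriptions are placeholders.
engine: extended
features: [logic]

web:
  url: "https://example.com"

tasks:
  - name: "Dismiss cookie banner if present"
    flow:
      # Assumed shape: a logic step with if / then / else, as in the mapping table
      - logic:
          if: "A cookie consent banner is visible"
          then:
            - aiTap: "Accept button in the cookie banner"
          else:
            - aiAssert: "The page content is visible without a banner"
```

Without the branch, the same requirement would reduce to a single `aiTap` step and could stay in Native mode.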

Step 2: Determine Target Platform

Determine platform configuration based on user description:

User Description	Platform	YAML Configuration
"Open web page/website/URL"	Web	`web: { url: "...", headless: false }`
"Test Android app"	Android	`android: { deviceId: "..." }` + `launch: "package name"`
"Test iOS app"	iOS	`ios: { wdaPort: 8100 }` + `launch: "bundleId"`
"Desktop automation"	Computer	`computer: { ... }`

Additional Web Platform Configuration Options:

`headless: true/false`: Run in headless mode (default false)
`viewportWidth` / `viewportHeight`: Viewport size (default 1280×720)
`userAgent`: Custom User-Agent
`deviceScaleFactor`: Device pixel ratio (e.g., set to 2 for Retina screens)
`waitForNetworkIdle`: Network idle wait configuration; supports `true` or the object format `{ timeout: 2000, continueOnNetworkIdleError: true }`
`cookie`: Path to a cookie JSON file (enables login-free session recovery)
`bridgeMode`: Bridge mode, which reuses a logged-in desktop browser; one of `false` (default), `'newTabWithUrl'`, or `'currentTab'`
`chromeArgs`: Array of custom Chrome launch arguments (e.g., `['--disable-gpu', '--proxy-server=...']`)
`serve`: Local static file directory; starts the built-in server
`acceptInsecureCerts`: Ignore HTTPS certificate errors (default false)
`forceSameTabNavigation`: Restrict navigation to the current tab (default true)
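
To show how several of these options might sit together, here is a minimal sketch of a `web` block. The option names come from the list above; the URL, the cookie file path, and the chosen values are illustrative assumptions only.

```yaml
# Hypothetical values: URL, cookie path, and sizes are placeholders.
web:
  url: "https://example.com/dashboard"
  headless: true
  viewportWidth: 1440
  viewportHeight: 900
  deviceScaleFactor: 2            # e.g. Retina screens
  waitForNetworkIdle:             # object format; `true` is also accepted
    timeout: 2000
    continueOnNetworkIdleError: true
  cookie: "./cookies/session.json"  # assumed location of a cookie JSON file
  acceptInsecureCerts: false
  forceSameTabNavigation: true
```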

Step 3: Natural Language → YAML Conversion

Action Selection Priority (Important)

Prefer `ai:`. Describe the entire intent in natural language and let AI automatically plan and execute the multi-step operation. Suitable for most scenarios, with the highest success rate
When precise control is needed: Use specific actions like `aiTap` and `aiInput` (e.g., filling in specific form fields)
When data extraction is needed: Must use `aiQuery` (`ai:` cannot return structured data)
When state validation is needed: Use `aiAssert` or `aiWaitFor`

Rule of Thumb: If the user's requirement can be described in a single natural language sentence, prefer one `ai:` step over splitting it into multiple `aiInput` + `aiTap` steps.

Golden Path - Minimal Working Example:

yaml

web:
  url: "https://www.baidu.com"

tasks:
  - name: "Search for Midscene"
    flow:
      - ai: "Enter Midscene in the search box and click search"
      - sleep: 3000
      - aiAssert: "The page displays search results"
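
For contrast, the same search could be split into explicit actions when the single `ai:` step proves too coarse or when structured data has to be returned. This is only a sketch using the actions named in the priority list above; the element descriptions and the result name `resultTitles` are assumptions.

```yaml
# Hypothetical variant of the example above with explicit steps.
engine: native

web:
  url: "https://www.baidu.com"

tasks:
  - name: "Search for Midscene with explicit steps"
    flow:
      - aiInput: "Search box"
        value: "Midscene"
      - aiTap: "Search button next to the search box"
      - aiWaitFor: "The search result list is visible"
        timeout: 10000
      # ai: cannot return structured data, so extraction uses aiQuery
      - aiQuery:
          query: "Extract the titles of the first five search results as an array of strings"
          name: "resultTitles"
```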

Native Mode YAML Format Specification (Important)

Native mode supports two formats for action parameters:

Flat Format (Recommended, concise): Action keyword followed by string value, additional parameters as sibling keys.

yaml

- aiInput: "Search box"
  value: "Keyword"
- aiWaitFor: "Page loaded completely"
  timeout: 10000
- aiTap: "Button description"
  deepThink: true
- aiAssert: "Page contains expected content"
  errorMessage: "Content validation failed"

Nested Format (Also valid, suitable for complex parameters):

yaml

- aiInput:
    locator: "Search box"
    value: "Keyword"
- aiQuery:
    query: "Extract product list"
    name: "products"

Use the following mapping rule table to convert user requirements to YAML:

Native Action Mapping

Natural Language Pattern	YAML Mapping	Description
"Open/access/enter XXX website"	`web: { url: "XXX" }`	Platform configuration
"Automatically plan and execute XXX"	`ai: "XXX"`	AI automatically breaks down into multi-step execution; optional `fileChooserAccept: "path"` to handle file upload dialogs
"Click/press/select XXX"	`aiTap: "XXX"`	Short form
"Hover/move over XXX"	`aiHover: "XXX"`	Trigger dropdown menu or tooltip
"Enter YYY in XXX"	`aiInput: "XXX"` + `value: "YYY"`	Flat sibling format; optional `mode: "replace"\|"clear"\|"typeOnly"\|"append"`
"Press XXX key on keyboard"	`aiKeyboardPress: "XXX"`	Supports key combinations like "Control+A"; `keyName` can be used as an alternative parameter
"Scroll down/up/left/right"	`aiScroll: "Target area"` + `direction: "down"`	Flat sibling format; optional `distance` , `scrollType`
"Wait for XXX to appear"	`aiWaitFor: "XXX"`	Optional timeout (in milliseconds)
"Check/verify/confirm XXX"	`aiAssert: "XXX"`	Optional errorMessage
"Get/extract/read XXX"	`aiQuery: { query: "XXX", name: "result" }`	name is used to store the result
"Pause/wait N seconds"	`sleep: N*1000`	Parameter is in milliseconds
"Execute JS code"	`javascript: "Code content"`	Execute JavaScript directly
"Take screenshot and record to report"	`recordToReport: "Title"` + `content: "Description"`	Take screenshot and record description to report
"Double-click XXX"	`aiDoubleClick: "XXX"`	Double-click operation; optional `deepThink: true`
"Right-click XXX"	`aiRightClick: "XXX"`	Right-click operation; optional `deepThink: true`
"Locate XXX element"	`aiLocate: "XXX"` + `name: "elem"`	Locate element, store result in variable (referencable in Extended mode)
"Is XXX true?"	`aiBoolean: "XXX"` + `name: "flag"`	Returns boolean value; optional `domIncluded` / `screenshotIncluded`
"Get the number of XXX"	`aiNumber: "XXX"` + `name: "count"`	Returns number; optional `domIncluded` / `screenshotIncluded`
"Get text of XXX"	`aiString: "XXX"` + `name: "text"`	Returns string; optional `domIncluded` / `screenshotIncluded`
"Ask AI about XXX"	`aiAsk: "XXX"` + `name: "answer"`	Free-form question, returns text answer
"Drag A to B"	`aiDragAndDrop: "A"` + `to: "B"`	Flat format; or nested `{ from: "A", to: "B" }`
"Clear XXX input box"	`aiClearInput: "XXX"`	Clear input box content
"Execute ADB command"	`runAdbShell: "Command"`	Android platform only
"Execute WDA request"	`runWdaRequest: { ... }`	iOS platform only
"Launch app"	`launch: "Package name"`	Mobile app launch

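As an illustration of how rows from this table combine, the sketch below converts the requirement "open the Laptops category, filter the list by 'ultrabook', count the results, and record a screenshot". Every action key comes from the table; the URL, element descriptions, and the `productCount` name are hypothetical.

```yaml
# Hypothetical conversion: site, descriptions, and names are placeholders.
engine: native

web:
  url: "https://example.com/shop"

tasks:
  - name: "Filter laptops and count results"
    flow:
      - aiHover: "Products item in the top navigation bar"
      - aiTap: "Laptops entry in the dropdown menu"
      - aiScroll: "Product list area"
        direction: "down"
      - aiClearInput: "Filter input above the product list"
      - aiInput: "Filter input above the product list"
        value: "ultrabook"
      - aiKeyboardPress: "Enter"
      - aiNumber: "Number of products shown in the list"
        name: "productCount"
      - recordToReport: "Filtered product list"
        content: "State of the list after applying the ultrabook filter"
```
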
Extended Control Flow Mapping

Natural Language Pattern	YAML Mapping
"Define variable XXX as YYY"	`variables: { XXX: "YYY" }`
"Use environment variable XXX"	`${ENV:XXX}` or `${ENV.XXX}`
"If XXX then YYY else ZZZ"	`logic: { if: "XXX", then: [YYY], else: [ZZZ] }`
"Repeat N times"	`loop: { type: repeat, count: N, steps: [...] }`
"Execute for each XXX"	`loop: { type: for, items: "XXX", itemVar: "item", steps: [...] }` ( `itemVar` / `as` / `item` are all acceptable)
"Continue doing YYY while XXX"	`loop: { type: while, condition: "XXX", maxIterations: N, steps: [...] }`
"Do A first, do B if it fails"	`try: { steps: [A] }, catch: { steps: [B] }`
"Do A and B simultaneously"	`parallel: { branches: [{steps: [A]}, {steps: [B]}], waitAll: true, merge_results: true }`
"Call XXX API"	`external_call: { type: http, method: POST, url: "XXX", response_as: "varName" }`
"Execute Shell command"	`external_call: { type: shell, command: "XXX" }`
"Import/reuse XXX flow"	`import: [{ flow: "XXX.yaml", as: name }]`
"Filter/sort/map data"	`data_transform: { source, operation, ... }`

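A pagination requirement such as "search for a keyword and keep clicking Next while more pages exist" might be converted as sketched below. The keys follow the table above, but the top-level placement of `variables`, the use of natural-language conditions, and every description string are assumptions rather than confirmed behavior.

```yaml
# Hypothetical sketch: URL, variable name, and conditions are placeholders.
engine: extended
features: [logic, variables, loop]

variables:
  keyword: "Midscene"

web:
  url: "https://example.com/search"

tasks:
  - name: "Collect results across pages"
    flow:
      - aiInput: "Search box"
        value: "${keyword}"
      - aiKeyboardPress: "Enter"
      - loop:
          type: while
          condition: "A clickable Next page button is visible"
          maxIterations: 5          # safety upper limit for the while loop
          steps:
            - aiQuery:
                query: "Extract the titles of all results on the current page as an array of strings"
                name: "pageTitles"
            - aiTap: "Next page button"
            - aiWaitFor: "The next page of results has loaded"
              timeout: 10000
      - logic:
          if: "The page shows a 'no more results' message"
          then:
            - recordToReport: "Last page reached"
              content: "Pagination finished before the iteration limit"
          else:
            - recordToReport: "Iteration limit reached"
              content: "Stopped after maxIterations pages"
```
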
Step 4: Select Template Starting Point

Refer to the template files in the `templates/` directory and use the template closest to the user's requirement as the starting point:

Native Templates:

`templates/native/web-basic.yaml`: Basic web operations
`templates/native/web-login.yaml`: Login flow
`templates/native/web-data-extract.yaml`: Data extraction
`templates/native/web-search.yaml`: Web search flow
`templates/native/web-file-upload.yaml`: File upload form
`templates/native/web-multi-tab.yaml`: Multi-tab operations
`templates/native/deep-think-locator.yaml`: Image-assisted location (deepThink/xpath)
`templates/native/android-app.yaml`: Android testing
`templates/native/ios-app.yaml`: iOS testing
`templates/native/computer-desktop.yaml`: Desktop app automation

Extended Templates:

`templates/extended/web-conditional-flow.yaml`: Conditional branching
`templates/extended/web-pagination-loop.yaml`: Pagination loop
`templates/extended/web-data-pipeline.yaml`: Data pipeline
`templates/extended/multi-step-with-retry.yaml`: Multi-step with retry
`templates/extended/api-integration-test.yaml`: API integration
`templates/extended/e2e-workflow.yaml`: End-to-end complete workflow
`templates/extended/reusable-sub-flows.yaml`: Sub-flow reuse (import/use)
`templates/extended/responsive-test.yaml`: Multi-viewport responsive testing
`templates/extended/web-auth-flow.yaml`: OAuth/login authentication flow (using variables and environment references)

Template Selection Decision:

Requirement Feature	Recommended Template
Simple page operations (open, click, input)	`native/web-basic.yaml`
Login / Form filling	`native/web-login.yaml`
Data collection / Information extraction	`native/web-data-extract.yaml`
Search + Result validation	`native/web-search.yaml`
File upload / Attachment submission	`native/web-file-upload.yaml`
OAuth/Third-party authentication login	`extended/web-auth-flow.yaml`
Desktop app automation (non-browser)	`native/computer-desktop.yaml`
Conditional judgment needed (If logged in then...)	`extended/web-conditional-flow.yaml`
Pagination / List traversal needed	`extended/web-pagination-loop.yaml`
Data filtering / Sorting / Aggregation	`extended/web-data-pipeline.yaml`
Retry on failure needed	`extended/multi-step-with-retry.yaml`
External API call needed	`extended/api-integration-test.yaml`
Complete business flow (multi-step + variables + export)	`extended/e2e-workflow.yaml`
Sub-flow reuse / Modularization	`extended/reusable-sub-flows.yaml`
Multi-screen size responsive validation	`extended/responsive-test.yaml`
Complex element location / deepThink	`native/deep-think-locator.yaml`
Multi-tab operations	`native/web-multi-tab.yaml`

Step 5: Generate YAML

Generate YAML content based on the templates and conversion rules, paying attention to the following points:

File header: Add comments stating the requirement source and generation time
`engine` field: Extended mode must explicitly declare `engine: extended`
`features` list: In Extended mode, declare the features used (e.g., `features: [logic, variables, loop]`); it can be omitted in Native mode
`agent` configuration (optional): `testId` identifies the test, `groupName` / `groupDescription` classify reports, and `cache: true` caches AI results to speed up repeated runs
`aiActContext` (optional): Provides additional context for the AI agent (such as the language of a multilingual website or special domain terms); set it in `agent: { aiActContext: "Description" }`
`continueOnError` (optional): Set `continueOnError: true` to continue executing subsequent tasks after a task fails
`output` export (optional): Export results such as `aiQuery` data to a JSON file for use in subsequent processes

Output Format

yaml

# Auto-generated by Midscene YAML Generator
# Requirement Description: [Original user requirement]
# Generation Time: [timestamp]

engine: native|extended
features: [...]  # Extended mode only

# Optional: agent configuration
# agent:
#   testId: "test-001"
#   groupName: "Automation Testing Group"
#   groupDescription: "Description"
#   cache: true

[platform_config]

tasks:
  - name: "[Task Name]"
    # continueOnError: true  # Optional: Continue on failure
    flow:
      [Generated steps]
    # output:                # Optional: Export data
    #   filePath: "./midscene-output/data.json"
    #   dataName: "variableName"
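
A filled-in instance of this skeleton could look like the sketch below. It only substitutes placeholder values into the fields shown above; the requirement text, URL, test identifiers, and output file name are hypothetical, and the generation time is left as a placeholder.

```yaml
# Auto-generated by Midscene YAML Generator
# Requirement Description: Extract the product list from the demo shop
# Generation Time: [timestamp]

engine: native

agent:
  testId: "product-extract-001"        # hypothetical identifier
  groupName: "Demo shop checks"
  groupDescription: "Data extraction smoke tests"
  cache: true

web:
  url: "https://example.com/products"  # placeholder URL

tasks:
  - name: "Extract product list"
    continueOnError: true
    flow:
      - aiWaitFor: "The product list is visible"
        timeout: 10000
      - aiQuery:
          query: "Extract all products as an array of objects with name (string) and price (number)"
          name: "productList"
    output:
      filePath: "./midscene-output/products.json"
      dataName: "productList"
```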

Step 6: Validate and Output

Output the file to the `./midscene-output/` directory
Call the validator to confirm the YAML is valid:
```bash
node scripts/midscene-run.js <file> --dry-run
```
If validation fails, analyze the cause of the error and auto-fix
After validation passes, prompt the user to execute the file using Runner:
```bash
node scripts/midscene-run.js <file>
```

Best Practices for Writing AI Instructions

When generating YAML, the quality of AI instructions (the parameters of `aiTap`, `aiAssert`, etc.) directly affects the execution success rate. Follow these principles:

Description Precision

Poor: `aiTap: "Button"` (there may be multiple buttons on the page)
Good: `aiTap: "Blue login button at the top right corner of the page"` (position + color + function)
Better: `aiTap: "Button with text 'Login Now' in the navigation bar"` (precise to the text content)

Location Strategy Priority

Natural language description (Preferred): High readability, adapts to page changes
deepThink mode: Enable when a complex page contains multiple similar elements; AI performs a deeper analysis with higher accuracy, at the cost of longer execution time
Image-assisted location (image prompting): When a text description is insufficient, annotated screenshots can help AI understand the target element (the official `locate.images` capability)
xpath selector (last resort): Use when natural language cannot locate the element precisely. Note: xpath applies only to the Web platform; Android/iOS should use natural language descriptions

yaml

# Prefer natural language
- aiTap: "Edit button in the third row of the product list"

# Enable deepThink for complex scenarios (when there are many similar elements or location is inaccurate)
- aiTap: "Edit icon in the third row of data"
  deepThink: true

# Last resort: use xpath (Web platform only)
- aiTap: ""
  xpath: "//table/tbody/tr[3]//button[@class='edit']"

Image-assisted Location (locate object)

When a natural language description is not precise enough, reference images can be provided via the `locate` object:

yaml

# Use image to assist AI in identifying target element
- aiTap:
    locate:
      prompt: "Icon button similar to the reference image"
      images:
        - name: "target-icon"
          url: "https://example.com/icon.png"
      convertHttpImage2Base64: true

# Simplified form: directly provide in images option
- aiTap: "Icon button similar to the reference image"
  images:
    - "./images/target-icon.png"

aiQuery Result Formatting

Clearly specify the expected data structure in `query`:

yaml

- aiQuery:
    query: >
      Extract all product information on the page and return it as an array.
      Each element should contain the following fields:
      - name: Product name (string)
      - price: Price (number)
      - inStock: In stock (boolean)
    name: "productList"

Wait Strategy

Add `aiWaitFor` after key operations to ensure the page state is ready:

yaml

- aiTap: "Submit button"
- aiWaitFor: "Submit success prompt appears, or page redirects to result page"
  timeout: 10000

Data Transformation Operation Reference

Operations supported by `data_transform` in Extended mode:

Operation	Description	Key Parameters
`filter`	Filter by condition	`condition` (JS expression, use `item` to reference current element)
`sort`	Sort	`by` (field name), `order` (asc/desc)
`map`	Map/Transform	`template` (field mapping template)
`reduce`	Aggregation calculation	`reducer` (JS expression), `initial` (initial value)
`unique` / `distinct`	Deduplicate	`by` (field for deduplication)
`slice`	Extract subset	`start` , `end`
`flatten`	Flatten nested array	`depth` (flatten depth, default 1)
`groupBy`	Group by field	`by` or `field` (field name for grouping)

Two Formats: The flat format `{source, operation, name}` is suitable for single-step operations; the nested format `{input, operations:[], output}` supports chained multi-step operations. Both formats support all 8 operations.
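
The flat format might be used as in the sketch below, which filters a previously extracted list and then sorts the result in a second step. The keys `source`, `operation`, `name`, `condition`, `by`, and `order` come from the table and the format description above. How `source` refers to an earlier result (written here as a `${...}` reference) and the feature name `data_transform` are assumptions, as are the variable names.

```yaml
# Hypothetical sketch: variable names, the reference syntax for `source`,
# and the `data_transform` feature name are assumptions.
engine: extended
features: [variables, data_transform]

web:
  url: "https://example.com/products"

tasks:
  - name: "Filter and sort products"
    flow:
      - aiQuery:
          query: "Extract all products as an array of objects with name (string), price (number), inStock (boolean)"
          name: "productList"
      # Step 1: keep only items that are in stock
      - data_transform:
          source: "${productList}"
          operation: filter
          condition: "item.inStock === true"
          name: "inStockProducts"
      # Step 2: sort the filtered list by ascending price
      - data_transform:
          source: "${inStockProducts}"
          operation: sort
          by: "price"
          order: "asc"
          name: "sortedProducts"
```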

Platform-Specific Notes

Web Platform

`url` must include the full protocol (`https://`)
Use `aiWaitFor` to wait for the page to finish loading before operating on it
Ensure input boxes are interactive before form operations

Android Platform

Configure `deviceId` (the ADB device ID, e.g., `emulator-5554`)
Use `launch: "com.example.app"` to launch the app (as an action step in `flow`)
Use `runAdbShell` to execute ADB commands
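
Put together, an Android file could look roughly like the following. The structure follows the platform table and the notes above; the package name, the wake-up command passed to `runAdbShell`, and the screen descriptions are illustrative assumptions.

```yaml
# Hypothetical Android sketch: package name and descriptions are placeholders.
engine: native

android:
  deviceId: "emulator-5554"

tasks:
  - name: "Open the app and check the home screen"
    flow:
      # Wake the device first; the command is passed to adb shell as-is
      - runAdbShell: "input keyevent KEYCODE_WAKEUP"
      - launch: "com.example.app"
      - aiWaitFor: "The app home screen is visible"
        timeout: 15000
      - aiAssert: "A bottom navigation bar with a Home tab is displayed"
        errorMessage: "Home screen did not load"
```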

iOS Platform

Configure `wdaPort` (WebDriverAgent port, default 8100) and `wdaHost` (default localhost)
Use `launch: "com.example.app"` to launch the app (as an action step in `flow`)
Use `runWdaRequest` to send WebDriverAgent requests
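
The iOS counterpart might be sketched as follows. Placing `wdaHost` next to `wdaPort` under `ios` is an assumption based on the notes above, and the bundle ID and screen descriptions are placeholders. `runWdaRequest` is left out because its parameters are not specified in this document.

```yaml
# Hypothetical iOS sketch: bundle ID and descriptions are placeholders.
engine: native

ios:
  wdaPort: 8100
  wdaHost: "localhost"   # assumed to sit alongside wdaPort

tasks:
  - name: "Open the app and verify the login screen"
    flow:
      - launch: "com.example.app"
      - aiWaitFor: "The login screen is visible"
        timeout: 15000
      - aiAssert: "The screen shows a username field and a password field"
        errorMessage: "Login screen did not appear"
```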

Computer Platform

For general desktop automation scenarios

Common Anti-patterns

Avoid the following common mistakes when generating YAML:

Unnecessary use of the nested object format: The flat format (`aiInput: "Search box"` + `value: "Keyword"`) is recommended because it is more concise and readable. The nested format (`aiInput: { locator: "Search box", value: "Keyword" }`) is valid in both modes but is usually needed only for complex parameters such as `locate` image positioning
Missing `engine: extended` in Extended mode: The engine must be declared when using any extended feature (variables, loops, conditions, etc.)
Forgetting `maxIterations` in loops: `while` loops must set a safety upper limit, and the count of `for` and `repeat` loops should not exceed 10000
Using the nested object format for `aiWaitFor`: Use `aiWaitFor: "Condition"` + `timeout: 10000` instead of `aiWaitFor: { condition: "Condition" }`
Missing `features` declaration: Extended mode should list the features used to facilitate detection and optimization

Pre-output Self-check List

After generating YAML, verify the following items before output:

Does each `aiInput` have a corresponding `value` parameter?
Is there an `aiWaitFor` after key operations to ensure the page state is ready?
Does Extended mode declare `engine: extended` and a `features` list?
Does every loop have a safety upper limit (`maxIterations` or a reasonable `count`)?
Is sensitive information (passwords, tokens) referenced via environment variables using `${ENV:XXX}`?
Are AI instruction descriptions precise enough (including features like position, text, color)?
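
As a sketch of a file that would pass this checklist, consider the login flow below. The environment variable names `LOGIN_USER` and `LOGIN_PASSWORD`, the URL, and the element descriptions are hypothetical, and listing `variables` as the feature covering environment references is an assumption.

```yaml
# Hypothetical login flow: env variable names and URL are placeholders.
engine: extended
features: [variables]

web:
  url: "https://example.com/login"

tasks:
  - name: "Log in with credentials from the environment"
    flow:
      # Every aiInput carries a value; secrets come from the environment
      - aiInput: "Username input box in the login form"
        value: "${ENV:LOGIN_USER}"
      - aiInput: "Password input box in the login form"
        value: "${ENV:LOGIN_PASSWORD}"
      - aiTap: "Button with text 'Log in' below the password field"
      # Wait after the key operation before asserting
      - aiWaitFor: "The dashboard page is displayed after login"
        timeout: 10000
      - aiAssert: "The user avatar is visible in the top right corner"
        errorMessage: "Login did not reach the dashboard"
```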

Notes

Parameters for AI instructions (`aiTap`, `aiAssert`, etc.) are described in natural language; no CSS selectors are needed
Both Chinese and English descriptions are acceptable; Midscene's AI engine supports multiple languages
Results of `aiQuery` are stored via the `name` field and can be referenced in subsequent steps using `${name}` (Extended mode only)
Set a reasonable `timeout` (in milliseconds) for `aiWaitFor`; the default is usually 15 seconds
Be sure to set `maxIterations` as a safety upper limit in loops to prevent infinite loops
`${ENV:XXX}` or `${ENV.XXX}` can be used to reference environment variables, which avoids hardcoding sensitive information in YAML
Always explicitly declare the `engine` field to avoid unexpected behavior from automatic detection
Variable references are case-sensitive: `${userName}` and `${username}` are different variables
Avoid circular imports: Importing B.yaml in A.yaml and A.yaml in B.yaml causes runtime errors
Be sure to verify syntax and structure via `--dry-run` after generation (note: `--dry-run` does not check model configuration; AI operations require `MIDSCENE_MODEL_API_KEY` to be configured for actual execution)
Prompt users to use the Midscene Runner skill to execute the generated file

Iterative Fix Process

When the generated YAML fails to execute:

Runner can fix it automatically: If the error can be resolved by modifying YAML (e.g., imprecise location description, insufficient wait time), Runner Skill will directly modify and retry
When regeneration is needed: If the error involves fundamental design issues (e.g., wrong mode selected, missing key steps), users can describe the failure to Generator, which will regenerate an improved YAML based on the error information
Recommended Flow: Generate → dry-run validation → Execute → If failed, describe error for Generator to fix → Re-execute

Collaboration Agreement

After generation is complete, return the following structured information to the user:

Generated File Path: `./midscene-output/<filename>.yaml`
Execution Mode: native or extended
Recommended Next Command: `node scripts/midscene-run.js <path> --dry-run`
If dry-run validation fails, automatically analyze the error, fix the YAML, and re-validate
