Midscene YAML Generator
Typical Workflow
User Requirement → [Generator] Generate YAML
→ [Generator] Auto dry-run validation
→ Validation failed? → [Generator] Auto-fix
→ [Runner] Execute
→ Execution failed? → [Runner] Analyze + Fix YAML → Re-execute
→ Success → Display report summary
Trigger Conditions
Use this when users describe a browser automation requirement (in natural language) and need to generate a Midscene YAML file.
Common trigger phrases:
- "Generate a YAML to..."
- "Help me write an automation script..."
- "Create a Midscene test case..."
- "I want to automate the XXX operation..."
- "Automate the login flow"
- "Convert this requirement to YAML..."
- "Write a Midscene config file..."
Workflow
Step 1: Analyze Requirement Complexity
Determine the required mode based on the user's description:
Choose Native Mode — When the requirement only involves:
- Open web pages / Launch applications
- Basic interactions like clicking, hovering, inputting, scrolling, keyboard operations
- AI automatic planning and execution (ai)
- Data extraction (aiQuery)
- Validation assertions (aiAssert)
- Wait conditions (aiWaitFor)
- Tool operations (such as sleep and javascript)
- Platform-specific operations (such as launch)
Choose Extended Mode — When the requirement involves any of the following:
- Conditional judgment ("If...then...")
- Loop operations ("Repeat", "Traverse", "Pagination")
- Variables and dynamic data ("Define variables", "Parameterization")
- External API calls ("Call API")
- Error handling and retries ("If failed...", "Retry")
- Parallel tasks ("Do...simultaneously")
- Data transformation processing ("Filter", "Sort", "Map")
- Import and reuse sub-flows ("Reuse", "Import")
Rule of Thumb: Start with Native mode, and switch to Extended mode only when you find you need conditional logic, loops, or variables.
Step 2: Determine Target Platform
Determine platform configuration based on user description:
| User Description | Platform | YAML Configuration |
|---|---|---|
| "Open web page/website/URL" | Web | web: { url: "...", headless: false } |
| "Test Android app" | Android | android: { deviceId: "..." } plus a launch step in flow |
| "Test iOS app" | iOS | iOS platform block (WebDriverAgent port and host) plus a launch step in flow |
| "Desktop automation" | Computer | Computer platform block |
Additional Web Platform Configuration Options:
- — Run in headless mode (default false)
- / — Viewport size (default 1280×720)
- — Custom User-Agent
- — Device pixel ratio (e.g., set to 2 for Retina screens)
- — Network idle wait configuration, supports or object format
{ timeout: 2000, continueOnNetworkIdleError: true }
- — Path to Cookie JSON file (enables login-free session recovery)
- — Bridge mode: (default) | | , reuses logged-in desktop browser
- — Array of custom Chrome launch arguments (e.g.,
['--disable-gpu', '--proxy-server=...']
)
- — Local static file directory, starts built-in server
- — Ignore HTTPS certificate errors (default false)
- — Restrict navigation to current tab (default true)
Step 3: Natural Language → YAML Conversion
Action Selection Priority (Important)
- Prefer ai: Describe the entire intent in natural language and let the AI plan and execute the multi-step operation automatically. This suits most scenarios and has the highest success rate.
- When precise control is needed: Use specific actions such as aiTap and aiInput (e.g., filling in specific form fields).
- When data extraction is needed: You must use aiQuery (ai cannot return structured data).
- When state validation is needed: Use aiAssert or aiWaitFor.
Rule of Thumb: If the user's requirement can be described in a single natural-language sentence, prefer one ai step over splitting it into multiple aiTap + aiInput steps.
Golden Path - Minimal Working Example:
```yaml
web:
  url: "https://www.baidu.com"
tasks:
  - name: "Search for Midscene"
    flow:
      - ai: "Enter Midscene in the search box and click search"
      - sleep: 3000
      - aiAssert: "The page displays search results"
```
Native Mode YAML Format Specification (Important)
Native mode supports two formats for action parameters:
Flat Format (Recommended, concise): Action keyword followed by string value, additional parameters as sibling keys.
```yaml
- aiInput: "Search box"
  value: "Keyword"
- aiWaitFor: "Page loaded completely"
  timeout: 10000
- aiTap: "Button description"
  deepThink: true
- aiAssert: "Page contains expected content"
  errorMessage: "Content validation failed"
```
Nested Format (Also valid, suitable for complex parameters):
```yaml
- aiInput:
    locator: "Search box"
    value: "Keyword"
- aiQuery:
    query: "Extract product list"
    name: "products"
```
Use the following mapping rule table to convert user requirements to YAML:
Native Action Mapping
| Natural Language Pattern | YAML Mapping | Description |
|---|---|---|
| "Open/access/enter XXX website" | web: { url: "XXX" } | Platform configuration |
| "Automatically plan and execute XXX" | ai: "XXX" | AI automatically breaks the task down into multiple steps; optional fileChooserAccept: "path" to handle file upload dialogs |
| "Click/press/select XXX" | aiTap: "XXX" | Short form |
| "Hover/move over XXX" | | Trigger dropdown menu or tooltip |
| "Enter YYY in XXX" | aiInput: "XXX" + value: "YYY" | Flat sibling format; optional mode: "replace"\|"clear"\|"typeOnly"\|"append" |
| "Press XXX key on keyboard" | | Supports key combinations like "Control+A"; can be used as an alternative parameter |
| "Scroll down/up/left/right" | + | Flat sibling format; optional , |
| "Wait for XXX to appear" | aiWaitFor: "XXX" | Optional timeout (in milliseconds) |
| "Check/verify/confirm XXX" | aiAssert: "XXX" | Optional errorMessage |
| "Get/extract/read XXX" | aiQuery: { query: "XXX", name: "result" } | name is used to store the result |
| "Pause/wait N seconds" | sleep: N × 1000 | Parameter is in milliseconds |
| "Execute JS code" | javascript: "Code content" | Execute JavaScript directly |
| "Take screenshot and record to report" | + | Take screenshot and record description to report |
| "Double-click XXX" | | Double-click operation; optional |
| "Right-click XXX" | | Right-click operation; optional |
| "Locate XXX element" | + | Locate element, store result in variable (referencable in Extended mode) |
| "Is XXX true?" | + | Returns boolean value; optional / |
| "Get the number of XXX" | + | Returns number; optional / |
| "Get text of XXX" | + | Returns string; optional / |
| "Ask AI about XXX" | + | Free-form question, returns text answer |
| "Drag A to B" | + | Flat format; or nested |
| "Clear XXX input box" | | Clear input box content |
| "Execute ADB command" | | Android platform only |
| "Execute WDA request" | | iOS platform only |
| "Launch app" | launch: "com.example.app" | Mobile app launch |
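As a rough illustration of how these mappings combine, the sketch below turns the requirement "log in, then read the user name shown in the header" into Native-mode steps. The URL, element descriptions, credentials, and the result name userName are hypothetical placeholders; the action keywords are the ones used elsewhere in this document.

```yaml
web:
  url: "https://example.com/login"   # hypothetical URL
tasks:
  - name: "Log in and read the user name"
    flow:
      # "Enter YYY in XXX" -> aiInput + value (flat sibling format)
      - aiInput: "Username input box"
        value: "demo-user"            # hypothetical credential
      - aiInput: "Password input box"
        value: "demo-password"        # hypothetical credential
      # "Click XXX" -> aiTap
      - aiTap: "Login button below the password box"
      # "Wait for XXX to appear" -> aiWaitFor with an optional timeout (ms)
      - aiWaitFor: "The user avatar appears in the page header"
        timeout: 10000
      # "Get/extract XXX" -> aiQuery; name stores the result
      - aiQuery:
          query: "The user name shown next to the avatar, as a string"
          name: "userName"
      # "Check/verify XXX" -> aiAssert with an optional errorMessage
      - aiAssert: "The page header shows a logged-in state"
        errorMessage: "Login did not succeed"
```

If the whole requirement fits in one sentence, a single ai step may be a simpler starting point than this step-by-step form.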
Extended Control Flow Mapping
| Natural Language Pattern | YAML Mapping |
|---|---|
| "Define variable XXX as YYY" | variables: { XXX: "YYY" } |
| "Use environment variable XXX" | |
| "If XXX then YYY else ZZZ" | logic: { if: "XXX", then: [YYY], else: [ZZZ] } |
| "Repeat N times" | loop: { type: repeat, count: N, steps: [...] } |
| "Execute for each XXX" | loop: { type: for, items: "XXX", itemVar: "item", steps: [...] } |
| "Continue doing YYY while XXX" | loop: { type: while, condition: "XXX", maxIterations: N, steps: [...] } |
| "Do A first, do B if it fails" | try: { steps: [A] }, catch: { steps: [B] } |
| "Do A and B simultaneously" | parallel: { branches: [{steps: [A]}, {steps: [B]}], waitAll: true, merge_results: true } |
| "Call XXX API" | external_call: { type: http, method: POST, url: "XXX", response_as: "varName" } |
| "Execute Shell command" | external_call: { type: shell, command: "XXX" } |
| "Import/reuse XXX flow" | import: [{ flow: "XXX.yaml", as: name }] |
| "Filter/sort/map data" | data_transform: { source, operation, ... } |
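A minimal sketch of an Extended-mode file that combines a conditional with a repeat loop, using only the logic and loop shapes listed in the table above. It assumes that logic and loop blocks are written as steps inside flow and that the if condition is a natural-language description; the URL and element descriptions are hypothetical.

```yaml
engine: extended
features: [logic, loop]
web:
  url: "https://example.com/articles"   # hypothetical URL
tasks:
  - name: "Dismiss banner, then load more articles"
    flow:
      # "If XXX then YYY else ZZZ" (placement inside flow is an assumption)
      - logic:
          if: "A cookie consent banner is visible"
          then:
            - aiTap: "Accept button in the cookie banner"
          else:
            - aiAssert: "The article list is visible"
      # "Repeat N times"
      - loop:
          type: repeat
          count: 3
          steps:
            - aiTap: "Load more button at the bottom of the list"
            - aiWaitFor: "More articles have been appended to the list"
              timeout: 10000
```

Running the file with --dry-run before execution is the quickest way to find out whether the assumed nesting matches what the validator expects.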
Step 4: Select Template Starting Point
Refer to the template files in the templates/ directory and pick the template closest to the user's requirement as the starting point:
Native Templates:
- templates/native/web-basic.yaml: Basic web operations
- templates/native/web-login.yaml: Login flow
- templates/native/web-data-extract.yaml: Data extraction
- templates/native/web-search.yaml: Web search flow
- templates/native/web-file-upload.yaml: File upload form
- templates/native/web-multi-tab.yaml: Multi-tab operations
- templates/native/deep-think-locator.yaml: Image-assisted location (deepThink/xpath)
- templates/native/android-app.yaml: Android testing
- templates/native/ios-app.yaml: iOS testing
- templates/native/computer-desktop.yaml: Desktop app automation
Extended Templates:
- templates/extended/web-conditional-flow.yaml: Conditional branching
- templates/extended/web-pagination-loop.yaml: Pagination loop
- templates/extended/web-data-pipeline.yaml: Data pipeline
- templates/extended/multi-step-with-retry.yaml: Multi-step with retry
- templates/extended/api-integration-test.yaml: API integration
- templates/extended/e2e-workflow.yaml: End-to-end complete workflow
- templates/extended/reusable-sub-flows.yaml: Sub-flow reuse (import/use)
- templates/extended/responsive-test.yaml: Multi-viewport responsive testing
- templates/extended/web-auth-flow.yaml: OAuth/login authentication flow (using variables and environment references)
Template Selection Decision:
| Requirement Feature | Recommended Template |
|---|---|
| Simple page operations (open, click, input) | native/web-basic.yaml |
| Login / Form filling | native/web-login.yaml |
| Data collection / Information extraction | native/web-data-extract.yaml |
| Search + Result validation | native/web-search.yaml |
| File upload / Attachment submission | native/web-file-upload.yaml |
| OAuth/Third-party authentication login | extended/web-auth-flow.yaml |
| Desktop app automation (non-browser) | native/computer-desktop.yaml |
| Conditional judgment needed (If logged in then...) | extended/web-conditional-flow.yaml |
| Pagination / List traversal needed | extended/web-pagination-loop.yaml |
| Data filtering / Sorting / Aggregation | extended/web-data-pipeline.yaml |
| Retry on failure needed | extended/multi-step-with-retry.yaml |
| External API call needed | extended/api-integration-test.yaml |
| Complete business flow (multi-step + variables + export) | extended/e2e-workflow.yaml |
| Sub-flow reuse / Modularization | extended/reusable-sub-flows.yaml |
| Multi-screen size responsive validation | extended/responsive-test.yaml |
| Complex element location / deepThink | native/deep-think-locator.yaml |
| Multi-tab operations | native/web-multi-tab.yaml |
Step 5: Generate YAML
Generate the YAML content from the chosen template and the conversion rules, paying attention to the following points:
- File Header: Add comments explaining the requirement source and generation time
- engine field: Extended mode must explicitly declare engine: extended
- features list: In Extended mode, declare the features used (e.g., features: [logic, variables, loop]); it can be omitted in Native mode
- agent configuration (optional): testId identifies the test, groupName / groupDescription classify the report, and cache can cache AI results to speed up repeated runs
- aiActContext (optional): Provide additional context for the AI agent (such as the language of a multilingual website or special domain terms), set in agent: { aiActContext: "Description" }
- continueOnError (optional): To continue executing subsequent tasks after a task fails, set continueOnError: true
- output export (optional): Export results such as aiQuery data to a JSON file for use in subsequent processes
Output Format
```yaml
# Auto-generated by Midscene YAML Generator
# Requirement Description: [Original user requirement]
# Generation Time: [timestamp]
engine: native|extended
features: [...] # Extended mode only
# Optional: agent configuration
# agent:
#   testId: "test-001"
#   groupName: "Automation Testing Group"
#   groupDescription: "Description"
#   cache: true
[platform_config]
tasks:
  - name: "[Task Name]"
    # continueOnError: true # Optional: Continue on failure
    flow:
      [Generated steps]
    # output: # Optional: Export data
    #   filePath: "./midscene-output/data.json"
    #   dataName: "variableName"
```
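To make the placeholders concrete, here is one possible filled-in Native-mode file. The requirement, URL, agent values, and step descriptions are hypothetical; the field names (engine, agent, testId, groupName, groupDescription, cache, aiActContext, continueOnError) are the ones described in Step 5.

```yaml
# Auto-generated by Midscene YAML Generator
# Requirement Description: Check that the pricing page lists a free plan
# Generation Time: [timestamp]
engine: native
agent:
  testId: "pricing-check-001"                 # hypothetical identifier
  groupName: "Marketing site checks"
  groupDescription: "Smoke checks for public pages"
  cache: true
  aiActContext: "The site is in English; 'plan' means a subscription tier"
web:
  url: "https://example.com/pricing"          # hypothetical URL
tasks:
  - name: "Verify free plan"
    continueOnError: true                     # keep running later tasks if this one fails
    flow:
      - aiWaitFor: "The pricing table is visible"
        timeout: 10000
      - aiAssert: "The pricing table contains a plan labeled Free"
        errorMessage: "Free plan not found"
```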
Step 6: Validate and Output
- Output the file to the ./midscene-output/ directory
- Call the validator to confirm the YAML is valid:
```bash
node scripts/midscene-run.js <file> --dry-run
```
- If validation fails, analyze the error cause and auto-fix
- After validation passes, prompt the user to execute using Runner:
```bash
node scripts/midscene-run.js <file>
```
Best Practices for Writing AI Instructions
When generating YAML, the quality of the AI instructions (the parameters for aiTap, aiAssert, etc.) directly affects the execution success rate. Follow these principles:
Description Precision
- Poor: — There may be multiple buttons on the page
- Good: aiTap: "Blue login button at the top right corner of the page" (position + color + function)
- Better: aiTap: "Button with text 'Login Now' in the navigation bar" (precise to text content)
Location Strategy Priority
- Natural language description (Preferred): High readability, adapts to page changes
- deepThink mode: Enable when there are multiple similar elements on complex pages, AI will perform deeper analysis with higher accuracy but longer time consumption
- Image-assisted location (image prompting): When text description is insufficient, screenshot annotations can be used to help AI understand the target element (official capability)
- xpath selector (Last resort): When natural language cannot locate precisely. Note: xpath is only applicable to Web platform, Android/iOS should use natural language description
```yaml
# Prefer natural language
- aiTap: "Edit button in the third row of the product list"
# Enable deepThink for complex scenarios (when there are many similar elements or location is inaccurate)
- aiTap: "Edit icon in the third row of data"
  deepThink: true
# Last resort: use xpath (Web platform only)
- aiTap: ""
  xpath: "//table/tbody/tr[3]//button[@class='edit']"
```
Image-assisted Location (locate object)
When a natural language description is not precise enough, reference images can be provided via the locate object:
```yaml
# Use image to assist AI in identifying target element
- aiTap:
    locate:
      prompt: "Icon button similar to the reference image"
      images:
        - name: "target-icon"
          url: "https://example.com/icon.png"
      convertHttpImage2Base64: true
# Simplified form: directly provide in images option
- aiTap: "Icon button similar to the reference image"
  images:
    - "./images/target-icon.png"
```
aiQuery Result Formatting
Clearly specify the expected data structure in the aiQuery query:
```yaml
- aiQuery:
    query: >
      Extract all product information on the page and return it as an array.
      Each element should contain the following fields:
      - name: Product name (string)
      - price: Price (number)
      - inStock: In stock (boolean)
    name: "productList"
```
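A query result stored under a name can then be exported with the optional output block from the Output Format section. The sketch below assumes output sits at task level next to flow and that dataName refers to the aiQuery name; the file path is a hypothetical example.

```yaml
tasks:
  - name: "Extract product list"
    flow:
      - aiQuery:
          query: "Extract all products as an array of objects with name, price and inStock"
          name: "productList"
    # Assumed placement: sibling of flow within the task
    output:
      filePath: "./midscene-output/products.json"   # hypothetical path
      dataName: "productList"                        # matches the aiQuery name above
```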
Wait Strategy
Add aiWaitFor after key operations to ensure the page state is ready:
```yaml
- aiTap: "Submit button"
- aiWaitFor: "Submit success prompt appears, or page redirects to result page"
  timeout: 10000
```
Data Transformation Operation Reference
Operations supported by data_transform in Extended mode:
| Operation | Description | Key Parameters |
|---|---|---|
| Filter by condition | (JS expression, use to reference current element) |
| Sort | (field name), (asc/desc) |
| Map/Transform | (field mapping template) |
| Aggregation calculation | (JS expression), (initial value) |
| / | Deduplicate | (field for deduplication) |
| Extract subset | , |
| Flatten nested array | (flatten depth, default 1) |
| Group by field | or (field name for grouping) |
Two Formats: The flat format {source, operation, name} is suitable for single-step operations; the nested format {input, operations: [], output} supports chained multi-step operations. Both formats support all 8 operations.
Platform-Specific Notes
Web Platform
- url must include the full protocol (e.g., https://)
- Use aiWaitFor to wait for page loading to complete before operations
- Ensure input boxes are interactive before form operations
Android Platform
- Need to configure deviceId (the ADB device ID)
- Use launch: "com.example.app" to launch the app (as an action step in flow)
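A minimal Android sketch under these notes: the platform block carries deviceId and the app is started by a launch step inside flow. The device ID shown is the conventional serial of a local emulator and is only an example; the package name and screen descriptions are hypothetical.

```yaml
android:
  deviceId: "emulator-5554"   # example ADB device ID; replace with the real one
tasks:
  - name: "Open the app and check the home screen"
    flow:
      # launch is an action step in flow, not part of the platform block
      - launch: "com.example.app"
      - aiWaitFor: "The app home screen is displayed"
        timeout: 15000
      - aiAssert: "A bottom navigation bar is visible"
```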
- Can use to execute ADB commands
iOS Platform
- Need to configure (WebDriverAgent port, default 8100) and (default localhost)
- Use launch: "com.example.app" to launch the app (as an action step in flow)
- Can use to send WebDriverAgent requests
Computer Platform
- For general desktop automation scenarios
Common Anti-patterns
Avoid the following common mistakes when generating YAML:
- Unnecessary use of nested object format. Flat format is recommended (aiInput + value) because it is more concise and readable. Nested format (aiInput: { locator: "Search box", value: "Keyword" }) is valid in both modes but is usually only needed for complex parameters such as image-assisted location
- Missing engine: extended in Extended mode. The engine must be declared when using any extended feature (variables, loops, conditions, etc.)
- Forgetting maxIterations in loops. while loops must set a safety upper limit, and the count of repeat and for loops should not exceed 10000
- Using nested object format for aiWaitFor. Use aiWaitFor: "Condition" + timeout instead of aiWaitFor: { condition: "Condition" }
- Missing features declaration. Extended mode should list the features used to facilitate detection and optimization
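The sketch below shows one way a pagination loop might look with these anti-patterns avoided: engine and features are declared, the while loop carries maxIterations, and aiWaitFor uses the flat form with a sibling timeout. The URL, the condition text, and the limit of 20 are hypothetical, and placing loop as a step inside flow is an assumption.

```yaml
engine: extended          # declared explicitly because a loop is used
features: [loop]          # list every extended feature in use
web:
  url: "https://example.com/list"   # hypothetical URL
tasks:
  - name: "Page through the list"
    flow:
      - loop:
          type: while
          condition: "A Next page button is visible and enabled"
          maxIterations: 20          # safety upper limit against infinite loops
          steps:
            - aiTap: "Next page button"
            # Flat form: string condition plus sibling timeout
            - aiWaitFor: "The next page of results has loaded"
              timeout: 10000
```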
Pre-output Self-check List
After generating YAML, verify the following items before output:
Notes
- Parameters for AI instructions (aiTap, aiAssert, etc.) are described in natural language, no CSS selectors needed
- Both Chinese and English descriptions are acceptable, Midscene's AI engine supports multiple languages
- Results of aiQuery are stored via the name field and can be referenced in subsequent steps (Extended mode only)
- It is recommended to set a reasonable timeout (in milliseconds) for aiWaitFor; the default is usually 15 seconds
- Be sure to set maxIterations as a safety upper limit in loops to prevent infinite loops
- or can be used to reference environment variables, avoiding hardcoding sensitive information in YAML
- Always explicitly declare the engine field to avoid unexpected behavior from automatic detection
- Variable references are case-sensitive: and are different variables
- Avoid circular imports: Importing B.yaml in A.yaml and A.yaml in B.yaml will cause runtime errors
- Be sure to verify syntax and structure via --dry-run after generation (Note: --dry-run does not check model configuration; AI operations require the model to be configured for actual execution)
- Prompt users to use the Midscene Runner skill to execute the generated file
Iterative Fix Process
When the generated YAML fails to execute:
- Runner can fix it automatically: If the error can be resolved by modifying YAML (e.g., imprecise location description, insufficient wait time), Runner Skill will directly modify and retry
- When regeneration is needed: If the error involves fundamental design issues (e.g., wrong mode selected, missing key steps), users can describe the failure to Generator, which will regenerate an improved YAML based on the error information
- Recommended Flow: Generate → dry-run validation → Execute → If failed, describe error for Generator to fix → Re-execute
Collaboration Agreement
After generation is complete, return the following structured information to the user:
- Generated File Path: ./midscene-output/<filename>.yaml
- Execution Mode: native or extended
- Recommended Next Command: node scripts/midscene-run.js <path> --dry-run
- If dry-run validation fails, automatically analyze the error, fix the YAML, and re-validate