agent-desktop

CLI tool enabling AI agents to observe and control desktop applications via native OS accessibility trees.
Core principle: agent-desktop is NOT an AI agent. It is a tool that AI agents invoke. It outputs structured JSON with ref-based element identifiers. The observation-action loop lives in the calling agent.

Installation

npm install -g agent-desktop

or

bun install -g --trust agent-desktop

Requires macOS 12+ with Accessibility permission granted to your terminal.

Reference Files

Detailed documentation is split into focused reference files. Read them as needed:
| Reference | Contents |
| --- | --- |
| references/commands-observation.md | snapshot, find, get, is, screenshot, list-surfaces: all flags, output examples |
| references/commands-interaction.md | click, type, set-value, select, toggle, scroll, drag, keyboard, mouse: choosing the right command |
| references/commands-system.md | launch, close, windows, clipboard, wait, batch, status, permissions, version |
| references/workflows.md | 12 common patterns: forms, menus, dialogs, scroll-find, drag-drop, async wait, anti-patterns |
| references/macos.md | macOS permissions/TCC, AX API internals, smart activation chain, surfaces, troubleshooting |

The Observe-Act Loop

Every automation follows this pattern:
1. OBSERVE  → agent-desktop snapshot --app "App Name" -i
2. REASON   → Parse JSON, find target element by ref (@e1, @e2...)
3. ACT      → agent-desktop click @e5  (or type, select, toggle...)
4. VERIFY   → agent-desktop snapshot again to confirm state change
5. REPEAT   → Continue until task is complete
Always snapshot before acting. Refs are snapshot-scoped and become stale after UI changes.
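
The loop above can be sketched from the calling agent's side. This is a minimal Python sketch under stated assumptions, not part of agent-desktop: run_cli, observe_act, and choose_ref are hypothetical names, and the code relies only on the envelope fields shown in this document (ok, data, error.code). The choose_ref callback stands in for the agent's reasoning step. The shape of data is not specified here, so it is passed through untouched.

```python
import json
import subprocess
from typing import Any, Callable, List, Optional


def run_cli(args: List[str]) -> dict:
    """Run agent-desktop and parse the JSON envelope from stdout.

    Assumption: a JSON envelope is printed on stdout for every invocation.
    """
    proc = subprocess.run(["agent-desktop", *args], capture_output=True, text=True)
    return json.loads(proc.stdout)


def observe_act(
    app: str,
    choose_ref: Callable[[Any], Optional[str]],
    max_steps: int = 10,
    run: Callable[[List[str]], dict] = run_cli,
) -> int:
    """Snapshot, pick a ref, click it, and repeat until choose_ref returns None.

    Returns the number of click actions performed.
    """
    actions = 0
    for _ in range(max_steps):
        # OBSERVE: a fresh snapshot; it also verifies the previous action.
        snap = run(["snapshot", "--app", app, "-i"])
        if not snap["ok"]:
            raise RuntimeError(snap["error"]["code"])
        # REASON: the caller decides which ref to act on (hypothetical callback).
        ref = choose_ref(snap["data"])
        if ref is None:
            return actions
        # ACT: refs from this snapshot are only used before the next UI change.
        result = run(["click", ref])
        if not result["ok"]:
            raise RuntimeError(result["error"]["code"])
        actions += 1
    return actions
```

Injecting run keeps the loop testable without a desktop session: a stub that returns canned envelopes can replace the real CLI call.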

Ref System

  • Refs are assigned depth-first: @e1, @e2, @e3, ...
  • Only interactive elements get refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell
  • Static text, groups, and containers remain in the tree for context but have no ref
  • Refs are deterministic within a snapshot but NOT stable across snapshots if the UI changed
  • After any action that changes the UI, run snapshot again for fresh refs
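
To make the numbering rule concrete, here is an illustrative Python model of depth-first ref assignment. It is not the tool's implementation: the nested-dict tree shape (role, name, children), the assign_refs function, and the choice of pre-order traversal are all assumptions for this example. Only the list of interactive roles comes from the text above.

```python
# Roles listed in this document as receiving refs.
INTERACTIVE_ROLES = {
    "button", "textfield", "checkbox", "link", "menuitem",
    "tab", "slider", "combobox", "treeitem", "cell",
}


def assign_refs(root: dict) -> dict:
    """Walk a hypothetical tree depth-first (pre-order) and number interactive nodes."""
    refmap = {}

    def visit(node: dict) -> None:
        # Only interactive elements get a ref; other nodes are still traversed.
        if node.get("role") in INTERACTIVE_ROLES:
            refmap[f"@e{len(refmap) + 1}"] = node
        for child in node.get("children", []):
            visit(child)

    visit(root)
    return refmap


# Hypothetical tree: a window holding a group (button + static text) and a text field.
tree = {
    "role": "window",
    "children": [
        {"role": "group", "children": [
            {"role": "button", "name": "OK"},
            {"role": "statictext", "name": "Hello"},
        ]},
        {"role": "textfield", "name": "Search"},
    ],
}
refs = assign_refs(tree)
```

In this model, inserting a new button ahead of the text field would shift the text field from @e2 to @e3, which is why a ref should not be reused after the UI changes.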

JSON Output Contract

Every command returns a JSON envelope on stdout:
Success:
{ "version": "1.0", "ok": true, "command": "snapshot", "data": { ... } }
Error:
{ "version": "1.0", "ok": false, "command": "click", "error": { "code": "STALE_REF", "message": "...", "suggestion": "..." } }
Exit codes: 0 success, 1 structured error, 2 argument error.
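
A caller might wrap this contract as follows. This is a minimal Python sketch: Result, parse_envelope, and describe_exit are hypothetical helper names, and the field names are taken from the two envelopes above.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Result:
    ok: bool
    command: str
    data: Any = None
    error_code: Optional[str] = None
    message: Optional[str] = None
    suggestion: Optional[str] = None


def parse_envelope(stdout: str) -> Result:
    """Turn the JSON envelope printed on stdout into a typed result."""
    env = json.loads(stdout)
    if env["ok"]:
        return Result(ok=True, command=env["command"], data=env.get("data"))
    err = env["error"]
    return Result(
        ok=False,
        command=env["command"],
        error_code=err["code"],
        message=err.get("message"),
        suggestion=err.get("suggestion"),
    )


# Exit code meanings as listed above.
EXIT_CODES = {0: "success", 1: "structured error", 2: "argument error"}


def describe_exit(code: int) -> str:
    """Map a process exit code to its documented meaning."""
    return EXIT_CODES.get(code, "unknown")
```

Branching on ok and error_code rather than on message text keeps the caller independent of how messages are worded.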

Error Codes

| Code | Meaning | Recovery |
| --- | --- | --- |
| PERM_DENIED | Accessibility permission not granted | Grant in System Settings > Privacy > Accessibility |
| ELEMENT_NOT_FOUND | Ref not in current refmap | Re-run snapshot, use fresh ref |
| APP_NOT_FOUND | App not running | Launch it first |
| ACTION_FAILED | AX action rejected | Try alternative approach or coordinate-based click |
| ACTION_NOT_SUPPORTED | Element can't do this | Use different command |
| STALE_REF | Ref from old snapshot | Re-run snapshot |
| WINDOW_NOT_FOUND | No matching window | Check app name, use list-windows |
| TIMEOUT | Wait condition not met | Increase --timeout |
| INVALID_ARGS | Bad arguments | Check command syntax |
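
One way an agent could act on these codes is a small lookup from code to next step. The sketch below is an assumption-laden illustration in Python: the step labels (for example "resnapshot" and "launch_app") and the function names are hypothetical shorthand for the Recovery entries above, not values the CLI emits.

```python
# Hypothetical step labels summarizing the Recovery guidance for each code.
RECOVERY = {
    "PERM_DENIED": "grant_permission",
    "ELEMENT_NOT_FOUND": "resnapshot",
    "APP_NOT_FOUND": "launch_app",
    "ACTION_FAILED": "try_alternative",
    "ACTION_NOT_SUPPORTED": "use_other_command",
    "STALE_REF": "resnapshot",
    "WINDOW_NOT_FOUND": "list_windows",
    "TIMEOUT": "increase_timeout",
    "INVALID_ARGS": "fix_arguments",
}


def recovery_for(error_code: str) -> str:
    """Return the next step for an error code, or "give_up" for unknown codes."""
    return RECOVERY.get(error_code, "give_up")


def needs_fresh_snapshot(error_code: str) -> bool:
    """True when the documented recovery is to re-run snapshot and use new refs."""
    return RECOVERY.get(error_code) == "resnapshot"
```

Both ELEMENT_NOT_FOUND and STALE_REF resolve to the same step, so a caller can treat them as one retry path: take a new snapshot, re-locate the target, and issue the action again.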

Command Quick Reference (50 commands)

Observation

agent-desktop snapshot --app "App" -i           # Accessibility tree with refs
agent-desktop screenshot --app "App" out.png    # PNG screenshot
agent-desktop find --app "App" --role button    # Search elements
agent-desktop get @e1 --property text           # Read element property
agent-desktop is @e1 --property enabled         # Check element state
agent-desktop list-surfaces --app "App"         # Available surfaces

Interaction

agent-desktop click @e5                         # Click element
agent-desktop double-click @e3                  # Double-click
agent-desktop triple-click @e2                  # Triple-click (select line)
agent-desktop right-click @e5                   # Right-click (context menu)
agent-desktop type @e2 "hello"                  # Type text into element
agent-desktop set-value @e2 "new value"         # Set value directly
agent-desktop clear @e2                         # Clear element value
agent-desktop focus @e2                         # Set keyboard focus
agent-desktop select @e4 "Option B"             # Select dropdown option
agent-desktop toggle @e6                        # Toggle checkbox/switch
agent-desktop check @e6                         # Idempotent check
agent-desktop uncheck @e6                       # Idempotent uncheck
agent-desktop expand @e7                        # Expand disclosure
agent-desktop collapse @e7                      # Collapse disclosure
agent-desktop scroll @e1 --direction down       # Scroll element
agent-desktop scroll-to @e8                     # Scroll into view

Keyboard & Mouse

agent-desktop press cmd+c                       # Key combo
agent-desktop press return --app "App"          # Targeted key press
agent-desktop key-down shift                    # Hold key
agent-desktop key-up shift                      # Release key
agent-desktop hover @e5                         # Cursor to element
agent-desktop hover --xy 500,300                # Cursor to coordinates
agent-desktop drag --from @e1 --to @e5          # Drag between elements
agent-desktop mouse-click --xy 500,300          # Click at coordinates
agent-desktop mouse-move --xy 100,200           # Move cursor
agent-desktop mouse-down --xy 100,200           # Press mouse button
agent-desktop mouse-up --xy 300,400             # Release mouse button

App & Window

agent-desktop launch "System Settings"          # Launch and wait
agent-desktop close-app "TextEdit"              # Quit gracefully
agent-desktop close-app "TextEdit" --force      # Force kill
agent-desktop list-windows --app "Finder"       # List windows
agent-desktop list-apps                         # List running GUI apps
agent-desktop focus-window --app "Finder"       # Bring to front
agent-desktop resize-window --app "App" --width 800 --height 600
agent-desktop move-window --app "App" --x 0 --y 0
agent-desktop minimize --app "App"
agent-desktop maximize --app "App"
agent-desktop restore --app "App"

Clipboard

agent-desktop clipboard-get                     # Read clipboard
agent-desktop clipboard-set "text"              # Write to clipboard
agent-desktop clipboard-clear                   # Clear clipboard

Wait

agent-desktop wait 1000                         # Pause 1 second
agent-desktop wait --element @e5 --timeout 5000 # Wait for element
agent-desktop wait --window "Title"             # Wait for window
agent-desktop wait --text "Done" --app "App"    # Wait for text
agent-desktop wait --menu --app "App"           # Wait for context menu
agent-desktop wait --menu-closed --app "App"    # Wait for menu dismissal

System

agent-desktop status                            # Health check
agent-desktop permissions                       # Check permission
agent-desktop permissions --request             # Trigger permission dialog
agent-desktop version --json                    # Version info
agent-desktop batch '[...]' --stop-on-error     # Batch commands

Key Principles for Agents

  1. Always snapshot first. Never assume UI state.
  2. Use the -i flag. It filters the snapshot to interactive elements only, which reduces tokens.
  3. Refs are ephemeral. Snapshot again after any UI-changing action.
  4. Prefer refs over coordinates. Use click @e5 rather than mouse-click --xy 500,300.
  5. Use wait for async UI. After a launch or a dialog trigger, wait for the expected state.
  6. Check permissions first. Run permissions on first use.
  7. Handle errors. Parse error.code and follow error.suggestion.
  8. Use find for targeted searches. It is faster than a full snapshot when you know the role or name.
  9. Use surfaces for menus. snapshot --surface menu captures open menus.
  10. Batch for performance. Run multiple commands in one invocation.