agent-desktop
CLI tool enabling AI agents to observe and control desktop applications via native OS accessibility trees.
Core principle: agent-desktop is NOT an AI agent. It is a tool that AI agents invoke. It outputs structured JSON with ref-based element identifiers. The observation-action loop lives in the calling agent.
Installation
```bash
npm install -g agent-desktop
```

or

```bash
bun install -g --trust agent-desktop
```

Requires macOS 12+ with Accessibility permission granted to your terminal.
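A preflight script can check the version requirement before anything else runs. The sketch below is minimal. `macos_supported` is a hypothetical helper and not part of agent-desktop. `sw_vers -productVersion` is the standard macOS command that prints the OS version.

```bash
# macos_supported VERSION (hypothetical helper): succeed if the major
# version is 12 or newer, matching the stated macOS 12+ requirement.
macos_supported() {
  major=${1%%.*}                 # "14.5" -> "14", "12" -> "12"
  case $major in
    ''|*[!0-9]*) return 1 ;;     # empty or not a plain number
  esac
  [ "$major" -ge 12 ]
}

# Usage on a Mac:
#   macos_supported "$(sw_vers -productVersion)" || echo "macOS 12+ required" >&2
#   agent-desktop permissions    # then confirm Accessibility access
```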
Reference Files
Detailed documentation is split into focused reference files. Read them as needed:
| Reference | Contents |
|---|---|
| | snapshot, find, get, is, screenshot, list-surfaces: all flags, output examples |
| | click, type, set-value, select, toggle, scroll, drag, keyboard, mouse: choosing the right command |
| | launch, close, windows, clipboard, wait, batch, status, permissions, version |
| | 12 common patterns: forms, menus, dialogs, scroll-find, drag-drop, async wait, anti-patterns |
| | macOS permissions/TCC, AX API internals, smart activation chain, surfaces, troubleshooting |
The Observe-Act Loop
Every automation follows this pattern:
1. OBSERVE → `agent-desktop snapshot --app "App Name" -i`
2. REASON → Parse the JSON and find the target element by its ref (`@e1`, `@e2`, ...).
3. ACT → `agent-desktop click @e5` (or `type`, `select`, `toggle`, ...)
4. VERIFY → Run `agent-desktop snapshot` again to confirm the state change.
5. REPEAT → Continue until the task is complete.

Always snapshot before acting. Refs are snapshot-scoped and become stale after UI changes.
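One pass through the loop can be sketched as two shell functions. `observe` and `act_then_observe` are hypothetical helper names, not agent-desktop commands. The REASON step stays with the calling agent, and `@e5` below is only a placeholder for whatever ref the agent picks.

```bash
# observe APP (hypothetical helper): print a fresh snapshot of interactive
# elements (OBSERVE).
observe() {
  agent-desktop snapshot --app "$1" -i
}

# act_then_observe APP REF (hypothetical helper): click a ref taken from the
# latest snapshot (ACT), then print a new snapshot so the caller can confirm
# the change (VERIFY). The click's envelope goes to stderr, so stdout holds
# only the new snapshot.
act_then_observe() {
  agent-desktop click "$2" >&2 || return   # stop on failure; error envelope is on stderr
  observe "$1"
}

# One iteration:
#   observe "TextEdit" > snap.json            # agent reads snap.json, picks @e5
#   act_then_observe "TextEdit" @e5 > snap.json
```

Refs from the earlier `snap.json` should be discarded once the new snapshot arrives.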
Ref System
- Refs are assigned depth-first: `@e1`, `@e2`, `@e3`, ...
- Only interactive elements get refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell.
- Static text, groups, and containers remain in the tree for context but have no ref.
- Refs are deterministic within a snapshot but NOT stable across snapshots if the UI changed.
- After any action that changes the UI, run `snapshot` again for fresh refs.
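A calling script can list the refs a snapshot handed out by scanning its output. This sketch assumes refs appear in the JSON as `@eN` strings, since the exact snapshot schema is not shown here. `list_refs` is a hypothetical helper. It keeps each ref once, in the order it first appears in the text.

```bash
# list_refs (hypothetical helper): read snapshot JSON on stdin and print each
# "@eN" ref once, in first-seen order. grep -o works in both BSD and GNU grep.
list_refs() {
  grep -o '@e[0-9][0-9]*' | awk '!seen[$0]++'
}

# Usage:
#   agent-desktop snapshot --app "App" -i | list_refs
```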
JSON Output Contract
Every command returns a JSON envelope on stdout.

Success:

```json
{ "version": "1.0", "ok": true, "command": "snapshot", "data": { ... } }
```

Error:

```json
{ "version": "1.0", "ok": false, "command": "click", "error": { "code": "STALE_REF", "message": "...", "suggestion": "..." } }
```

Exit codes: `0` success, `1` structured error, `2` argument error.
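The wrapper below is a minimal sketch that relies only on the exit codes and the `error.code` field described above. `ad_run`, `error_code`, and the `AD_OUT` variable are hypothetical names. The `sed` extraction assumes `"code"` appears once in the envelope. A real script would more likely use a JSON parser such as `jq`.

```bash
# ad_run ARGS... (hypothetical helper): run one agent-desktop command, keep
# its JSON envelope in AD_OUT, and report failures by exit code
# (0 ok, 1 structured error, 2 argument error). Returns the command's code.
ad_run() {
  AD_OUT=$(agent-desktop "$@")
  rc=$?
  case $rc in
    0) ;;                                                  # "ok": true, payload in "data"
    1) echo "structured error: $(error_code "$AD_OUT")" >&2 ;;
    2) echo "argument error: check the command syntax" >&2 ;;
  esac
  return $rc
}

# error_code ENVELOPE (hypothetical helper): print error.code using sed.
# Assumes "code" appears once in the envelope.
error_code() {
  printf '%s\n' "$1" | sed -n 's/.*"code": *"\([A-Z_]*\)".*/\1/p'
}

# Usage:
#   ad_run click @e5 || echo "failed: $AD_OUT"
```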
Error Codes
| Code | Meaning | Recovery |
|---|---|---|
| | Accessibility permission not granted | Grant it in System Settings > Privacy & Security > Accessibility |
| | Ref not in current refmap | Re-run `snapshot` and use a fresh ref |
| | App not running | Launch it first |
| | AX action rejected | Try an alternative approach or a coordinate-based click |
| | Element can't do this | Use a different command |
| STALE_REF | Ref from old snapshot | Re-run `snapshot` |
| | No matching window | Check the app name; use `list-windows` |
| | Wait condition not met | Increase `--timeout` |
| | Bad arguments | Check the command syntax |
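Recovery can be keyed off `error.code`. Only `STALE_REF` is spelled out in this document. This sketch therefore handles that case explicitly by taking a fresh snapshot so the agent can retry with a new ref. For every other code, it prints the envelope's own `suggestion`. `recover` is a hypothetical helper, and the `sed` parsing carries the same single-occurrence assumption as any line-based JSON scraping.

```bash
# recover ENVELOPE APP (hypothetical helper): react to a failed command's
# JSON envelope.
#   STALE_REF -> print a fresh snapshot so the agent can pick a new ref.
#   otherwise -> print error.suggestion to stderr and return 1.
recover() {
  envelope=$1 app=$2
  code=$(printf '%s\n' "$envelope" | sed -n 's/.*"code": *"\([A-Z_]*\)".*/\1/p')
  case $code in
    STALE_REF)
      agent-desktop snapshot --app "$app" -i
      ;;
    *)
      printf '%s\n' "$envelope" | sed -n 's/.*"suggestion": *"\([^"]*\)".*/\1/p' >&2
      return 1
      ;;
  esac
}
```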
Command Quick Reference (50 commands)
Observation
```bash
agent-desktop snapshot --app "App" -i # Accessibility tree with refs
agent-desktop screenshot --app "App" out.png # PNG screenshot
agent-desktop find --app "App" --role button # Search elements
agent-desktop get @e1 --property text # Read element property
agent-desktop is @e1 --property enabled # Check element state
agent-desktop list-surfaces --app "App" # Available surfaces
```
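When you already know the role you want, `find` followed by `get` avoids a full snapshot. This is a minimal sketch. It assumes `find`'s JSON output contains the matching elements' refs as `@eN` strings, and that those refs are accepted by `get`. `first_ref` is a hypothetical helper.

```bash
# first_ref APP ROLE (hypothetical helper): print the first "@eN" ref found in
# the output of a targeted find. Assumes refs appear as "@eN" strings.
first_ref() {
  agent-desktop find --app "$1" --role "$2" | grep -o '@e[0-9][0-9]*' | head -n 1
}

# Usage: read the text of the first button without a full snapshot.
#   ref=$(first_ref "App" button) && [ -n "$ref" ] &&
#     agent-desktop get "$ref" --property text
```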
Interaction
```bash
agent-desktop click @e5 # Click element
agent-desktop double-click @e3 # Double-click
agent-desktop triple-click @e2 # Triple-click (select line)
agent-desktop right-click @e5 # Right-click (context menu)
agent-desktop type @e2 "hello" # Type text into element
agent-desktop set-value @e2 "new value" # Set value directly
agent-desktop clear @e2 # Clear element value
agent-desktop focus @e2 # Set keyboard focus
agent-desktop select @e4 "Option B" # Select dropdown option
agent-desktop toggle @e6 # Toggle checkbox/switch
agent-desktop check @e6 # Idempotent check
agent-desktop uncheck @e6 # Idempotent uncheck
agent-desktop expand @e7 # Expand disclosure
agent-desktop collapse @e7 # Collapse disclosure
agent-desktop scroll @e1 --direction down # Scroll element
agent-desktop scroll-to @e8 # Scroll into view
```
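The text commands can be combined into a fallback. This sketch tries `set-value` first. If the element rejects it (non-zero exit), the sketch falls back to `clear` followed by `type`. That ordering is an illustrative assumption, not documented agent-desktop behavior. `fill_field` is a hypothetical helper.

```bash
# fill_field REF TEXT (hypothetical helper): set a field's text, preferring
# set-value and falling back to clear + type if set-value fails.
fill_field() {
  ref=$1 text=$2
  agent-desktop set-value "$ref" "$text" > /dev/null && return 0
  agent-desktop clear "$ref" > /dev/null || return
  agent-desktop type "$ref" "$text" > /dev/null
}

# Usage:
#   fill_field @e2 "new value"
```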
Keyboard & Mouse
```bash
agent-desktop press cmd+c # Key combo
agent-desktop press return --app "App" # Targeted key press
agent-desktop key-down shift # Hold key
agent-desktop key-up shift # Release key
agent-desktop hover @e5 # Cursor to element
agent-desktop hover --xy 500,300 # Cursor to coordinates
agent-desktop drag --from @e1 --to @e5 # Drag between elements
agent-desktop mouse-click --xy 500,300 # Click at coordinates
agent-desktop mouse-move --xy 100,200 # Move cursor
agent-desktop mouse-down --xy 100,200 # Press mouse button
agent-desktop mouse-up --xy 300,400 # Release mouse button
```
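You can build a drag from coordinates alone by composing the mouse primitives listed above. The sketch below does this. `manual_drag` is a hypothetical helper, and coordinates use the same `x,y` form as the examples. Each step runs only if the previous one succeeded.

```bash
# manual_drag FROM TO (hypothetical helper): move to FROM, press, move to TO,
# and release. FROM and TO are "x,y" strings.
# If a step fails after mouse-down, the button may still be held; a caller
# could send "agent-desktop mouse-up --xy TO" to clean up.
manual_drag() {
  from=$1 to=$2
  agent-desktop mouse-move --xy "$from" &&
    agent-desktop mouse-down --xy "$from" &&
    agent-desktop mouse-move --xy "$to" &&
    agent-desktop mouse-up --xy "$to"
}

# Usage:
#   manual_drag 100,200 300,400
```

Refs remain preferable when they exist (`drag --from @e1 --to @e5`).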
App & Window
```bash
agent-desktop launch "System Settings" # Launch and wait
agent-desktop close-app "TextEdit" # Quit gracefully
agent-desktop close-app "TextEdit" --force # Force kill
agent-desktop list-windows --app "Finder" # List windows
agent-desktop list-apps # List running GUI apps
agent-desktop focus-window --app "Finder" # Bring to front
agent-desktop resize-window --app "App" --width 800 --height 600
agent-desktop move-window --app "App" --x 0 --y 0
agent-desktop minimize --app "App"
agent-desktop maximize --app "App"
agent-desktop restore --app "App"
```
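The window commands chain naturally when a task needs a predictable layout. The sketch below launches an app, brings it to the front, and pins its window to the top-left at 800×600. `arrange_app` is a hypothetical helper, and the position and size values are only the ones used in the examples above.

```bash
# arrange_app APP (hypothetical helper): launch APP, focus it, then move and
# resize the window that the --app commands target. Stops at the first failure.
arrange_app() {
  app=$1
  agent-desktop launch "$app" &&
    agent-desktop focus-window --app "$app" &&
    agent-desktop move-window --app "$app" --x 0 --y 0 &&
    agent-desktop resize-window --app "$app" --width 800 --height 600
}

# Usage:
#   arrange_app "TextEdit"
```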
Clipboard
```bash
agent-desktop clipboard-get # Read clipboard
agent-desktop clipboard-set "text" # Write to clipboard
agent-desktop clipboard-clear # Clear clipboard
```
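The clipboard commands offer an alternative to `type` for entering text. This sketch writes the text to the clipboard, focuses the field, and sends the paste shortcut. It assumes `cmd+v` is the target app's paste shortcut (the macOS default) and that `press` accepts it together with `--app`. `paste_into` is a hypothetical helper.

```bash
# paste_into APP REF TEXT (hypothetical helper): put TEXT on the clipboard,
# focus REF, and paste with cmd+v. Stops at the first failure.
paste_into() {
  app=$1 ref=$2 text=$3
  agent-desktop clipboard-set "$text" &&
    agent-desktop focus "$ref" &&
    agent-desktop press cmd+v --app "$app"
}

# Usage:
#   paste_into "TextEdit" @e2 "hello"
```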
Wait
```bash
agent-desktop wait 1000 # Pause 1 second
agent-desktop wait --element @e5 --timeout 5000 # Wait for element
agent-desktop wait --window "Title" # Wait for window
agent-desktop wait --text "Done" --app "App" # Wait for text
agent-desktop wait --menu --app "App" # Wait for context menu
agent-desktop wait --menu-closed --app "App" # Wait for menu dismissal
```
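Context menus combine `right-click`, `wait --menu`, and the menu surface of `snapshot`. The sketch below chains the three. It assumes `--surface menu` can be combined with `--app` and `-i` on `snapshot`. `open_context_menu` is a hypothetical helper.

```bash
# open_context_menu APP REF (hypothetical helper): right-click REF, wait for
# the context menu, then print a snapshot of the open menu. Assumes
# snapshot accepts --surface menu together with --app and -i.
open_context_menu() {
  app=$1 ref=$2
  agent-desktop right-click "$ref" &&
    agent-desktop wait --menu --app "$app" &&
    agent-desktop snapshot --app "$app" --surface menu -i
}

# Usage:
#   open_context_menu "Finder" @e5 > menu.json   # then pick a menu item ref
```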
System
```bash
agent-desktop status # Health check
agent-desktop permissions # Check permission
agent-desktop permissions --request # Trigger permission dialog
agent-desktop version --json # Version info
agent-desktop batch '[...]' --stop-on-error # Batch commands
```
Key Principles for Agents
- Always snapshot first. Never assume UI state.
- Use the `-i` flag. It filters the tree to interactive elements only, reducing tokens.
- Refs are ephemeral. Snapshot again after any UI-changing action.
- Prefer refs over coordinates. `click @e5` is better than `mouse-click --xy 500,300`.
- Use `wait` for async UI. After a launch or a dialog trigger, wait for the expected state.
- Check permissions first. Run `permissions` on first use.
- Handle errors. Parse `error.code` and follow `error.suggestion`.
- Use `find` for targeted searches. It is faster than a full snapshot when you know the element's role or name.
- Use surfaces for menus. `snapshot --surface menu` captures open menus.
- Batch for performance. `batch` runs multiple commands in one invocation.