agent-desktop


CLI tool enabling AI agents to observe and control desktop applications via native OS accessibility trees.

Core principle: agent-desktop is NOT an AI agent. It is a tool that AI agents invoke. It outputs structured JSON with ref-based element identifiers. The observation-action loop lives in the calling agent.

Installation

npm install -g agent-desktop

or

bun install -g --trust agent-desktop

Requires macOS 12+ with Accessibility permission granted to your terminal.

Reference Files

Detailed documentation is split into focused reference files. Read them as needed:

Reference	Contents
`references/commands-observation.md`	snapshot, find, get, is, screenshot, list-surfaces — all flags, output examples
`references/commands-interaction.md`	click, type, set-value, select, toggle, scroll, drag, keyboard, mouse — choosing the right command
`references/commands-system.md`	launch, close, windows, clipboard, wait, batch, status, permissions, version
`references/workflows.md`	12 common patterns: forms, menus, dialogs, scroll-find, drag-drop, async wait, anti-patterns
`references/macos.md`	macOS permissions/TCC, AX API internals, smart activation chain, surfaces, troubleshooting

The Observe-Act Loop

Every automation follows this pattern:

1. OBSERVE  → agent-desktop snapshot --app "App Name" -i
2. REASON   → Parse JSON, find target element by ref (@e1, @e2...)
3. ACT      → agent-desktop click @e5  (or type, select, toggle...)
4. VERIFY   → agent-desktop snapshot again to confirm state change
5. REPEAT   → Continue until task is complete

Always snapshot before acting. Refs are snapshot-scoped and become stale after UI changes.
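
The five steps above can be sketched as a small driver in Python. This is a minimal illustration, not part of agent-desktop itself: `observe_act_loop`, `decide`, and `run_cli` are hypothetical names, and the shape of the snapshot `data` is left to the caller because it is not specified here. Only the `snapshot --app ... -i` invocation and the `ok` / `error.code` envelope fields come from this document.

```python
import json
import subprocess
from typing import Callable, List, Optional

Argv = List[str]


def run_cli(args: Argv) -> str:
    """Invoke agent-desktop and return its stdout (the JSON envelope)."""
    proc = subprocess.run(["agent-desktop", *args], capture_output=True, text=True)
    return proc.stdout


def observe_act_loop(
    app: str,
    decide: Callable[[dict], Optional[Argv]],
    run: Callable[[Argv], str] = run_cli,
    max_steps: int = 10,
) -> int:
    """Run the loop and return the number of actions performed.

    `decide` is the caller's reasoning step (hypothetical): given the
    snapshot data, it returns the next command's arguments, for example
    ["click", "@e5"], or None once the task is complete.
    """
    for actions_done in range(max_steps):
        # OBSERVE: always take a fresh snapshot, since refs go stale.
        snapshot = json.loads(run(["snapshot", "--app", app, "-i"]))
        if not snapshot["ok"]:
            raise RuntimeError(snapshot["error"]["code"])
        # REASON: the calling agent picks the next action.
        action = decide(snapshot["data"])
        if action is None:
            return actions_done
        # ACT: run the chosen command.
        result = json.loads(run(action))
        if not result["ok"]:
            raise RuntimeError(result["error"]["code"])
        # VERIFY and REPEAT: the next iteration re-snapshots.
    raise RuntimeError(f"task not complete after {max_steps} actions")
```

Passing `run` as a parameter keeps the loop testable without a desktop session: a stub that returns canned envelopes can stand in for the real CLI.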

Ref System

Refs are assigned depth-first: `@e1`, `@e2`, `@e3`, ...
Only interactive elements get refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell
Static text, groups, and containers remain in the tree for context but have no ref
Refs are deterministic within a snapshot but NOT stable across snapshots if the UI changed
After any action that changes the UI, run `snapshot` again for fresh refs
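
To make the numbering rule concrete, here is a toy sketch of depth-first ref assignment. It is not agent-desktop's implementation: the node shape (`role`, `children`) is hypothetical, and visiting a parent before its children (pre-order) is an assumption about what "depth-first" means here. The list of interactive roles is taken from the text above.

```python
from itertools import count
from typing import Dict

# Roles that receive a ref, as listed in this section.
INTERACTIVE_ROLES = {
    "button", "textfield", "checkbox", "link", "menuitem",
    "tab", "slider", "combobox", "treeitem", "cell",
}


def assign_refs(root: dict) -> Dict[str, dict]:
    """Walk the tree depth-first and give each interactive node a ref.

    Returns a refmap from ref string (e.g. "@e1") to node. Nodes with
    other roles stay in the tree but get no "ref" key.
    """
    counter = count(1)
    refmap: Dict[str, dict] = {}

    def visit(node: dict) -> None:
        if node["role"] in INTERACTIVE_ROLES:
            ref = f"@e{next(counter)}"
            node["ref"] = ref
            refmap[ref] = node
        for child in node.get("children", []):
            visit(child)

    visit(root)
    return refmap


tree = {
    "role": "window",
    "children": [
        {"role": "group", "children": [
            {"role": "button", "name": "OK"},
            {"role": "statictext", "name": "Ready"},
        ]},
        {"role": "textfield", "name": "Search"},
    ],
}
refs = assign_refs(tree)
```

Inserting a new button ahead of "OK" would shift every later number, which is why a ref is only meaningful against the snapshot that produced it.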

JSON Output Contract

Every command returns a JSON envelope on stdout:

Success:

{ "version": "1.0", "ok": true, "command": "snapshot", "data": { ... } }

Error:

{ "version": "1.0", "ok": false, "command": "click", "error": { "code": "STALE_REF", "message": "...", "suggestion": "..." } }

Exit codes distinguish three outcomes: success, structured error, and argument error.
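
A caller might unwrap the envelope along these lines. This is a sketch under stated assumptions: `AgentDesktopError` and `parse_envelope` are hypothetical helper names, and only the fields shown in the two examples above (`ok`, `command`, `data`, `error.code`, `error.message`, `error.suggestion`) are relied on.

```python
import json


class AgentDesktopError(Exception):
    """Raised when an envelope reports ok == false (hypothetical helper)."""

    def __init__(self, command: str, code: str, message: str, suggestion: str):
        super().__init__(f"{command} failed with {code}: {message}")
        self.command = command
        self.code = code
        self.suggestion = suggestion


def parse_envelope(stdout: str) -> dict:
    """Return the `data` payload on success, raise AgentDesktopError otherwise."""
    envelope = json.loads(stdout)
    if envelope["ok"]:
        # Fall back to an empty dict in case a command carries no payload.
        return envelope.get("data", {})
    err = envelope["error"]
    raise AgentDesktopError(
        envelope["command"],
        err["code"],
        err.get("message", ""),
        err.get("suggestion", ""),
    )
```

Raising a typed exception keeps the error code and suggestion available to whatever recovery logic the calling agent uses.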

Error Codes

Code	Meaning	Recovery
`PERM_DENIED`	Accessibility permission not granted	Grant in System Settings > Privacy & Security > Accessibility
`ELEMENT_NOT_FOUND`	Ref not in current refmap	Re-run snapshot, use fresh ref
`APP_NOT_FOUND`	App not running	Launch it first
`ACTION_FAILED`	AX action rejected	Try alternative approach or coordinate-based click
`ACTION_NOT_SUPPORTED`	Element can't do this	Use different command
`STALE_REF`	Ref from old snapshot	Re-run snapshot
`WINDOW_NOT_FOUND`	No matching window	Check app name, use list-windows
`TIMEOUT`	Wait condition not met	Increase --timeout
`INVALID_ARGS`	Bad arguments	Check command syntax
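
One way a calling agent could act on this table is a lookup from error code to a recovery strategy. The sketch below is illustrative only: the strategy names (`resnapshot`, `launch_app`, and so on) are hypothetical labels for the calling agent's own routines, not agent-desktop commands. The codes and their groupings follow the table above.

```python
# Hypothetical strategy labels, grouped by the Recovery column above.
RECOVERY = {
    "PERM_DENIED": "ask_user",           # needs a human to grant permission
    "ELEMENT_NOT_FOUND": "resnapshot",   # ref missing from current refmap
    "STALE_REF": "resnapshot",           # ref came from an old snapshot
    "APP_NOT_FOUND": "launch_app",       # start the app, then retry
    "ACTION_FAILED": "try_alternative",  # e.g. coordinate-based click
    "ACTION_NOT_SUPPORTED": "try_alternative",
    "WINDOW_NOT_FOUND": "list_windows",  # check app name and open windows
    "TIMEOUT": "increase_timeout",
    "INVALID_ARGS": "fix_arguments",
}


def recovery_for(code: str) -> str:
    """Map an error code to a strategy; unknown codes give up rather than guess."""
    return RECOVERY.get(code, "abort")
```

Defaulting unknown codes to `abort` is a deliberate choice in this sketch: retrying blindly on an unrecognized error risks acting on a UI state the agent does not understand.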

Command Quick Reference (50 commands)

Observation

agent-desktop snapshot --app "App" -i           # Accessibility tree with refs
agent-desktop screenshot --app "App" out.png    # PNG screenshot
agent-desktop find --app "App" --role button    # Search elements
agent-desktop get @e1 --property text           # Read element property
agent-desktop is @e1 --property enabled         # Check element state
agent-desktop list-surfaces --app "App"         # Available surfaces

Interaction

agent-desktop click @e5                         # Click element
agent-desktop double-click @e3                  # Double-click
agent-desktop triple-click @e2                  # Triple-click (select line)
agent-desktop right-click @e5                   # Right-click (context menu)
agent-desktop type @e2 "hello"                  # Type text into element
agent-desktop set-value @e2 "new value"         # Set value directly
agent-desktop clear @e2                         # Clear element value
agent-desktop focus @e2                         # Set keyboard focus
agent-desktop select @e4 "Option B"             # Select dropdown option
agent-desktop toggle @e6                        # Toggle checkbox/switch
agent-desktop check @e6                         # Idempotent check
agent-desktop uncheck @e6                       # Idempotent uncheck
agent-desktop expand @e7                        # Expand disclosure
agent-desktop collapse @e7                      # Collapse disclosure
agent-desktop scroll @e1 --direction down       # Scroll element
agent-desktop scroll-to @e8                     # Scroll into view

Keyboard & Mouse

agent-desktop press cmd+c                       # Key combo
agent-desktop press return --app "App"          # Targeted key press
agent-desktop key-down shift                    # Hold key
agent-desktop key-up shift                      # Release key
agent-desktop hover @e5                         # Cursor to element
agent-desktop hover --xy 500,300                # Cursor to coordinates
agent-desktop drag --from @e1 --to @e5          # Drag between elements
agent-desktop mouse-click --xy 500,300          # Click at coordinates
agent-desktop mouse-move --xy 100,200           # Move cursor
agent-desktop mouse-down --xy 100,200           # Press mouse button
agent-desktop mouse-up --xy 300,400             # Release mouse button

App & Window

agent-desktop launch "System Settings"          # Launch and wait
agent-desktop close-app "TextEdit"              # Quit gracefully
agent-desktop close-app "TextEdit" --force      # Force kill
agent-desktop list-windows --app "Finder"       # List windows
agent-desktop list-apps                         # List running GUI apps
agent-desktop focus-window --app "Finder"       # Bring to front
agent-desktop resize-window --app "App" --width 800 --height 600
agent-desktop move-window --app "App" --x 0 --y 0
agent-desktop minimize --app "App"
agent-desktop maximize --app "App"
agent-desktop restore --app "App"

Clipboard

agent-desktop clipboard-get                     # Read clipboard
agent-desktop clipboard-set "text"              # Write to clipboard
agent-desktop clipboard-clear                   # Clear clipboard

Wait

agent-desktop wait 1000                         # Pause 1 second
agent-desktop wait --element @e5 --timeout 5000 # Wait for element
agent-desktop wait --window "Title"             # Wait for window
agent-desktop wait --text "Done" --app "App"    # Wait for text
agent-desktop wait --menu --app "App"           # Wait for context menu
agent-desktop wait --menu-closed --app "App"    # Wait for menu dismissal

System

agent-desktop status                            # Health check
agent-desktop permissions                       # Check permission
agent-desktop permissions --request             # Trigger permission dialog
agent-desktop version --json                    # Version info
agent-desktop batch '[...]' --stop-on-error     # Batch commands

Key Principles for Agents

Always snapshot first. Never assume UI state.
Use the `-i` flag. It filters to interactive elements only, reducing tokens.
Refs are ephemeral. Snapshot again after any UI-changing action.
Prefer refs over coordinates. `click @e5` is preferable to `mouse-click --xy 500,300`.
Use `wait` for async UI. After launch/dialog triggers, wait for the expected state.
Check permissions first. Run `permissions` on first use.
Handle errors. Parse `error.code` and follow `error.suggestion`.
Use `find` for targeted searches. It is faster than a full snapshot when you know the role/name.
Use surfaces for menus. `snapshot --surface menu` captures open menus.
Batch for performance. Run multiple commands in one invocation.
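
Several of these principles combine naturally at the start of a session: check permissions, launch, wait for the window, then take an interactive-only snapshot. The sketch below only builds the argument lists, using command forms that appear in the quick reference above. `startup_commands` and `as_shell` are hypothetical helper names, and treating the window title as known in advance is an assumption.

```python
import shlex
from typing import List


def startup_commands(app: str, window_title: str) -> List[List[str]]:
    """Return the opening command sequence for a session, in order."""
    return [
        ["permissions"],                     # check permissions first
        ["launch", app],                     # launch and wait
        ["wait", "--window", window_title],  # wait for async UI to settle
        ["snapshot", "--app", app, "-i"],    # interactive elements only
    ]


def as_shell(argv: List[str]) -> str:
    """Render one argument list as a shell-safe command line."""
    return shlex.join(["agent-desktop", *argv])


for argv in startup_commands("System Settings", "System Settings"):
    print(as_shell(argv))
```

Keeping commands as argument lists until the last moment avoids quoting mistakes with app names that contain spaces.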
