computer-use-agents
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseComputer Use Agents
计算机操作AI Agent
Patterns
模式
Perception-Reasoning-Action Loop
感知-推理-行动循环
The fundamental architecture of computer use agents: observe screen,
reason about next action, execute action, repeat. This loop integrates
vision models with action execution through an iterative pipeline.
Key components:
- PERCEPTION: Screenshot captures current screen state
- REASONING: Vision-language model analyzes and plans
- ACTION: Execute mouse/keyboard operations
- FEEDBACK: Observe result, continue or correct
Critical insight: Vision agents are completely still during "thinking"
phase (1-5 seconds), creating a detectable pause pattern.
When to use: ['Building any computer use agent from scratch', 'Integrating vision models with desktop control', 'Understanding agent behavior patterns']
python
from anthropic import Anthropic
from PIL import Image
import base64
import pyautogui
import time
class ComputerUseAgent:
"""
Perception-Reasoning-Action loop implementation.
Based on Anthropic Computer Use patterns.
"""
def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
self.client = client
self.model = model
self.max_steps = 50 # Prevent runaway loops
self.action_delay = 0.5 # Seconds between actions
def capture_screenshot(self) -> str:
"""Capture screen and return base64 encoded image."""
screenshot = pyautogui.screenshot()
# Resize for token efficiency (1280x800 is good balance)
screenshot = screenshot.resize((1280, 800), Image.LANCZOS)
import io
buffer = io.BytesIO()
screenshot.save(buffer, format="PNG")
return base64.b64encode(buffer.getvalue()).decode()
def execute_action(self, action: dict) -> dict:
"""Execute mouse/keyboard action on the computer."""
action_type = action.get("type")
if action_type == "click":
x, y = action["x"], action["y"]
button = action.get("button", "left")
pyautogui.click(x, y, button=button)
return {"success": True, "action": f"clicked at ({x}, {y})"}
elif action_type == "type":
text = action["text"]
pyautogui.typewrite(text, interval=0.02)
return {"success": True, "action": f"typed {len(text)} chars"}
elif action_type == "key":
key = action["key"]
pyautogui.press(key)
return {"success": True, "action": f"pressed {key}"}
elif action_type == "scroll":
direction = action.get("direction", "down")
amount = action.get("amount", 3)
scroll = -amount if direction == "down" else amount
pyautogui.scroll(scroll)
return {"success": True, "action": f"scrolled {dir计算机操作Agent的基础架构:观察屏幕,推理下一步行动,执行行动,重复该循环。此循环通过迭代流程将视觉模型与行动执行集成在一起。
核心组件:
- 感知:截图捕获当前屏幕状态
- 推理:视觉语言模型进行分析和规划
- 行动:执行鼠标/键盘操作
- 反馈:观察结果,继续或修正操作
关键见解:视觉Agent在“思考”阶段(1-5秒)完全处于静止状态,会形成可检测到的停顿模式。
适用场景:['从零开始构建任何计算机操作Agent', '将视觉模型与桌面控制集成', '理解Agent行为模式']
python
from anthropic import Anthropic
from PIL import Image
import base64
import pyautogui
import time
class ComputerUseAgent:
"""
Perception-Reasoning-Action loop implementation.
Based on Anthropic Computer Use patterns.
"""
def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
self.client = client
self.model = model
self.max_steps = 50 # Prevent runaway loops
self.action_delay = 0.5 # Seconds between actions
def capture_screenshot(self) -> str:
"""Capture screen and return base64 encoded image."""
screenshot = pyautogui.screenshot()
# Resize for token efficiency (1280x800 is good balance)
screenshot = screenshot.resize((1280, 800), Image.LANCZOS)
import io
buffer = io.BytesIO()
screenshot.save(buffer, format="PNG")
return base64.b64encode(buffer.getvalue()).decode()
def execute_action(self, action: dict) -> dict:
"""Execute mouse/keyboard action on the computer."""
action_type = action.get("type")
if action_type == "click":
x, y = action["x"], action["y"]
button = action.get("button", "left")
pyautogui.click(x, y, button=button)
return {"success": True, "action": f"clicked at ({x}, {y})"}
elif action_type == "type":
text = action["text"]
pyautogui.typewrite(text, interval=0.02)
return {"success": True, "action": f"typed {len(text)} chars"}
elif action_type == "key":
key = action["key"]
pyautogui.press(key)
return {"success": True, "action": f"pressed {key}"}
elif action_type == "scroll":
direction = action.get("direction", "down")
amount = action.get("amount", 3)
scroll = -amount if direction == "down" else amount
pyautogui.scroll(scroll)
return {"success": True, "action": f"scrolled {dirSandboxed Environment Pattern
沙箱环境模式
Computer use agents MUST run in isolated, sandboxed environments.
Never give agents direct access to your main system - the security
risks are too high. Use Docker containers with virtual desktops.
Key isolation requirements:
- NETWORK: Restrict to necessary endpoints only
- FILESYSTEM: Read-only or scoped to temp directories
- CREDENTIALS: No access to host credentials
- SYSCALLS: Filter dangerous system calls
- RESOURCES: Limit CPU, memory, time
The goal is "blast radius minimization" - if the agent goes wrong,
damage is contained to the sandbox.
When to use: ['Deploying any computer use agent', 'Testing agent behavior safely', 'Running untrusted automation tasks']
python
undefined计算机操作Agent必须在隔离的沙箱环境中运行。绝不能让Agent直接访问你的主系统——安全风险过高。请使用带有虚拟桌面的Docker容器。
核心隔离要求:
- 网络:仅限制访问必要的端点
- 文件系统:只读或限定在临时目录
- 凭据:禁止访问主机凭据
- 系统调用:过滤危险的系统调用
- 资源:限制CPU、内存和时间
目标是“最小化爆炸半径”——如果Agent出现故障,损害将被限制在沙箱内。
适用场景:['部署任何计算机操作Agent', '安全测试Agent行为', '运行不受信任的自动化任务']
python
undefinedDockerfile for sandboxed computer use environment
Dockerfile for sandboxed computer use environment
Based on Anthropic's reference implementation pattern
Based on Anthropic's reference implementation pattern
FROM ubuntu:22.04
FROM ubuntu:22.04
Install desktop environment
Install desktop environment
RUN apt-get update && apt-get install -y
xvfb
x11vnc
fluxbox
xterm
firefox
python3
python3-pip
supervisor
xvfb
x11vnc
fluxbox
xterm
firefox
python3
python3-pip
supervisor
RUN apt-get update && apt-get install -y
xvfb
x11vnc
fluxbox
xterm
firefox
python3
python3-pip
supervisor
xvfb
x11vnc
fluxbox
xterm
firefox
python3
python3-pip
supervisor
Security: Create non-root user
Security: Create non-root user
RUN useradd -m -s /bin/bash agent &&
mkdir -p /home/agent/.vnc
mkdir -p /home/agent/.vnc
RUN useradd -m -s /bin/bash agent &&
mkdir -p /home/agent/.vnc
mkdir -p /home/agent/.vnc
Install Python dependencies
Install Python dependencies
COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt
COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt
Security: Drop capabilities
Security: Drop capabilities
RUN apt-get install -y --no-install-recommends libcap2-bin &&
setcap -r /usr/bin/python3 || true
setcap -r /usr/bin/python3 || true
RUN apt-get install -y --no-install-recommends libcap2-bin &&
setcap -r /usr/bin/python3 || true
setcap -r /usr/bin/python3 || true
Copy agent code
Copy agent code
COPY --chown=agent:agent . /app
WORKDIR /app
COPY --chown=agent:agent . /app
WORKDIR /app
Supervisor config for virtual display + VNC
Supervisor config for virtual display + VNC
COPY supervisord.conf /etc/supervisor/conf.d/
COPY supervisord.conf /etc/supervisor/conf.d/
Expose VNC port only (not desktop directly)
Expose VNC port only (not desktop directly)
EXPOSE 5900
EXPOSE 5900
Run as non-root
Run as non-root
USER agent
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
USER agent
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
docker-compose.yml with security constraints
docker-compose.yml with security constraints
version: '3.8'
services:
computer-use-agent:
build: .
ports:
- "5900:5900" # VNC for observation
- "8080:8080" # API for control
# Security constraints
security_opt:
- no-new-privileges:true
- seccomp:seccomp-profile.json
# Resource limits
deploy:
resources:
limits:
cpus: '2'
memory: 4G
reservations:
cpus: '0.5'
memory: 1G
# Network isolation
networks:
- agent-network
# No access to host filesystem
volumes:
- agent-tmp:/tmp
# Read-only root filesystem
read_only: true
tmpfs:
- /run
- /var/run
# Environment
environment:
- DISPLAY=:99
- NO_PROXY=localhostnetworks:
agent-network:
driver: bridge
internal: true # No internet by default
volumes:
agent-tmp:
version: '3.8'
services:
computer-use-agent:
build: .
ports:
- "5900:5900" # VNC for observation
- "8080:8080" # API for control
# Security constraints
security_opt:
- no-new-privileges:true
- seccomp:seccomp-profile.json
# Resource limits
deploy:
resources:
limits:
cpus: '2'
memory: 4G
reservations:
cpus: '0.5'
memory: 1G
# Network isolation
networks:
- agent-network
# No access to host filesystem
volumes:
- agent-tmp:/tmp
# Read-only root filesystem
read_only: true
tmpfs:
- /run
- /var/run
# Environment
environment:
- DISPLAY=:99
- NO_PROXY=localhostnetworks:
agent-network:
driver: bridge
internal: true # No internet by default
volumes:
agent-tmp:
Python wrapper with additional runtime sandboxing
Python wrapper with additional runtime sandboxing
import subprocess
import os
from dataclasses im
undefinedimport subprocess
import os
from dataclasses im
undefinedAnthropic Computer Use Implementation
Anthropic Computer Use 实现
Official implementation pattern using Claude's computer use capability.
Claude 3.5 Sonnet was the first frontier model to offer computer use.
Claude Opus 4.5 is now the "best model in the world for computer use."
Key capabilities:
- screenshot: Capture current screen state
- mouse: Click, move, drag operations
- keyboard: Type text, press keys
- bash: Run shell commands
- text_editor: View and edit files
Tool versions:
- computer_20251124 (Opus 4.5): Adds zoom action for detailed inspection
- computer_20250124 (All other models): Standard capabilities
Critical limitation: "Some UI elements (like dropdowns and scrollbars)
might be tricky for Claude to manipulate" - Anthropic docs
When to use: ['Building production computer use agents', 'Need highest quality vision understanding', 'Full desktop control (not just browser)']
python
from anthropic import Anthropic
from anthropic.types.beta import (
BetaToolComputerUse20241022,
BetaToolBash20241022,
BetaToolTextEditor20241022,
)
import subprocess
import base64
from PIL import Image
import io
class AnthropicComputerUse:
"""
Official Anthropic Computer Use implementation.
Requires:
- Docker container with virtual display
- VNC for viewing agent actions
- Proper tool implementations
"""
def __init__(self):
self.client = Anthropic()
self.model = "claude-sonnet-4-20250514" # Best for computer use
self.screen_size = (1280, 800)
def get_tools(self) -> list:
"""Define computer use tools."""
return [
BetaToolComputerUse20241022(
type="computer_20241022",
name="computer",
display_width_px=self.screen_size[0],
display_height_px=self.screen_size[1],
),
BetaToolBash20241022(
type="bash_20241022",
name="bash",
),
BetaToolTextEditor20241022(
type="text_editor_20241022",
name="str_replace_editor",
),
]
def execute_tool(self, name: str, input: dict) -> dict:
"""Execute a tool and return result."""
if name == "computer":
return self._handle_computer_action(input)
elif name == "bash":
return self._handle_bash(input)
elif name == "str_replace_editor":
return self._handle_editor(input)
else:
return {"error": f"Unknown tool: {name}"}
def _handle_computer_action(self, input: dict) -> dict:
"""Handle computer control actions."""
action = input.get("action")
if action == "screenshot":
# Capture via xdotool/scrot
subprocess.run(["scrot", "/tmp/screenshot.png"])
with open("/tmp/screenshot.png", "rb") as f:
使用Claude的计算机操作能力的官方实现模式。Claude 3.5 Sonnet是首个提供计算机操作能力的前沿模型。Claude Opus 4.5现在是“全球最适合计算机操作的模型”。
核心功能:
- 截图:捕获当前屏幕状态
- 鼠标:点击、移动、拖拽操作
- 键盘:输入文本、按键
- Bash:运行Shell命令
- 文本编辑器:查看和编辑文件
工具版本:
- computer_20251124(Opus 4.5):添加了缩放操作以进行详细检查
- computer_20250124(其他所有模型):标准功能
关键限制:“部分UI元素(如下拉菜单和滚动条)可能对Claude来说难以操作”——Anthropic文档
适用场景:['构建生产级计算机操作Agent', '需要最高质量的视觉理解', '完整桌面控制(不仅仅是浏览器)']
python
from anthropic import Anthropic
from anthropic.types.beta import (
BetaToolComputerUse20241022,
BetaToolBash20241022,
BetaToolTextEditor20241022,
)
import subprocess
import base64
from PIL import Image
import io
class AnthropicComputerUse:
"""
Official Anthropic Computer Use implementation.
Requires:
- Docker container with virtual display
- VNC for viewing agent actions
- Proper tool implementations
"""
def __init__(self):
self.client = Anthropic()
self.model = "claude-sonnet-4-20250514" # Best for computer use
self.screen_size = (1280, 800)
def get_tools(self) -> list:
"""Define computer use tools."""
return [
BetaToolComputerUse20241022(
type="computer_20241022",
name="computer",
display_width_px=self.screen_size[0],
display_height_px=self.screen_size[1],
),
BetaToolBash20241022(
type="bash_20241022",
name="bash",
),
BetaToolTextEditor20241022(
type="text_editor_20241022",
name="str_replace_editor",
),
]
def execute_tool(self, name: str, input: dict) -> dict:
"""Execute a tool and return result."""
if name == "computer":
return self._handle_computer_action(input)
elif name == "bash":
return self._handle_bash(input)
elif name == "str_replace_editor":
return self._handle_editor(input)
else:
return {"error": f"Unknown tool: {name}"}
def _handle_computer_action(self, input: dict) -> dict:
"""Handle computer control actions."""
action = input.get("action")
if action == "screenshot":
# Capture via xdotool/scrot
subprocess.run(["scrot", "/tmp/screenshot.png"])
with open("/tmp/screenshot.png", "rb") as f:
⚠️ Sharp Edges
⚠️ 注意事项
| Issue | Severity | Solution |
|---|---|---|
| Issue | critical | ## Defense in depth - no single solution works |
| Issue | medium | ## Add human-like variance to actions |
| Issue | high | ## Use keyboard alternatives when possible |
| Issue | medium | ## Accept the tradeoff |
| Issue | high | ## Implement context management |
| Issue | high | ## Monitor and limit costs |
| Issue | critical | ## ALWAYS use sandboxing |
| 问题 | 严重程度 | 解决方案 |
|---|---|---|
| 问题 | 严重 | ## 深度防御——单一解决方案无效 |
| 问题 | 中等 | ## 为操作添加类人的随机性 |
| 问题 | 高 | ## 尽可能使用键盘替代方案 |
| 问题 | 中等 | ## 接受这种权衡 |
| 问题 | 高 | ## 实现上下文管理 |
| 问题 | 高 | ## 监控并限制成本 |
| 问题 | 严重 | ## 始终使用沙箱化 |