
Computer Use Agents

Patterns

Perception-Reasoning-Action Loop

The fundamental architecture of computer use agents: observe screen, reason about next action, execute action, repeat. This loop integrates vision models with action execution through an iterative pipeline.
Key components:
  1. PERCEPTION: Screenshot captures current screen state
  2. REASONING: Vision-language model analyzes and plans
  3. ACTION: Execute mouse/keyboard operations
  4. FEEDBACK: Observe result, continue or correct
Critical insight: Vision agents are completely still during the "thinking" phase (1-5 seconds), creating a detectable pause pattern.
When to use:
  • Building any computer use agent from scratch
  • Integrating vision models with desktop control
  • Understanding agent behavior patterns
```python
from anthropic import Anthropic
from PIL import Image
import base64
import io
import pyautogui
import time

class ComputerUseAgent:
    """
    Perception-Reasoning-Action loop implementation.
    Based on Anthropic Computer Use patterns.
    """

    def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
        self.client = client
        self.model = model
        self.max_steps = 50  # Prevent runaway loops
        self.action_delay = 0.5  # Seconds between actions

    def capture_screenshot(self) -> str:
        """Capture screen and return base64 encoded image."""
        screenshot = pyautogui.screenshot()
        # Resize for token efficiency (1280x800 is a good balance)
        screenshot = screenshot.resize((1280, 800), Image.LANCZOS)

        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode()

    def execute_action(self, action: dict) -> dict:
        """Execute mouse/keyboard action on the computer."""
        action_type = action.get("type")

        if action_type == "click":
            x, y = action["x"], action["y"]
            button = action.get("button", "left")
            pyautogui.click(x, y, button=button)
            return {"success": True, "action": f"clicked at ({x}, {y})"}

        elif action_type == "type":
            text = action["text"]
            pyautogui.typewrite(text, interval=0.02)
            return {"success": True, "action": f"typed {len(text)} chars"}

        elif action_type == "key":
            key = action["key"]
            pyautogui.press(key)
            return {"success": True, "action": f"pressed {key}"}

        elif action_type == "scroll":
            direction = action.get("direction", "down")
            amount = action.get("amount", 3)
            scroll = -amount if direction == "down" else amount
            pyautogui.scroll(scroll)
            return {"success": True, "action": f"scrolled {direction} by {amount}"}

        return {"success": False, "error": f"unknown action type: {action_type}"}
```
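The class above stops short of the loop itself. A minimal driver for the capture-reason-act cycle might look like the sketch below; `plan_next_action` is a hypothetical hook standing in for the vision-model call (it receives the goal, the latest screenshot, and the action history, and returns either an action dict or `None` when the task is done). Injecting it keeps the control flow visible without an API key.

```python
import time

def run_task(agent, plan_next_action, goal: str, max_steps: int = 50) -> list:
    """Drive the Perception-Reasoning-Action loop until the planner signals completion."""
    history = []
    for step in range(max_steps):
        screenshot_b64 = agent.capture_screenshot()               # PERCEPTION
        action = plan_next_action(goal, screenshot_b64, history)  # REASONING
        if action is None:                                        # planner says: done
            break
        result = agent.execute_action(action)                     # ACTION
        history.append({"step": step, "action": action, "result": result})  # FEEDBACK
        time.sleep(agent.action_delay)                            # pace the actions
    return history
```

Because the planner and agent are both injected, the same loop can be exercised with fakes in tests before wiring in a real model and real input control.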

Sandboxed Environment Pattern

Computer use agents MUST run in isolated, sandboxed environments. Never give agents direct access to your main system - the security risks are too high. Use Docker containers with virtual desktops.
Key isolation requirements:
  1. NETWORK: Restrict to necessary endpoints only
  2. FILESYSTEM: Read-only or scoped to temp directories
  3. CREDENTIALS: No access to host credentials
  4. SYSCALLS: Filter dangerous system calls
  5. RESOURCES: Limit CPU, memory, time
The goal is "blast radius minimization" - if the agent goes wrong, damage is contained to the sandbox.
When to use:
  • Deploying any computer use agent
  • Testing agent behavior safely
  • Running untrusted automation tasks
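The five isolation requirements map directly onto container options. As a sketch, the helper below builds the keyword arguments one might pass to docker-py's `client.containers.run()`; building the dict is testable without a Docker daemon, and the image, network, and size values are illustrative choices, not from the original.

```python
def sandbox_run_kwargs(image: str = "computer-use-agent:latest") -> dict:
    """Container options covering the five isolation requirements above."""
    return {
        "image": image,
        "detach": True,
        "network": "agent-network",                 # 1. NETWORK: internal bridge only
        "read_only": True,                          # 2. FILESYSTEM: read-only root...
        "tmpfs": {"/tmp": "size=512m"},             #    ...with tmpfs scratch space
        "environment": {"DISPLAY": ":99"},          # 3. CREDENTIALS: clean environment
        "security_opt": ["no-new-privileges:true"], # 4. SYSCALLS: no privilege escalation
        "cpu_quota": 200_000,                       # 5. RESOURCES: ~2 CPUs...
        "mem_limit": "4g",                          #    ...and 4G memory
    }
```

Passing one dict keeps the security posture in a single reviewable place instead of scattered across call sites.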
```dockerfile
# Dockerfile for sandboxed computer use environment
# Based on Anthropic's reference implementation pattern
FROM ubuntu:22.04

# Install desktop environment
RUN apt-get update && apt-get install -y \
    xvfb \
    x11vnc \
    fluxbox \
    xterm \
    firefox \
    python3 \
    python3-pip \
    supervisor

# Security: Create non-root user
RUN useradd -m -s /bin/bash agent && \
    mkdir -p /home/agent/.vnc

# Install Python dependencies
COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt

# Security: Drop capabilities
RUN apt-get install -y --no-install-recommends libcap2-bin && \
    setcap -r /usr/bin/python3 || true

# Copy agent code
COPY --chown=agent:agent . /app
WORKDIR /app

# Supervisor config for virtual display + VNC
COPY supervisord.conf /etc/supervisor/conf.d/

# Expose VNC port only (not desktop directly)
EXPOSE 5900

# Run as non-root
USER agent
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
```

```yaml
# docker-compose.yml with security constraints
version: '3.8'
services:
  computer-use-agent:
    build: .
    ports:
      - "5900:5900"  # VNC for observation
      - "8080:8080"  # API for control

    # Security constraints
    security_opt:
      - no-new-privileges:true
      - seccomp:seccomp-profile.json

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '0.5'
          memory: 1G

    # Network isolation
    networks:
      - agent-network

    # No access to host filesystem
    volumes:
      - agent-tmp:/tmp

    # Read-only root filesystem
    read_only: true
    tmpfs:
      - /run
      - /var/run

    # Environment
    environment:
      - DISPLAY=:99
      - NO_PROXY=localhost

networks:
  agent-network:
    driver: bridge
    internal: true  # No internet by default

volumes:
  agent-tmp:
```

```python
# Python wrapper with additional runtime sandboxing
import subprocess
import os
from dataclasses import dataclass
```

Anthropic Computer Use Implementation

Official implementation pattern using Claude's computer use capability. Claude 3.5 Sonnet was the first frontier model to offer computer use. Claude Opus 4.5 is now the "best model in the world for computer use."
Key capabilities:
  • screenshot: Capture current screen state
  • mouse: Click, move, drag operations
  • keyboard: Type text, press keys
  • bash: Run shell commands
  • text_editor: View and edit files
Tool versions:
  • computer_20251124 (Opus 4.5): Adds zoom action for detailed inspection
  • computer_20250124 (All other models): Standard capabilities
Critical limitation: "Some UI elements (like dropdowns and scrollbars) might be tricky for Claude to manipulate" - Anthropic docs
When to use:
  • Building production computer use agents
  • Need highest quality vision understanding
  • Full desktop control (not just browser)
python
from anthropic import Anthropic
from anthropic.types.beta import (
    BetaToolComputerUse20241022,
    BetaToolBash20241022,
    BetaToolTextEditor20241022,
)
import subprocess
import base64
from PIL import Image
import io

class AnthropicComputerUse:
    """
    Official Anthropic Computer Use implementation.

    Requires:
    - Docker container with virtual display
    - VNC for viewing agent actions
    - Proper tool implementations
    """

    def __init__(self):
        self.client = Anthropic()
        self.model = "claude-sonnet-4-20250514"  # Best for computer use
        self.screen_size = (1280, 800)

    def get_tools(self) -> list:
        """Define computer use tools."""
        return [
            BetaToolComputerUse20241022(
                type="computer_20241022",
                name="computer",
                display_width_px=self.screen_size[0],
                display_height_px=self.screen_size[1],
            ),
            BetaToolBash20241022(
                type="bash_20241022",
                name="bash",
            ),
            BetaToolTextEditor20241022(
                type="text_editor_20241022",
                name="str_replace_editor",
            ),
        ]

    def execute_tool(self, name: str, input: dict) -> dict:
        """Execute a tool and return result."""

        if name == "computer":
            return self._handle_computer_action(input)
        elif name == "bash":
            return self._handle_bash(input)
        elif name == "str_replace_editor":
            return self._handle_editor(input)
        else:
            return {"error": f"Unknown tool: {name}"}

    def _handle_computer_action(self, input: dict) -> dict:
        """Handle computer control actions."""
        action = input.get("action")

        if action == "screenshot":
            # Capture via xdotool/scrot
            subprocess.run(["scrot", "/tmp/screenshot.png"])

            with open("/tmp/screenshot.png", "rb") as f:
            
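The class above defines the tools and their execution but not the sampling loop that drives them. A sketch of that loop follows; `create_message` is injected in place of the real `client.beta.messages.create(...)` call (which would also pass the tools from `get_tools()` and `betas=["computer-use-2024-10-22"]`), so the tool_use / tool_result plumbing is visible without credentials. The message shapes follow the Messages API; treat the rest as illustrative.

```python
def agent_loop(create_message, execute_tool, task: str, max_turns: int = 20) -> list:
    """Sample, execute requested tools, feed results back, until the model stops."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        response = create_message(messages)              # model proposes the next step
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":        # no tool requested: done
            break
        # Execute every tool call and return results in a single user turn
        results = []
        for block in response["content"]:
            if block.get("type") == "tool_use":
                output = execute_tool(block["name"], block["input"])
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block["id"],
                    "content": str(output),
                })
        messages.append({"role": "user", "content": results})
    return messages
```

The `max_turns` cap plays the same role as `max_steps` in the earlier loop: it bounds cost and prevents the agent from cycling forever on a stuck UI.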

⚠️ Sharp Edges

| Issue | Severity | Solution |
| --- | --- | --- |
| Prompt injection from untrusted on-screen content | critical | Defense in depth - no single solution works |
| Robotic input patterns trigger anti-bot detection | medium | Add human-like variance to actions |
| Dropdowns and scrollbars are hard to manipulate | high | Use keyboard alternatives when possible |
| The "thinking" pause adds 1-5 s of latency per step | medium | Accept the tradeoff |
| Screenshots rapidly fill the context window | high | Implement context management |
| Per-step vision calls get expensive on long tasks | high | Monitor and limit costs |
| An agent with real input control can damage the host | critical | ALWAYS use sandboxing |
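One way to add the human-like variance the table recommends: jitter click coordinates with a clamped Gaussian offset and pace actions with log-normal delays instead of a fixed interval. The functions below are pure computation, so they can feed pyautogui or any other backend; the distribution parameters are illustrative choices, not tuned values.

```python
import random

def humanize_click(x: int, y: int, radius: int = 3) -> tuple:
    """Return a click target offset by a small, clamped Gaussian jitter."""
    jx = int(random.gauss(0, radius / 2))
    jy = int(random.gauss(0, radius / 2))
    # Clamp so the click never drifts off the intended element
    jx = max(-radius, min(radius, jx))
    jy = max(-radius, min(radius, jy))
    return (x + jx, y + jy)

def human_delay(base: float = 0.4) -> float:
    """Log-normal pause: mostly short, occasionally longer, never fixed."""
    return base * random.lognormvariate(0, 0.5)
```

A caller would do `pyautogui.click(*humanize_click(x, y))` followed by `time.sleep(human_delay())` in place of a constant `action_delay`.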