
Computer Use Agents

Patterns

Perception-Reasoning-Action Loop

The fundamental architecture of computer use agents: observe screen, reason about next action, execute action, repeat. This loop integrates vision models with action execution through an iterative pipeline.
Key components:
  1. PERCEPTION: Screenshot captures current screen state
  2. REASONING: Vision-language model analyzes and plans
  3. ACTION: Execute mouse/keyboard operations
  4. FEEDBACK: Observe result, continue or correct
Critical insight: Vision agents are completely still during the "thinking" phase (1-5 seconds), creating a detectable pause pattern.
When to use:
  • Building any computer use agent from scratch
  • Integrating vision models with desktop control
  • Understanding agent behavior patterns
```python
from anthropic import Anthropic
from PIL import Image
import base64
import io
import pyautogui
import time

class ComputerUseAgent:
    """
    Perception-Reasoning-Action loop implementation.
    Based on Anthropic Computer Use patterns.
    """

    def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
        self.client = client
        self.model = model
        self.max_steps = 50  # Prevent runaway loops
        self.action_delay = 0.5  # Seconds between actions

    def capture_screenshot(self) -> str:
        """Capture screen and return base64 encoded image."""
        screenshot = pyautogui.screenshot()
        # Resize for token efficiency (1280x800 is a good balance)
        screenshot = screenshot.resize((1280, 800), Image.LANCZOS)

        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode()

    def execute_action(self, action: dict) -> dict:
        """Execute mouse/keyboard action on the computer."""
        action_type = action.get("type")

        if action_type == "click":
            x, y = action["x"], action["y"]
            button = action.get("button", "left")
            pyautogui.click(x, y, button=button)
            return {"success": True, "action": f"clicked at ({x}, {y})"}

        elif action_type == "type":
            text = action["text"]
            pyautogui.typewrite(text, interval=0.02)
            return {"success": True, "action": f"typed {len(text)} chars"}

        elif action_type == "key":
            key = action["key"]
            pyautogui.press(key)
            return {"success": True, "action": f"pressed {key}"}

        elif action_type == "scroll":
            direction = action.get("direction", "down")
            amount = action.get("amount", 3)
            scroll = -amount if direction == "down" else amount
            pyautogui.scroll(scroll)
            return {"success": True, "action": f"scrolled {direction} by {amount}"}

        return {"success": False, "error": f"unknown action type: {action_type}"}
```
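The class above stops short of the loop itself. A minimal driver for the capture-reason-act cycle might look like the sketch below; `plan_next_action` is a hypothetical hook standing in for the vision-model call (it receives the goal, the latest screenshot, and the action history, and returns either an action dict or `None` when the task is done). Injecting it keeps the control flow visible without an API key.

```python
import time

def run_task(agent, plan_next_action, goal: str, max_steps: int = 50) -> list:
    """Drive the Perception-Reasoning-Action loop until the planner signals completion."""
    history = []
    for step in range(max_steps):
        screenshot_b64 = agent.capture_screenshot()               # PERCEPTION
        action = plan_next_action(goal, screenshot_b64, history)  # REASONING
        if action is None:                                        # planner says: done
            break
        result = agent.execute_action(action)                     # ACTION
        history.append({"step": step, "action": action, "result": result})  # FEEDBACK
        time.sleep(agent.action_delay)                            # pace the actions
    return history
```

Because the planner and agent are both injected, the same loop can be exercised with fakes in tests before wiring in a real model and real input control.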

Sandboxed Environment Pattern

Computer use agents MUST run in isolated, sandboxed environments. Never give agents direct access to your main system - the security risks are too high. Use Docker containers with virtual desktops.
Key isolation requirements:
  1. NETWORK: Restrict to necessary endpoints only
  2. FILESYSTEM: Read-only or scoped to temp directories
  3. CREDENTIALS: No access to host credentials
  4. SYSCALLS: Filter dangerous system calls
  5. RESOURCES: Limit CPU, memory, time
The goal is "blast radius minimization" - if the agent goes wrong, damage is contained to the sandbox.
When to use:
  • Deploying any computer use agent
  • Testing agent behavior safely
  • Running untrusted automation tasks
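The five isolation requirements map directly onto container options. As a sketch, the helper below builds the keyword arguments one might pass to docker-py's `client.containers.run()`; building the dict is testable without a Docker daemon, and the image, network, and size values are illustrative choices, not from the original.

```python
def sandbox_run_kwargs(image: str = "computer-use-agent:latest") -> dict:
    """Container options covering the five isolation requirements above."""
    return {
        "image": image,
        "detach": True,
        "network": "agent-network",                 # 1. NETWORK: internal bridge only
        "read_only": True,                          # 2. FILESYSTEM: read-only root...
        "tmpfs": {"/tmp": "size=512m"},             #    ...with tmpfs scratch space
        "environment": {"DISPLAY": ":99"},          # 3. CREDENTIALS: clean environment
        "security_opt": ["no-new-privileges:true"], # 4. SYSCALLS: no privilege escalation
        "cpu_quota": 200_000,                       # 5. RESOURCES: ~2 CPUs...
        "mem_limit": "4g",                          #    ...and 4G memory
    }
```

Passing one dict keeps the security posture in a single reviewable place instead of scattered across call sites.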
```dockerfile
# Dockerfile for sandboxed computer use environment
# Based on Anthropic's reference implementation pattern
FROM ubuntu:22.04

# Install desktop environment
RUN apt-get update && apt-get install -y \
    xvfb \
    x11vnc \
    fluxbox \
    xterm \
    firefox \
    python3 \
    python3-pip \
    supervisor

# Security: Create non-root user
RUN useradd -m -s /bin/bash agent && \
    mkdir -p /home/agent/.vnc

# Install Python dependencies
COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt

# Security: Drop capabilities
RUN apt-get install -y --no-install-recommends libcap2-bin && \
    setcap -r /usr/bin/python3 || true

# Copy agent code
COPY --chown=agent:agent . /app
WORKDIR /app

# Supervisor config for virtual display + VNC
COPY supervisord.conf /etc/supervisor/conf.d/

# Expose VNC port only (not desktop directly)
EXPOSE 5900

# Run as non-root
USER agent
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
```

```yaml
# docker-compose.yml with security constraints
version: '3.8'
services:
  computer-use-agent:
    build: .
    ports:
      - "5900:5900"  # VNC for observation
      - "8080:8080"  # API for control

    # Security constraints
    security_opt:
      - no-new-privileges:true
      - seccomp:seccomp-profile.json

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '0.5'
          memory: 1G

    # Network isolation
    networks:
      - agent-network

    # No access to host filesystem
    volumes:
      - agent-tmp:/tmp

    # Read-only root filesystem
    read_only: true
    tmpfs:
      - /run
      - /var/run

    # Environment
    environment:
      - DISPLAY=:99
      - NO_PROXY=localhost

networks:
  agent-network:
    driver: bridge
    internal: true  # No internet by default

volumes:
  agent-tmp:
```

```python
# Python wrapper with additional runtime sandboxing
import subprocess
import os
from dataclasses import dataclass
```

Anthropic Computer Use Implementation

Official implementation pattern using Claude's computer use capability. Claude 3.5 Sonnet was the first frontier model to offer computer use. Claude Opus 4.5 is now the "best model in the world for computer use."
Key capabilities:
  • screenshot: Capture current screen state
  • mouse: Click, move, drag operations
  • keyboard: Type text, press keys
  • bash: Run shell commands
  • text_editor: View and edit files
Tool versions:
  • computer_20251124 (Opus 4.5): Adds zoom action for detailed inspection
  • computer_20250124 (All other models): Standard capabilities
Critical limitation: "Some UI elements (like dropdowns and scrollbars) might be tricky for Claude to manipulate" - Anthropic docs
When to use:
  • Building production computer use agents
  • Need highest quality vision understanding
  • Full desktop control (not just browser)
python
from anthropic import Anthropic
from anthropic.types.beta import (
    BetaToolComputerUse20241022,
    BetaToolBash20241022,
    BetaToolTextEditor20241022,
)
import subprocess
import base64
from PIL import Image
import io

class AnthropicComputerUse:
    """
    Official Anthropic Computer Use implementation.

    Requires:
    - Docker container with virtual display
    - VNC for viewing agent actions
    - Proper tool implementations
    """

    def __init__(self):
        self.client = Anthropic()
        self.model = "claude-sonnet-4-20250514"  # Best for computer use
        self.screen_size = (1280, 800)

    def get_tools(self) -> list:
        """Define computer use tools."""
        return [
            BetaToolComputerUse20241022(
                type="computer_20241022",
                name="computer",
                display_width_px=self.screen_size[0],
                display_height_px=self.screen_size[1],
            ),
            BetaToolBash20241022(
                type="bash_20241022",
                name="bash",
            ),
            BetaToolTextEditor20241022(
                type="text_editor_20241022",
                name="str_replace_editor",
            ),
        ]

    def execute_tool(self, name: str, input: dict) -> dict:
        """Execute a tool and return result."""

        if name == "computer":
            return self._handle_computer_action(input)
        elif name == "bash":
            return self._handle_bash(input)
        elif name == "str_replace_editor":
            return self._handle_editor(input)
        else:
            return {"error": f"Unknown tool: {name}"}

    def _handle_computer_action(self, input: dict) -> dict:
        """Handle computer control actions."""
        action = input.get("action")

        if action == "screenshot":
            # Capture via xdotool/scrot
            subprocess.run(["scrot", "/tmp/screenshot.png"])

            with open("/tmp/screenshot.png", "rb") as f:
            
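The class above defines the tools and their execution but not the sampling loop that drives them. A sketch of that loop follows; `create_message` is injected in place of the real `client.beta.messages.create(...)` call (which would also pass the tools from `get_tools()` and `betas=["computer-use-2024-10-22"]`), so the tool_use / tool_result plumbing is visible without credentials. The message shapes follow the Messages API; treat the rest as illustrative.

```python
def agent_loop(create_message, execute_tool, task: str, max_turns: int = 20) -> list:
    """Sample, execute requested tools, feed results back, until the model stops."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        response = create_message(messages)              # model proposes the next step
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":        # no tool requested: done
            break
        # Execute every tool call and return results in a single user turn
        results = []
        for block in response["content"]:
            if block.get("type") == "tool_use":
                output = execute_tool(block["name"], block["input"])
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block["id"],
                    "content": str(output),
                })
        messages.append({"role": "user", "content": results})
    return messages
```

The `max_turns` cap plays the same role as `max_steps` in the earlier loop: it bounds cost and prevents the agent from cycling forever on a stuck UI.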

⚠️ Sharp Edges

| Issue | Severity | Solution |
| --- | --- | --- |
| Prompt injection from untrusted on-screen content | critical | Defense in depth - no single solution works |
| Robotic input patterns trigger anti-bot detection | medium | Add human-like variance to actions |
| Dropdowns and scrollbars are hard to manipulate | high | Use keyboard alternatives when possible |
| The "thinking" pause adds 1-5 s of latency per step | medium | Accept the tradeoff |
| Screenshots rapidly fill the context window | high | Implement context management |
| Per-step vision calls get expensive on long tasks | high | Monitor and limit costs |
| An agent with real input control can damage the host | critical | ALWAYS use sandboxing |
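One way to add the human-like variance the table recommends: jitter click coordinates with a clamped Gaussian offset and pace actions with log-normal delays instead of a fixed interval. The functions below are pure computation, so they can feed pyautogui or any other backend; the distribution parameters are illustrative choices, not tuned values.

```python
import random

def humanize_click(x: int, y: int, radius: int = 3) -> tuple:
    """Return a click target offset by a small, clamped Gaussian jitter."""
    jx = int(random.gauss(0, radius / 2))
    jy = int(random.gauss(0, radius / 2))
    # Clamp so the click never drifts off the intended element
    jx = max(-radius, min(radius, jx))
    jy = max(-radius, min(radius, jy))
    return (x + jx, y + jy)

def human_delay(base: float = 0.4) -> float:
    """Log-normal pause: mostly short, occasionally longer, never fixed."""
    return base * random.lognormvariate(0, 0.5)
```

A caller would do `pyautogui.click(*humanize_click(x, y))` followed by `time.sleep(human_delay())` in place of a constant `action_delay`.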