desktop-test-agent-tauri

Desktop Test Agent (Tauri / Electron)

Two desktop surfaces:

| Engine | When |
|---|---|
| tauri-docker-testing | Cicero Tauri app in Docker: AppImage extraction, WebKitGTK virtual display, Gemini Computer Use automation, DOCX export verification |
| agent-browser electron subcommand | Electron desktop apps via `agent-browser skills get electron` (VS Code, Slack, Discord, Figma, Notion, Spotify) |

⚙️ Default Workflow (start here)

When invoked, say:
"I'll start with the default workflow and assess what stage we're at, then continue from there. If everything is done, I'll come back and ask for your decisions. I can also do A/B/C alternatives — let me know if you want me to lay out capabilities and trade-offs."
Default flow for a Tauri/Electron app:
  1. Probe the build — does the AppImage / Electron binary exist and launch?
  2. Set up environment — virtual display (Xvfb), required system packages, env vars.
  3. Launch the app under instrumentation (Gemini Computer Use or agent-browser CDP).
  4. Capture baseline — screenshot of initial window, log dump.
  5. Run the user's specific check (or "does the export feature work end-to-end" if none specified).
  6. Diagnose — distinguish app crashes vs WebKitGTK issues vs missing system deps.
  7. Iterate — fix and re-run (max 2 retry cycles).
Wait between every step before moving to the next. Don't batch.
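Steps 5 to 7 amount to a bounded check-and-fix loop. The following is a minimal sketch of that loop, assuming the check and the fix are supplied as plain callables; `run_check` and `apply_fix` are hypothetical names used for illustration, not part of any tool mentioned here.

```python
from typing import Callable


def run_with_retries(
    run_check: Callable[[], bool],
    apply_fix: Callable[[int], None],
    max_retries: int = 2,
) -> tuple[bool, int]:
    """Run a check, then fix and re-run at most `max_retries` times.

    Returns (passed, attempts), where attempts counts every check run.
    """
    attempts = 0
    for retry in range(max_retries + 1):  # one initial run plus the retries
        attempts += 1
        if run_check():
            return True, attempts
        if retry < max_retries:
            # Hypothetical hook: diagnose (step 6) and fix before re-running.
            apply_fix(retry + 1)
    return False, attempts
```

With the default of two retries, a check that never passes is run three times in total and then reported as failed, which is the point at which to go back to the user.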

Watching for human comments while waiting

When a step is waiting on the user, a CI run, or any external event, set up a polling watcher:
bash
/loop 10m "check for new comments on PR #<N> via gh CLI; if none, re-ping reviewers"
/schedule "in 20 minutes, re-check comments and continue"
Cadence: start at 10 minutes, back off to 30 minutes if nothing lands. If 30 minutes pass with no comments, repeat the request (re-ping CR, re-ask the user) and continue iterating.
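The cadence can be written down as a small schedule. This is only a sketch of the timing rule described above; the function names are made up for illustration and the code does not call `/loop`, `/schedule`, or the gh CLI.

```python
from typing import Iterator


def polling_intervals(start_min: int = 10, backoff_min: int = 30) -> Iterator[int]:
    """Yield wait times in minutes: the first poll waits `start_min`,
    every later poll waits `backoff_min`."""
    yield start_min
    while True:
        yield backoff_min


def should_reping(minutes_without_comment: int, threshold_min: int = 30) -> bool:
    """True once the backed-off interval has passed with no comments."""
    return minutes_without_comment >= threshold_min
```

Taking the first few values of `polling_intervals()` gives 10, then 30 for every later wait; `should_reping` marks the point where the request is repeated.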

A / B / C alternative approaches

| Path | Capability | Trade-off |
|---|---|---|
| A: Docker + Xvfb + Gemini Computer Use | Fully reproducible CI; works on headless Linux | Slow startup; Gemini token cost |
| B: Local native execution + agent-browser CDP | Fastest iteration; drives the app's own embedded Chromium | Only works in a graphical session; OS-specific |
| C: Manual screenshot + visual diff | Lowest infra cost | Brittle; doesn't catch interaction bugs |

Default to A for Cicero Tauri (Docker-reproducible), B for Electron apps you're developing locally.


Tauri Docker Testing

Test Cicero's Tauri AppImage inside a Docker container with virtual display and Gemini Computer Use for vision-driven UI automation.

Step-by-Step: The Full Pipeline

Follow these steps IN ORDER. Every command is copy-pasteable. Do NOT skip steps.

STEP 1: Build the AppImage

bash
cd /home/arthrod/workspace/potion_deploy
git checkout main && git pull
Build the Vite frontend first (REQUIRED before cargo build):
bash
NODE_ENV=production VITE_ENVIRONMENT=production bun run build:tauri
Then build the Tauri AppImage:
bash
cd src-tauri && cargo tauri build --bundles appimage
Output will be at `src-tauri/target/release/bundle/appimage/Cicero_0.1.0_amd64.AppImage` (~83MB).
If cargo build fails with "beforeBuildCommand" error: The Vite build above didn't run. Run it again.
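Before moving on, it can help to confirm that the artifact really exists. The sketch below checks the output path given above; the size threshold is an assumption chosen only to catch truncated files, and `probe_appimage` is a hypothetical helper, not part of the repo.

```python
import os
from pathlib import Path

# Path copied from the build step above, relative to the repo root.
APPIMAGE = Path("src-tauri/target/release/bundle/appimage/Cicero_0.1.0_amd64.AppImage")


def probe_appimage(path: Path, min_bytes: int = 1_000_000) -> list[str]:
    """Return a list of problems; an empty list means the artifact looks usable."""
    if not path.is_file():
        return [f"missing: {path}"]
    problems = []
    size = path.stat().st_size
    if size < min_bytes:  # assumed threshold, not a value from the build
        problems.append(f"suspiciously small: {size} bytes")
    if not os.access(path, os.X_OK):
        problems.append("not executable (try chmod +x)")
    return problems
```

Run from the repo root, `probe_appimage(APPIMAGE)` returning an empty list means the build step produced something worth copying into the Docker image.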

STEP 2: Build the Docker Image

bash
cd /home/arthrod/workspace/potion_deploy
docker build -t cicero-test -f Dockerfile.computer-use .
If "no such file" error: You're in the wrong directory. `cd` to the repo root.
What the Dockerfile does:
  • Installs Ubuntu 24.04 + Xvfb + VNC + noVNC + Chromium + WebKitGTK deps
  • Copies the AppImage into `/app/`
  • Extracts it with `--appimage-extract` (FUSE doesn't work in Docker, so never try to run the AppImage directly)
  • Registers the `cicero://` deep link scheme via xdg-mime
  • Sets up supervisor to manage Xvfb, fluxbox, x11vnc, noVNC, dbus, and the Cicero app

STEP 3: Start the Container

bash
docker rm -f cicero-test 2>/dev/null
docker run -d --name cicero-test -p 5901:5900 -p 6081:6080 --shm-size=1g cicero-test
Port 5900 in use? That's why we use 5901:5900. If 5901 is also taken, use any free port.
`--shm-size=1g` is REQUIRED. Without it, Chromium crashes with "insufficient shared memory".

STEP 4: Wait for the App to Load

The app takes ~15-30 seconds to start. Supervisor will show `cicero (exit status 101)` warnings. THIS IS NORMAL. The app crashes 2-4 times because Xvfb/dbus aren't ready yet. Supervisor retries and it eventually starts.
bash
sleep 30
Check if the app is running:
bash
docker exec cicero-test ps aux | grep cicero_desktop
You should see `cicero_desktop` in the process list. If not, start it manually:
bash
docker exec -d -e DISPLAY=:99 -e DBUS_SESSION_BUS_ADDRESS=unix:path=/tmp/dbus-session \
  -e XDG_RUNTIME_DIR=/tmp/runtime-agent -e NO_AT_BRIDGE=1 \
  -e WEBKIT_DISABLE_DMABUF_RENDERER=1 -u agent cicero-test /app/cicero/AppRun
sleep 15
ALL of these env vars are REQUIRED:
  • `DISPLAY=:99`: virtual display
  • `DBUS_SESSION_BUS_ADDRESS=unix:path=/tmp/dbus-session`: WebKitGTK needs dbus
  • `XDG_RUNTIME_DIR=/tmp/runtime-agent`: XDG runtime
  • `NO_AT_BRIDGE=1`: suppress accessibility warnings
  • `WEBKIT_DISABLE_DMABUF_RENDERER=1`: CRITICAL, without this the app crashes with GPU/DMA errors
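As an alternative to a fixed `sleep 30`, the wait and the manual start can be scripted. This is a sketch under assumptions: the helper names are invented here, the container and process names are the ones used above, and the poll interval is arbitrary. The `check` and `sleep` parameters exist only so the loop can be exercised without Docker.

```python
import subprocess
import time
from typing import Callable

# Values copied from the list above.
REQUIRED_ENV = {
    "DISPLAY": ":99",
    "DBUS_SESSION_BUS_ADDRESS": "unix:path=/tmp/dbus-session",
    "XDG_RUNTIME_DIR": "/tmp/runtime-agent",
    "NO_AT_BRIDGE": "1",
    "WEBKIT_DISABLE_DMABUF_RENDERER": "1",
}


def manual_start_argv(container: str = "cicero-test") -> list[str]:
    """Build the `docker exec` argv for the manual start shown above."""
    argv = ["docker", "exec", "-d"]
    for name, value in REQUIRED_ENV.items():
        argv += ["-e", f"{name}={value}"]
    return argv + ["-u", "agent", container, "/app/cicero/AppRun"]


def is_running(container: str, run=subprocess.run) -> bool:
    """True if `ps aux` inside the container lists cicero_desktop."""
    result = run(["docker", "exec", container, "ps", "aux"],
                 capture_output=True, text=True)
    return result.returncode == 0 and "cicero_desktop" in result.stdout


def wait_for_app(
    container: str = "cicero-test",
    timeout_s: float = 30.0,
    poll_s: float = 2.0,
    check: Callable[[str], bool] = is_running,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the app shows up or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check(container):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

A typical use would be `if not wait_for_app(): subprocess.run(manual_start_argv(), check=True)`, followed by a second `wait_for_app()` call.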

STEP 5: Close WebKit Inspector (CRITICAL)

THE APP OPENS WEBKIT INSPECTOR BY DEFAULT. Inspector steals ALL keyboard focus. If you skip this step, `xdotool type` and Gemini Computer Use typing will go to the inspector console, NOT the app.
bash
docker exec -e DISPLAY=:99 -u agent cicero-test xdotool key F12
sleep 1

STEP 6: Take a Screenshot to Verify

bash
docker exec -e DISPLAY=:99 -u agent cicero-test scrot /tmp/verify.png
docker cp cicero-test:/tmp/verify.png ./verify.png
You should see the Cicero sign-in page: "Contracts from the future" on the left, email/password form on the right.
If you see a blank/gray desktop: The app didn't start. Go back to Step 4 and start manually.
If you see the WebKit Inspector taking up half the screen: Go back to Step 5.

STEP 7: Run Gemini Computer Use Agent

This is the main automation tool. It takes screenshots, sends them to Gemini, and executes the model's actions via xdotool.
Prerequisites:
  • `google-genai` Python package: `uv pip install google-genai` (or `pip install google-genai`)
  • `GEMINI_API_KEY` env var set
bash
GEMINI_API_KEY=$GEMINI_API_KEY python3 tooling/scripts/tauri-computer-use.py \
  --container cicero-test \
  --goal "Create a new account with email test@cicero.im, password TDDisthesolution, first name Test, last name User. After login, click the WRITE card to create a document. Type a haiku in the editor. Then export as DOCX." \
  --model gemini-3-flash-preview \
  --max-turns 20
The agent will:
  1. Find the Sign Up link and click it
  2. Fill in the sign-up form fields
  3. Accept terms and submit
  4. Click the WRITE card on the dashboard
  5. Type text in the editor
  6. Find the export button and trigger DOCX download
  7. Handle the GTK save dialog
If the agent gets stuck in a safety confirmation loop: The Gemini CU model keeps asking for confirmation on downloads. The script auto-confirms, but sometimes the model loops. Kill it (Ctrl+C) and handle the remaining steps manually.
If `response.candidates` is None: Safety block from Gemini. The script handles this gracefully and retries.

STEP 8: Manual Typing Fallback

If the Computer Use agent can't type in the Plate editor (text doesn't appear), do it manually:
bash
# MUST close inspector first (Step 5)
docker exec -e DISPLAY=:99 -u agent cicero-test bash -c "
  xdotool mousemove 600 300 && sleep 0.3 && xdotool click 1 && sleep 0.5
  xdotool type --delay 30 'Contracts from the past'
  xdotool key Return
  xdotool type --delay 30 'AI writes the future now'
  xdotool key Return
  xdotool type --delay 30 'Cicero guides all'
"

**MUST use `xdotool type --delay 30 'text'`.** Individual `xdotool key X` calls do NOT trigger Slate/Plate input events. The `type` command fires the proper IME/input pipeline that Plate.js listens to.

**`xdotool type` mangles uppercase** — it uses `--clearmodifiers` which strips Shift. "AI" becomes "ai", "Cicero" becomes "cicero". This is cosmetic and acceptable for testing.
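When the text to type contains characters such as `&`, `!`, or `#`, nesting it inside `bash -c "..."` invites quoting mistakes. One way around that, sketched here with hypothetical helper names, is to build the `docker exec` call as an argument list so no shell ever parses the text. The sketch assumes the text does not start with `-`.

```python
import shlex


def xdotool_type_argv(text: str, container: str = "cicero-test",
                      delay_ms: int = 30) -> list[str]:
    """argv for typing `text` into the focused window via `xdotool type`.

    Passed as a list (no shell), characters like & ! # reach xdotool
    unchanged. Assumes `text` does not start with "-".
    """
    return [
        "docker", "exec", "-e", "DISPLAY=:99", "-u", "agent", container,
        "xdotool", "type", "--delay", str(delay_ms), text,
    ]


def as_shell_command(argv: list[str]) -> str:
    """Render the argv as a copy-pasteable, safely quoted shell line."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

`subprocess.run(xdotool_type_argv("p&ss!#"), check=True)` would run it directly; `as_shell_command` is there for logging or pasting the same call into a terminal.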

STEP 9: DOCX Export

After typing content in the editor:
  1. Use the Computer Use agent to click the export button:
bash
GEMINI_API_KEY=$GEMINI_API_KEY python3 tooling/scripts/tauri-computer-use.py \
  --container cicero-test \
  --goal "Click the export/download icon in the toolbar and export as DOCX" \
  --model gemini-3-flash-preview \
  --max-turns 5
  2. If the agent gets stuck, do it manually. The export flow is:
    • Click export icon in toolbar → "Download" dialog appears (format: WORD)
    • Click red "Download" button → "Export to DOCX" confirmation dialog
    • Click red "Continue" button → GTK "Save File" dialog
    • Press Enter to save with default filename
  3. Check if the DOCX was saved:
bash
docker exec cicero-test find / -name "*.docx" -type f 2>/dev/null
  4. Copy it out:
bash
docker cp cicero-test:/path/to/file.docx ./output.docx
  5. Verify it's valid:
bash
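python3 -c "
import re
import zipfile

# A .docx is a zip archive; the body text lives in word/document.xml.
with zipfile.ZipFile('./output.docx') as z:
    doc = z.read('word/document.xml').decode('utf-8')
# Match <w:t> and <w:t attr=...> only, not <w:tab/>, <w:tbl>, or <w:tc>.
texts = re.findall(r'<w:t(?:\s[^>]*)?>([^<]+)</w:t>', doc)
for t in texts:
    print(t)
"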
If DOCX export fails (yellow warning in Download dialog):
  • Check that `fs:allow-write-file` is in `src-tauri/capabilities/default.json`
  • Without this permission, the GTK save dialog appears but `writeFile()` silently fails
  • This was the bug we found: `fs:default` only grants READ, not write
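For a check that can fail a test run rather than just print text, the verification can parse the XML properly and compare against the lines that were typed. This is a sketch with hypothetical helper names; the comparison is case-insensitive because `xdotool type` may lowercase the input, as noted in Step 8.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path: str) -> list[str]:
    """Return the text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text can be split across several <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def missing_lines(path: str, expected: list[str]) -> list[str]:
    """Expected lines that do not occur anywhere in the document text."""
    haystack = "\n".join(docx_paragraphs(path)).lower()
    return [line for line in expected if line.lower() not in haystack]
```

For the haiku from Step 8, `missing_lines("./output.docx", ["Contracts from the past", "AI writes the future now", "Cicero guides all"])` should come back empty when the export worked.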

STEP 10: Verify & Clean Up

Take final screenshot:
bash
docker exec -e DISPLAY=:99 -u agent cicero-test scrot /tmp/final.png
docker cp cicero-test:/tmp/final.png ./final.png
Stop container:
bash
docker rm -f cicero-test

Gemini Computer Use: How `tauri-computer-use.py` Works

Architecture

Host machine                          Docker container (Ubuntu 24.04)
┌──────────────────┐                  ┌─────────────────────────────┐
│ tauri-computer-  │  docker exec     │ Xvfb :99 (virtual display)  │
│ use.py           │ ──────────────>  │ Cicero AppImage (WebKitGTK) │
│                  │  scrot → PNG     │ x11vnc (VNC on :5900)       │
│ Gemini CU API    │ <──────────────  │ noVNC (web on :6080)        │
│ (google-genai)   │  xdotool cmds    │ fluxbox (window manager)    │
│                  │ ──────────────>  │ dbus-daemon                 │
└──────────────────┘                  └─────────────────────────────┘

Agent Loop

  1. Takes screenshot from Docker via `docker exec scrot`
  2. Sends screenshot + goal to Gemini Computer Use API (native `google-genai` SDK)
  3. Model returns `function_call` actions with normalized coordinates (0-999)
  4. Script denormalizes: `actual_x = x / 1000 * 1440`, `actual_y = y / 1000 * 960`
  5. Executes via `xdotool` (click, type, scroll, key combos)
  6. Takes new screenshot, sends back as `FunctionResponse` with:
    • `url`: `"cicero://localhost"` (REQUIRED: 400 error without it)
    • `safety_acknowledgement`: `"true"` (REQUIRED when model sends `require_confirmation`)
    • Screenshot as `FunctionResponsePart` with `inline_data` blob
  7. Loop until model says done or max turns reached
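Step 4 of the loop is the part most easily gotten wrong, so here is the coordinate mapping as a standalone sketch. It follows the formula above with a 1440x960 screen; whether the real script truncates or rounds is not stated, so the integer floor used here is an assumption, and both function names are hypothetical.

```python
SCREEN_WIDTH = 1440   # from the denormalization formula above
SCREEN_HEIGHT = 960


def denormalize(x: int, y: int, width: int = SCREEN_WIDTH,
                height: int = SCREEN_HEIGHT) -> tuple[int, int]:
    """Map model coordinates on the 0-999 grid to screen pixels."""
    if not (0 <= x <= 999 and 0 <= y <= 999):
        raise ValueError(f"normalized coordinate out of range: ({x}, {y})")
    # Integer math avoids float rounding; floors the result (assumption).
    return x * width // 1000, y * height // 1000


def xdotool_click_argv(x: int, y: int, container: str = "cicero-test") -> list[str]:
    """argv for a left click at the denormalized position."""
    px, py = denormalize(x, y)
    return [
        "docker", "exec", "-e", "DISPLAY=:99", "-u", "agent", container,
        "xdotool", "mousemove", str(px), str(py), "click", "1",
    ]
```

A model action at (500, 500) therefore lands at pixel (720, 480), the centre of the virtual display.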

Supported Gemini Models

| Model | Use Case |
|---|---|
| `gemini-3-flash-preview` | Fast, good for most tasks |
| `gemini-2.5-computer-use-preview-10-2025` | Dedicated CU model, more accurate |

What DOESN'T Work for Automation

| Approach | Why It Fails | What Happens |
|---|---|---|
| Midscene | Uses OpenAI-compatible endpoint (`/v1beta/openai/`) which does NOT support Computer Use | Returns "empty content from AI model" on every Gemini model |
| agent-browser via noVNC | VNC canvas is a single `<canvas>` element | agent-browser sees noVNC controls (disconnect, clipboard) but NOT the app UI inside |
| xdotool coordinate guessing | No vision: you're guessing pixel positions | Breaks when layout changes, wastes time iterating |
| `xdotool key` per character | Individual key events don't trigger Slate/Plate input | Text never appears in the editor contenteditable |

Dockerfile Requirements

`Dockerfile.computer-use` must include ALL of these:
dockerfile
RUN apt-get install -y \
    # Virtual display + window manager
    xvfb fluxbox xterm \
    # VNC + noVNC (browser-based VNC viewer)
    x11vnc novnc websockify \
    # OAuth browser (for Google OAuth roundtrip)
    chromium-browser \
    # WebKitGTK — Tauri's Linux rendering engine
    libwebkit2gtk-4.1-0 libgtk-3-0t64 \
    libappindicator3-1 librsvg2-2 libsoup-3.0-0 \
    # D-Bus + Accessibility (WebKitGTK REFUSES to start without dbus)
    dbus-x11 at-spi2-core libatk-bridge2.0-0 libatspi2.0-0 \
    # AppImage extraction (FUSE doesn't work in Docker)
    libfuse2t64 \
    # UI automation tools
    scrot xdotool wmctrl \
    # Deep link registration (cicero:// scheme)
    xdg-utils desktop-file-utils \
    # Midscene deps (if you want to try Midscene — it won't work for CU but connect/screenshot work)
    nodejs npm imagemagick x11-xserver-utils \
    # Process manager
    supervisor \
    # Fonts (without these, screenshots show boxes instead of text)
    fonts-liberation fonts-noto-color-emoji fonts-dejavu \
    # Misc
    curl ca-certificates

Deep Link Registration

Register `cicero://` so OAuth deep links work:
dockerfile
RUN cat > /usr/share/applications/cicero-handler.desktop << 'EOF'
[Desktop Entry]
Name=Cicero Deep Link Handler
Exec=/app/cicero/AppRun %u
Type=Application
MimeType=x-scheme-handler/cicero;
NoDisplay=true
EOF
RUN update-desktop-database /usr/share/applications/

Supervisor Config

ini
[program:cicero]
command=/app/cicero/AppRun
environment=DISPLAY=":99",DBUS_SESSION_BUS_ADDRESS="unix:path=/tmp/dbus-session",XDG_RUNTIME_DIR="/tmp/runtime-agent",NO_AT_BRIDGE="1",WEBKIT_DISABLE_DMABUF_RENDERER="1"
autorestart=true
user=agent
priority=40
startsecs=5
startretries=10
`startretries=10` because the app crashes on startup until Xvfb is ready. `startsecs=5` gives it time to actually initialize.

Tauri Capabilities (Permissions)

File: `src-tauri/capabilities/default.json`
These permissions MUST be present:
json
{
  "permissions": [
    "core:default",
    "deep-link:default",
    "dialog:default",
    "fs:default",
    "fs:allow-write-file",
    "fs:allow-write-text-file",
    "http:default",
    "opener:default",
    "os:default",
    "sql:default",
    "sql:allow-execute",
    "clipboard-manager:default",
    "notification:default",
    "store:default",
    "upload:default",
    "websocket:default"
  ]
}
`fs:allow-write-file` is CRITICAL. Without it:
  • The GTK "Save File" dialog appears (from `@tauri-apps/plugin-dialog` `save()`)
  • User picks a filename and clicks Save
  • `@tauri-apps/plugin-fs` `writeFile()` silently fails: no error, no file
  • The DOCX export appears to work but the file is never written
ALL app methods are CLIENT-SIDE. DOCX export, file save, clipboard, notifications: these all use Tauri plugins (`@tauri-apps/plugin-fs`, `@tauri-apps/plugin-dialog`). They do NOT depend on Cloudflare Worker bindings. The server only provides auth, document CRUD, and AI endpoints.
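Because the failure is silent, it is worth checking the capabilities file before building. The sketch below looks only for the permissions the export flow relies on; the choice of which identifiers to require is an assumption drawn from the explanation above, and `missing_permissions` is a hypothetical helper.

```python
import json

# Permissions the save dialog and file write depend on, per the text above.
REQUIRED_PERMISSIONS = frozenset(
    {"dialog:default", "fs:default", "fs:allow-write-file"}
)


def missing_permissions(capabilities_json: str,
                        required=REQUIRED_PERMISSIONS) -> set[str]:
    """Return required permission identifiers absent from a capabilities file.

    Entries may be plain strings or objects carrying an "identifier" key.
    """
    data = json.loads(capabilities_json)
    present = set()
    for entry in data.get("permissions", []):
        if isinstance(entry, str):
            present.add(entry)
        elif isinstance(entry, dict) and "identifier" in entry:
            present.add(entry["identifier"])
    return set(required) - present
```

Feeding it the contents of `src-tauri/capabilities/default.json` and failing the run when the result is non-empty would have caught the missing `fs:allow-write-file` before any UI automation started.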

All Known Pitfalls (Complete List)

| # | Pitfall | Symptom | Fix |
|---|---|---|---|
| 1 | WebKit Inspector open | Typing goes to console, not editor | Press F12 before any text input |
| 2 | Missing `WEBKIT_DISABLE_DMABUF_RENDERER=1` | App crashes immediately with GPU error | Add to env vars in supervisor and manual start |
| 3 | Missing `dbus-x11` package | WebKitGTK refuses to start, exit 101 | Install in Dockerfile |
| 4 | AppImage run directly (not extracted) | "FUSE not available" error | Always `--appimage-extract` first |
| 5 | `fs:default` without `fs:allow-write-file` | DOCX save dialog works but file never writes | Add `fs:allow-write-file` to capabilities |
| 6 | `xdotool key` per character in Plate | Text never appears in editor | Use `xdotool type --delay 30 'text'` |
| 7 | Port 5900 already in use | Container fails to start | Use `-p 5901:5900 -p 6081:6080` |
| 8 | Missing `--shm-size=1g` | Chromium crashes | Add to `docker run` |
| 9 | Supervisor gives up on cicero | `FATAL state, too many retries` | Increase `startretries=10` or start manually |
| 10 | Midscene for computer use | "empty content from AI model" | Use `tauri-computer-use.py` with native Gemini CU API |
| 11 | agent-browser via noVNC | Can only see canvas, not app UI | Use Gemini CU or direct xdotool |
| 12 | Missing `safety_acknowledgement` in CU response | Gemini returns 400 INVALID_ARGUMENT | Add `"safety_acknowledgement": "true"` to FunctionResponse |
| 13 | `response.candidates` is None | Safety block or rate limit | Handle gracefully, retry with new screenshot |
| 14 | Tab once from email to password | Lands on "Forgot password?" link | Tab TWICE |
| 15 | Protocol detection `tauri://` | Auth shim not installed, CORS blocked, splash hangs | Use `!/^https?:$/.test()` not `=== "tauri:"` |
| 16 | `handleAuthDeepLink` follows redirects | Gets 404 instead of processing callback | Add `{ redirect: "manual" }` |
| 17 | Special chars in password via xdotool | `&`, `!`, `#` interpreted as shell metacharacters | Use Gemini CU or escape properly |
| 18 | `.env` test password wrong | "Invalid email or password" | Create new account via sign-up instead |
| 19 | `xdotool type` mangles uppercase | "AI" becomes "ai" due to `--clearmodifiers` | Cosmetic; acceptable for testing |
| 20 | Gemini CU calls `wait_5_seconds` / `scroll_document` | "Unimplemented action" in script | These are implemented in the script now |