ai-vision

AI Vision

Overview

This skill provides a standalone CLI that calls multimodal models for UI querying, assertion, and single-step planning. It is device-agnostic: you supply a screenshot and receive structured output (coordinates, assertion decisions, or the next action). Executing actions and running multi-step loops are left to external agents, which drive the device through adb, hdc, or other drivers. Store screenshots in `~/.eval/screenshots/` where possible, and include a timestamp in each filename to avoid overwriting earlier captures.
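The `ui_YYYYMMDD_HHMMSS.png` naming used in the command examples can be generated with a small helper. This is a minimal TypeScript sketch (TypeScript because the skill's own script is TypeScript); `screenshotFileName` and `screenshotPath` are hypothetical names, not part of the skill, and the timestamp is assumed to be in local time.

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Recommended screenshot directory (~/.eval/screenshots/).
const SCREENSHOT_DIR = path.join(os.homedir(), ".eval", "screenshots");

// Zero-pad to two digits, e.g. 7 -> "07".
function pad2(n: number): string {
  return String(n).padStart(2, "0");
}

// Hypothetical helper: build "<prefix>_YYYYMMDD_HHMMSS.png" from local time.
function screenshotFileName(now: Date = new Date(), prefix = "ui"): string {
  const date = `${now.getFullYear()}${pad2(now.getMonth() + 1)}${pad2(now.getDate())}`;
  const time = `${pad2(now.getHours())}${pad2(now.getMinutes())}${pad2(now.getSeconds())}`;
  return `${prefix}_${date}_${time}.png`;
}

// Hypothetical helper: full path of a new screenshot under SCREENSHOT_DIR.
function screenshotPath(now: Date = new Date(), prefix = "ui"): string {
  return path.join(SCREENSHOT_DIR, screenshotFileName(now, prefix));
}
```

The timestamp has one-second resolution, so two captures within the same second still collide; add milliseconds or a counter if you capture faster than that. Create the directory before writing, for example with `fs.mkdirSync(SCREENSHOT_DIR, { recursive: true })`.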

Path Convention

Canonical install and execution directory: `~/.agents/skills/ai-vision/`. Run commands from this directory:
```bash
cd ~/.agents/skills/ai-vision
```
For one-off calls, run the command in a subshell. This is safe in scripts and loops from any working directory, because the caller's own directory is left unchanged:
```bash
(cd ~/.agents/skills/ai-vision && npx tsx scripts/ai_vision.ts --help)
```
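When another Node or TypeScript program drives the CLI, setting the child process's working directory gives the same isolation as the subshell. A minimal sketch, assuming `npx` is on `PATH`; `aiVisionArgs` and `runAiVision` are hypothetical helpers, and only flags that appear in this document are used.

```typescript
import { spawnSync } from "node:child_process";
import * as os from "node:os";
import * as path from "node:path";

// Canonical install directory from the path convention.
const SKILL_DIR = path.join(os.homedir(), ".agents", "skills", "ai-vision");

type Flags = Record<string, string>;

// Build argv for `npx tsx scripts/ai_vision.ts [global flags] <command> [flags]`.
// Global flags (e.g. log-level) go before the command, mirroring
// `--log-level debug <command> [flags]`; command flags follow it.
function aiVisionArgs(command: string, flags: Flags, globalFlags: Flags = {}): string[] {
  const toArgs = (f: Flags) => Object.entries(f).flatMap(([name, value]) => [`--${name}`, value]);
  return ["tsx", "scripts/ai_vision.ts", ...toArgs(globalFlags), command, ...toArgs(flags)];
}

// Run the CLI with cwd = SKILL_DIR. Like the `(cd ... && ...)` subshell,
// this never changes the calling process's own working directory.
function runAiVision(command: string, flags: Flags, globalFlags: Flags = {}) {
  const result = spawnSync("npx", aiVisionArgs(command, flags, globalFlags), {
    cwd: SKILL_DIR,
    encoding: "utf8",
  });
  if (result.error) throw result.error; // e.g. npx not found
  return { status: result.status, stdout: result.stdout, stderr: result.stderr };
}
```

For example, `runAiVision("query", { screenshot, prompt }, { "log-level": "debug" })`. No shell is involved, so `~` is not expanded: pass absolute screenshot paths, such as ones built from `os.homedir()`.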

Model Configuration

Configure the default Doubao model through environment variables:
  • `ARK_BASE_URL` (e.g. `https://ark.cn-beijing.volces.com/api/v3`)
  • `ARK_API_KEY`
  • `ARK_MODEL_NAME`
For non-Doubao providers, pass explicit flags: `--base-url`, `--api-key`, and `--model`.
If no model is provided, the default is `doubao-seed-1-6-vision-250815`.
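The precedence between flags, environment variables, and the default model is not spelled out above. The sketch below assumes that explicit flags win over the `ARK_*` variables, which in turn win over the built-in default; `resolveModelConfig` is a hypothetical helper, not the script's actual code.

```typescript
// Default model named in this document.
const DEFAULT_MODEL = "doubao-seed-1-6-vision-250815";

interface ConfigFlags {
  baseUrl?: string; // --base-url
  apiKey?: string;  // --api-key
  model?: string;   // --model
}

interface ModelConfig {
  baseUrl?: string;
  apiKey?: string;
  model: string;
}

// Assumed precedence: flag > ARK_* environment variable > default model.
function resolveModelConfig(flags: ConfigFlags, env: Record<string, string | undefined>): ModelConfig {
  return {
    baseUrl: flags.baseUrl ?? env.ARK_BASE_URL,
    apiKey: flags.apiKey ?? env.ARK_API_KEY,
    model: flags.model ?? env.ARK_MODEL_NAME ?? DEFAULT_MODEL,
  };
}
```

In a Node process you would call `resolveModelConfig(parsedFlags, process.env)`. Only the model has a documented default, so a missing base URL or API key stays `undefined` and should be reported before any request is made.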

Script

Path: `scripts/ai_vision.ts`

Run with:
```bash
npx tsx scripts/ai_vision.ts --help
```
To troubleshoot the raw model response, raise the log level:
```bash
npx tsx scripts/ai_vision.ts --log-level debug <command> [flags]
```
Output formatting:
  • When `--log-json` is set, logs are emitted as JSON.
  • Otherwise, the final result is pretty-printed JSON, and logs are colorized when a TTY is available.
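Because the final result is JSON, a calling program can parse it directly; pretty-printing only adds whitespace, which `JSON.parse` ignores. This sketch assumes that stdout carries only the result and that logs go to stderr, which this document does not state, so verify it against your version before relying on it. `parseAiVisionResult` is a hypothetical helper.

```typescript
// Parse the CLI's final result from captured stdout.
// Assumption: stdout holds exactly one (possibly pretty-printed) JSON document;
// if logs are interleaved on stdout, this throws instead of guessing.
function parseAiVisionResult(stdout: string): unknown {
  const text = stdout.trim();
  if (text === "") {
    throw new Error("ai_vision produced no output on stdout");
  }
  try {
    return JSON.parse(text);
  } catch (err) {
    throw new Error(`stdout is not a single JSON document: ${(err as Error).message}`);
  }
}
```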

AIQuery

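Ask a question about the screenshot, for example to locate an element and return its coordinates:
```bash
# Prompt: "Identify the 'Search' button on the screen and return its coordinates."
npx tsx scripts/ai_vision.ts query \
  --screenshot ~/.eval/screenshots/ui_YYYYMMDD_HHMMSS.png \
  --prompt "请识别屏幕上的‘搜索’按钮,并返回其坐标"
```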

AIAssert

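Check whether a statement about the screenshot holds:
```bash
# Prompt: "The current page contains a search box."
npx tsx scripts/ai_vision.ts assert \
  --screenshot ~/.eval/screenshots/ui_YYYYMMDD_HHMMSS.png \
  --prompt "当前页面包含搜索框"
```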
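UI transitions are not instantaneous, so an assertion on a single screenshot can fail simply because the page has not finished loading. A common pattern is to re-capture and re-assert a few times. In this sketch the capture and assert steps are injected as hypothetical callbacks, because the shape of the `assert` output is not documented here; `runAssert` is assumed to reduce it to a boolean.

```typescript
// Hypothetical hooks: take a fresh screenshot (returning its path) and run
// `ai_vision.ts assert` on it, reduced to true/false by the caller.
type Capture = () => Promise<string>;
type RunAssert = (screenshot: string, prompt: string) => Promise<boolean>;

// Retry the assertion on fresh screenshots until it passes or attempts run out.
async function waitForAssertion(
  prompt: string,
  capture: Capture,
  runAssert: RunAssert,
  { attempts = 5, delayMs = 1000 } = {},
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    const shot = await capture();
    if (await runAssert(shot, prompt)) return true;
    // Wait before the next capture, but not after the final attempt.
    if (i < attempts - 1) await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```

Each attempt costs one model call, so keep `attempts` small and prefer a longer `delayMs` for slow-loading pages.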

plan-next (single-step planning)

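Plan the single next action toward a goal:
```bash
# Prompt: "Tap the magnifier icon to open the search page."
npx tsx scripts/ai_vision.ts plan-next \
  --screenshot ~/.eval/screenshots/ui_YYYYMMDD_HHMMSS.png \
  --prompt "点击放大镜图标进入搜索页"
```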
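Since the CLI plans only one step at a time, the outer loop belongs to the agent, as the overview notes. A minimal sketch of that loop: the `Action` type and the `capture`, `planNext`, and `execute` hooks are hypothetical stand-ins for your screenshot command, a `plan-next` wrapper, and an adb/hdc driver, and the real `plan-next` output schema may differ.

```typescript
// Hypothetical normalized action; the real plan-next schema may differ.
type Action =
  | { type: "tap"; x: number; y: number }
  | { type: "done" };

interface LoopHooks {
  capture: () => Promise<string>;                                  // screenshot -> file path
  planNext: (screenshot: string, goal: string) => Promise<Action>; // wraps `plan-next`
  execute: (action: Action) => Promise<void>;                      // e.g. adb shell input tap X Y
}

// Screenshot -> plan-next -> execute, repeated until the planner reports
// "done" or the step budget runs out. Resolves to the number of executed actions.
async function runPlanLoop(goal: string, hooks: LoopHooks, maxSteps = 10): Promise<number> {
  for (let step = 0; step < maxSteps; step++) {
    const shot = await hooks.capture();
    const action = await hooks.planNext(shot, goal);
    if (action.type === "done") return step;
    await hooks.execute(action);
  }
  throw new Error(`goal not reached within ${maxSteps} steps`);
}
```

The step budget guards against a planner that never reports completion; a fresh screenshot is taken on every iteration so each plan sees the result of the previous action.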

Output Notes

  • `plan-next` returns a normalized next action with absolute pixel coordinates.
  • If the model outputs relative coordinates (on a 1000x1000 grid), the script scales them to screen pixels.
  • Combine the result with adb/hdc actions (e.g., `adb shell input tap X Y`) to control the device.
  • Use `--log-level debug` to print the raw model response for troubleshooting.
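The relative-to-absolute conversion is proportional scaling: a relative value r on the 1000-unit grid maps to round(r / 1000 * size) pixels. `plan-next` already does this for you; the sketch only illustrates the arithmetic and the matching adb argument list. The rounding and the clamp to the last pixel are assumptions, and `scalePoint` and `adbTapArgs` are hypothetical helpers.

```typescript
// Relative coordinate grid: the model may answer on 0..1000 per axis.
const RELATIVE_GRID = 1000;

// Map one relative coordinate to a pixel, clamped to [0, screenSize - 1].
// Rounding is an assumption; the script's own rounding may differ.
function toPixel(rel: number, screenSize: number): number {
  const px = Math.round((rel / RELATIVE_GRID) * screenSize);
  return Math.min(Math.max(px, 0), screenSize - 1);
}

function scalePoint(x: number, y: number, width: number, height: number): { x: number; y: number } {
  return { x: toPixel(x, width), y: toPixel(y, height) };
}

// Argv for `adb [-s SERIAL] shell input tap X Y`, suitable for execFile/spawn.
function adbTapArgs(x: number, y: number, serial?: string): string[] {
  const target = serial ? ["-s", serial] : [];
  return [...target, "shell", "input", "tap", String(x), String(y)];
}
```

For a 1080x2400 screen, the relative point (500, 250) becomes (540, 600), which corresponds to `adb shell input tap 540 600`.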

Default Models (Doubao)

  • doubao-seed-1-8-251228
  • doubao-seed-1-6-vision-250815

References

  • references/doubao-api.md