ai-vision
AI Vision
Overview
This skill provides a standalone CLI that calls multimodal models for UI querying, assertion, and single-step planning. It does not depend on device type: you supply a screenshot and receive structured output (coordinates, decisions, or next actions). Execution and multi-step loops are handled externally by agents using adb/hdc or other drivers. Prefer storing screenshots in `~/.eval/screenshots/` and adding timestamps to the filenames to avoid overwriting earlier captures.
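The timestamp convention can be sketched as a small TypeScript helper. This is only an illustration: `screenshotPath` is a hypothetical name, not part of the skill, and the `ui_YYYYMMDD_HHMMSS.png` pattern is taken from the example commands in this document.

```typescript
import { homedir } from "node:os";
import { join } from "node:path";

// Zero-pad a number to two digits, e.g. 7 -> "07".
const pad2 = (n: number): string => String(n).padStart(2, "0");

// Build <home>/.eval/screenshots/ui_YYYYMMDD_HHMMSS.png from a local-time Date.
// Hypothetical helper; the skill itself only expects a path to an existing image.
function screenshotPath(now: Date = new Date()): string {
  const date = `${now.getFullYear()}${pad2(now.getMonth() + 1)}${pad2(now.getDate())}`;
  const time = `${pad2(now.getHours())}${pad2(now.getMinutes())}${pad2(now.getSeconds())}`;
  return join(homedir(), ".eval", "screenshots", `ui_${date}_${time}.png`);
}
```

Second-level resolution is usually enough for single-step loops; if an agent captures more than one screenshot per second, a counter or millisecond suffix would be needed to keep names unique.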
Path Convention
Canonical install and execution directory: `~/.agents/skills/ai-vision/`. Run commands from this directory:
```bash
cd ~/.agents/skills/ai-vision
```
One-off (safe in scripts/loops from any working directory):
```bash
(cd ~/.agents/skills/ai-vision && npx tsx scripts/ai_vision.ts --help)
```
Model Configuration
Default Doubao configuration via environment variables:
- `ARK_BASE_URL` (e.g. `https://ark.cn-beijing.volces.com/api/v3`)
- `ARK_API_KEY`
- `ARK_MODEL_NAME`
For non-Doubao providers, pass explicit flags:
- `--base-url`, `--api-key`, `--model`
Default model if none provided: `doubao-seed-1-6-vision-250815`.
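One plausible way to read the configuration rules is as a three-level fallback: explicit flag, then environment variable, then the built-in default model. The sketch below illustrates that reading only. The precedence order is an assumption (the document does not state which wins when both a flag and a variable are set), and `resolveModelConfig`, `CliFlags`, and `ModelConfig` are hypothetical names rather than the script's actual internals.

```typescript
interface CliFlags {
  baseUrl?: string; // --base-url
  apiKey?: string;  // --api-key
  model?: string;   // --model
}

interface ModelConfig {
  baseUrl?: string;
  apiKey?: string;
  model: string;
}

// Default model named in this document.
const DEFAULT_MODEL = "doubao-seed-1-6-vision-250815";

// Assumed precedence: flag > environment variable > default (model only).
function resolveModelConfig(
  flags: CliFlags,
  env: Record<string, string | undefined>,
): ModelConfig {
  return {
    baseUrl: flags.baseUrl ?? env.ARK_BASE_URL,
    apiKey: flags.apiKey ?? env.ARK_API_KEY,
    model: flags.model ?? env.ARK_MODEL_NAME ?? DEFAULT_MODEL,
  };
}
```

In a real run, `env` would be `process.env`; passing it as a parameter keeps the function easy to test.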
Script
Path: `scripts/ai_vision.ts`
Run with:
```bash
npx tsx scripts/ai_vision.ts --help
```
Log level (for troubleshooting raw model response):
```bash
npx tsx scripts/ai_vision.ts --log-level debug <command> [flags]
```
Output formatting:
- When `--log-json` is set, logs are emitted as JSON.
- Otherwise, the final result is pretty-printed JSON, and logs are colorized when a TTY is available.
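Because execution loops live outside the skill, an agent written in TypeScript might wrap the CLI roughly as follows. This is a sketch under assumptions: `buildArgs` and `runVision` are hypothetical helpers, the subcommand names (`query`, `assert`, `plan-next`) and flags are the ones documented in this skill, and the output is returned as raw text because its exact schema is not specified here.

```typescript
import { execFileSync } from "node:child_process";
import { homedir } from "node:os";
import { join } from "node:path";

type VisionCommand = "query" | "assert" | "plan-next";

// Assemble the arguments that follow `npx`, mirroring the documented invocations.
function buildArgs(
  command: VisionCommand,
  screenshot: string,
  prompt: string,
  debug = false,
): string[] {
  const args = ["tsx", "scripts/ai_vision.ts"];
  if (debug) args.push("--log-level", "debug"); // global flag goes before the subcommand
  args.push(command, "--screenshot", screenshot, "--prompt", prompt);
  return args;
}

// Run the CLI from the canonical skill directory and return its stdout as text.
// No shell is involved, so `screenshot` must be an absolute path ("~" is not expanded).
function runVision(command: VisionCommand, screenshot: string, prompt: string): string {
  const skillDir = join(homedir(), ".agents", "skills", "ai-vision");
  return execFileSync("npx", buildArgs(command, screenshot, prompt), {
    cwd: skillDir,
    encoding: "utf8",
  });
}
```

Using `execFileSync` with an argument array avoids shell quoting problems when prompts contain quotes or non-ASCII text.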
AIQuery
```bash
# Prompt: "Identify the 'Search' button on the screen and return its coordinates"
npx tsx scripts/ai_vision.ts query \
  --screenshot ~/.eval/screenshots/ui_YYYYMMDD_HHMMSS.png \
  --prompt "请识别屏幕上的‘搜索’按钮,并返回其坐标"
```
AIAssert
```bash
# Prompt: "The current page contains a search box"
npx tsx scripts/ai_vision.ts assert \
  --screenshot ~/.eval/screenshots/ui_YYYYMMDD_HHMMSS.png \
  --prompt "当前页面包含搜索框"
```
plan-next (single-step planning)
```bash
# Prompt: "Tap the magnifier icon to enter the search page"
npx tsx scripts/ai_vision.ts plan-next \
  --screenshot ~/.eval/screenshots/ui_YYYYMMDD_HHMMSS.png \
  --prompt "点击放大镜图标进入搜索页"
```
Output Notes
- `plan-next` returns a normalized next action with absolute pixel coordinates.
- If the model outputs relative coordinates (on a 1000x1000 grid), the script scales them to screen pixels.
- Combine with adb/hdc actions (e.g., `adb shell input tap X Y`) for device control.
- Use `--log-level debug` to print the raw model response for troubleshooting.
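The relative-to-absolute scaling can be illustrated with a few lines of TypeScript. This is a sketch of the arithmetic, not the script's actual code: `scaleToPixels` and `adbTapCommand` are hypothetical names, and rounding to the nearest pixel is an assumption.

```typescript
interface Point {
  x: number;
  y: number;
}

// Map a point on the 1000x1000 relative grid to absolute screen pixels.
// Rounding to the nearest pixel is an assumption; the script may round differently.
function scaleToPixels(rel: Point, screenWidth: number, screenHeight: number): Point {
  return {
    x: Math.round((rel.x / 1000) * screenWidth),
    y: Math.round((rel.y / 1000) * screenHeight),
  };
}

// Format the adb tap command for a pixel coordinate.
function adbTapCommand(p: Point): string {
  return `adb shell input tap ${p.x} ${p.y}`;
}

// Example: relative (500, 250) on a 1080x2400 screen.
const tap = scaleToPixels({ x: 500, y: 250 }, 1080, 2400);
console.log(adbTapCommand(tap)); // adb shell input tap 540 600
```

Since `plan-next` is documented as already returning absolute pixels, an external agent would normally only need the second helper; the first is shown to make the conversion concrete.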
Default Models (Doubao)
- `doubao-seed-1-8-251228`
- `doubao-seed-1-6-vision-250815`
References
`references/doubao-api.md`