GLMV-Grounding Skill
Extract and visualize grounding results produced by GLM-V. Depending on the user prompt, grounding coordinates in model outputs may appear in different forms, including 2D bounding boxes, Objects Detection JSON, 2D points, 3D bounding boxes, and target-tracking JSON.
Note: GLM-V outputs coordinates where x and y are relative coordinates normalized from pixel coordinates x_pixel and y_pixel using image width W and height H (range 0-1000), i.e., x = round(x_pixel / W * 1000), y = round(y_pixel / H * 1000). The origin of the pixel coordinate system is the top-left corner.
Note: If the prompt does not explicitly specify a grounding format (for example, "find the location of xxx" or "draw a box around xxx"), treat the request as 2D bounding boxes by default.
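The normalization rule can be sketched as a pair of Python helpers (`to_relative` and `to_pixel` are illustrative names, not functions shipped with this skill):

```python
def to_relative(x_pixel: int, y_pixel: int, width: int, height: int) -> tuple:
    """Normalize pixel coordinates to GLM-V's 0-1000 relative range."""
    return round(x_pixel / width * 1000), round(y_pixel / height * 1000)


def to_pixel(x_rel: int, y_rel: int, width: int, height: int) -> tuple:
    """Map 0-1000 relative coordinates back to pixels, e.g. for drawing."""
    return round(x_rel / 1000 * width), round(y_rel / 1000 * height)


# The center of a 1920x1080 image maps to (500, 500) in relative coordinates.
print(to_relative(960, 540, 1920, 1080))  # → (500, 500)
print(to_pixel(500, 500, 1920, 1080))     # → (960, 540)
```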
When to use
- Use GLM-V to ground targets in images: obtain grounding results in an image for any prompt-described target, with output formats such as 2D bounding box (default), 2D points, and 3D bounding box.
- Use GLM-V to track targets in videos: obtain tracking results in a video for any prompt-described target, with output format like {"0": [{"label": ..., "bbox_2d": ...}, ...], ...}.
- Use utility functions for extraction, conversion, and visualization: extract coordinates, points, and JSON from natural text; normalize and de-normalize coordinates; visualize boxes, points, 3D boxes, and video tracking results.
Setup your API Key
Configure ZHIPU_API_KEY to call the GLM-V API.
- Get your API key: https://www.bigmodel.cn/usercenter/proj-mgmt/apikeys
- Configure it with:
```bash
python scripts/config_setup.py setup --api-key YOUR_KEY
```
Security & Transparency
- Primary API key env: `ZHIPU_API_KEY` (required).
- Timeout env: `GLM_GROUNDING_TIMEOUT` (optional, seconds, default 60).
- API endpoint: fixed to the official Zhipu Chat Completions endpoint in the CLI implementation.
- No dynamic key name switching: the skill expects `ZHIPU_API_KEY` consistently.
- URL/local file handling: the skill can read local files or fetch user-provided URLs for processing/visualization; URL inputs are restricted to public http/https targets (localhost/private-network targets are rejected).
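As an illustration of the URL restriction described above, here is a minimal sketch of how such a check is commonly implemented (this is not the skill's actual code; `is_allowed_url` is a hypothetical name):

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_allowed_url(url: str) -> bool:
    """Accept only public http/https URLs; reject loopback and private hosts."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        # Resolve the hostname and inspect every address it maps to.
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_loopback or addr.is_private or addr.is_link_local:
            return False
    return True
```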
Runtime Dependencies
Install dependencies before use:
```bash
pip install -r scripts/requirements.txt
```
Main packages used by this skill: `requests`, `Pillow`, `opencv-python`, `numpy`, `matplotlib`, `decord`.
System dependency for video visualization: `ffmpeg`.
General workflow
Input (image or video + Prompt)
|
▼
Run glm_grounding_cli.py to get grounding results (natural language)
|
▼
Return results (grounding results, visualized image or video)
How to Use
Run glm_grounding_cli.py to get grounding results
- Ground any target in an image
```bash
python scripts/glm_grounding_cli.py --image-url "URL provided by user" --prompt "description of target for grounding"
```
- Track any target in a video
```bash
python scripts/glm_grounding_cli.py --video-url /path/to/video.mp4 --prompt "description of target for tracking" --visualize --visualization-dir "./vis"
```
Reply with grounding results
After receiving a grounding prompt from the user, your direct reply should be natural language that includes grounding coordinates. Coordinates $x$ and $y$ are relative values in [0, 1000], computed as:
$$
x = \mathrm{round}(x_{pixel} / W \times 1000) \\
y = \mathrm{round}(y_{pixel} / H \times 1000)
$$
where $x_{pixel}, y_{pixel}$ are pixel coordinates with origin (0, 0) at the top-left corner of the image, and W/H are the image width/height.
Unless otherwise specified, grounding results should use the following Python data formats:
- 2D bounding boxes: `[[x1, y1, x2, y2], ...]`; the extracted grounding result is a list of boxes, each box has 4 coordinate values
- 2D points: `[[x, y], ...]`; the extracted grounding result is a list of points, each point has 2 coordinate values
- 2D polygon: `[[x1, y1], [x2, y2], ...]`; the extracted grounding result is a polygon coordinate list, each vertex has 2 coordinate values
- 3D bounding boxes: `[{"bbox_3d": [x_center, y_center, z_center, x_size, y_size, z_size, roll, pitch, yaw], "label": "category"}, ...]`; the extracted grounding result is a JSON list where each object contains a category label and one 3D box with 9 coordinate values
- Objects Detection JSON: `[{'label': 'category', 'bbox_2d': [x1, y1, x2, y2]}, ...]`; the extracted grounding result is a JSON list where each object contains a category label and one box
- Video Objects Tracking JSON: `{0: [{'label': 'car-1', 'bbox_2d': [1, 2, 3, 4]}, {'label': 'car-2', 'bbox_2d': [2, 3, 4, 5]}], 1: [{'label': 'car-2', 'bbox_2d': [4, 5, 6, 7]}, {'label': 'person-1', 'bbox_2d': [10, 20, 30, 40]}]}`; the extracted grounding result is a JSON object whose keys are video frame indices and whose values are lists of objects, each containing a category label and one 2D box
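As an illustration of the default 2D-box format, a simple regex is enough to pull boxes out of a natural-language reply (the skill ships its own, more complete parsers in `scripts/`; `extract_boxes` is a hypothetical name):

```python
import re


def extract_boxes(text: str) -> list:
    """Extract every [x1, y1, x2, y2] group of four integers from free text."""
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    return [[int(v) for v in match] for match in re.findall(pattern, text)]


reply = ("1. Person 1: box [100, 200, 300, 400]\n"
         "2. Person 2: box [500, 600, 700, 800].")
print(extract_boxes(reply))  # → [[100, 200, 300, 400], [500, 600, 700, 800]]
```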
Python example
1. User grounding request and your reply
```shell
image=https://example.com/image.jpg
prompt="Please box all people wearing Santa hats in the image and tell me their coordinates. Use red boxes, line thickness 3, and label format 'SantaHat-i'."
```
2. Get grounding results
```shell
python scripts/glm_grounding_cli.py --image-url "$image" --prompt "$prompt" --visualize --visualization-dir "./vis"
```
```python
{
    "ok": True,
    "grounding_result": [[100, 200, 300, 400], [500, 600, 700, 800]],
    "visualizations_result": {"visualized_image": "./vis/image_vis.jpg"},
    "raw_result": "1. Person 1: box [100, 200, 300, 400]\n2. Person 2: box [500, 600, 700, 800]. The box format is [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.",
    "error": None,
    "source": source,
}
```
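A caller can consume such a result by checking `ok` before reading `grounding_result` (a sketch, with a hard-coded dict standing in for the CLI output):

```python
# Hard-coded stand-in for the dict returned by glm_grounding_cli.py.
result = {
    "ok": True,
    "grounding_result": [[100, 200, 300, 400], [500, 600, 700, 800]],
    "visualizations_result": {"visualized_image": "./vis/image_vis.jpg"},
    "error": None,
}

if result["ok"]:
    # Report each box in the label format the example prompt requested.
    for i, (x1, y1, x2, y2) in enumerate(result["grounding_result"], start=1):
        print(f"SantaHat-{i}: top-left ({x1}, {y1}), bottom-right ({x2}, {y2})")
else:
    print(f"Grounding failed: {result['error']}")
```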
Utility function quick reference
| Function | Purpose |
|---|---|
|  | Parse and extract all coordinate results from model responses (supports 2D bbox, point, polygon) |
|  | Parse and extract all 3D boxes and labels from model responses (strict and loose matching) |
|  | Parse and extract all 2D detection results from model responses (Objects Detection JSON format) |
|  | Parse and extract all video object tracking results from model responses (Video Objects Tracking JSON format) |
|  | Draw 2D boxes on images with labels, custom colors, and line thickness |
|  | Draw points on images with labels, custom size, and colors |
|  | Draw projected 3D boxes on images using camera intrinsics (supports rotation and multiple coordinate formats) |
|  | Draw Video Objects Tracking boxes on each video frame with labels |
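For instance, the 2D-box drawing utility in the table can be approximated with Pillow (a sketch assuming boxes arrive in the 0-1000 relative format; `draw_boxes` is an illustrative name, not the skill's actual function):

```python
from PIL import Image, ImageDraw


def draw_boxes(image_path, boxes, out_path, color="red", width=3, label="box"):
    """Denormalize 0-1000 relative boxes to pixels and draw them with labels."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    draw = ImageDraw.Draw(img)
    for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        px = (round(x1 / 1000 * w), round(y1 / 1000 * h),
              round(x2 / 1000 * w), round(y2 / 1000 * h))
        draw.rectangle(px, outline=color, width=width)
        # Place the label just above the top-left corner when there is room.
        draw.text((px[0], max(px[1] - 12, 0)), f"{label}-{i}", fill=color)
    img.save(out_path)
```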
Common errors
- Coordinate values exceed 1000: if extracted coordinate values are greater than 1000, the model may have produced unnormalized coordinates due to prompt effects. Extract the target phrase from the user request (for example, "people wearing Santa hats"), then query the model again and explicitly require output coordinates to be relative values normalized to 0-1000 based on image size (for example, "Please box all people wearing Santa hats in the image and tell me their coordinates. Ensure the output coordinates are relative values normalized to 0-1000 based on image size.").
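Re-querying the model as described above is the recommended fix. When the original image dimensions are known, a local fallback is also possible (a sketch; `looks_unnormalized` and `renormalize` are illustrative helpers, and it assumes the oversized values are plain pixel coordinates):

```python
def looks_unnormalized(boxes) -> bool:
    """True if any coordinate falls outside the 0-1000 relative range."""
    return any(v > 1000 for box in boxes for v in box)


def renormalize(boxes, width, height):
    """Convert pixel-space [x1, y1, x2, y2] boxes to 0-1000 relative boxes."""
    return [[round(x1 / width * 1000), round(y1 / height * 1000),
             round(x2 / width * 1000), round(y2 / height * 1000)]
            for x1, y1, x2, y2 in boxes]


boxes = [[192, 216, 1344, 972]]  # pixel coordinates from a 1920x1080 image
if looks_unnormalized(boxes):
    boxes = renormalize(boxes, 1920, 1080)
print(boxes)  # → [[100, 200, 700, 900]]
```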