silicon-paddle-ocr

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

OCR - Image Text Recognition

OCR - 图片文字识别

Use PaddleOCR to extract text content from images. Supports single image or batch processing.

使用PaddleOCR从图片中提取文本内容，支持单张图片或批量处理。

Overview

概述

This skill provides optical character recognition (OCR) capabilities using the PaddlePaddle/PaddleOCR-VL-1.5 model via the SiliconFlow API. Extract text from JPG, PNG, WebP, BMP, and GIF images.

本技能通过SiliconFlow API调用PaddlePaddle/PaddleOCR-VL-1.5模型，提供光学字符识别（OCR）能力。可从JPG、PNG、WebP、BMP和GIF格式的图片中提取文字。

When to Use

使用场景

Invoke this skill when:

User wants to extract text from an image
User asks to OCR a screenshot or photo
User needs to read text from an image file
User mentions text recognition from images

在以下情况调用本技能：

用户需要从图片中提取文字
用户要求对截图或照片进行OCR识别
用户需要读取图片文件中的文字
用户提及图片文字识别

How to Use

使用方法

Prerequisites

前置条件

Ensure the

SILICONFLOW_API_KEY

environment variable is set:

bash

export SILICONFLOW_API_KEY="your_api_key"

确保已设置

SILICONFLOW_API_KEY

环境变量：

bash

export SILICONFLOW_API_KEY="your_api_key"

Basic Usage

基础用法

Execute the OCR script:

bash

python3 scripts/ocr_skill.py [options] image_path

执行OCR脚本：

bash

python3 scripts/ocr_skill.py [options] image_path

Arguments

参数说明

Argument	Description
`images`	Image file path(s) or glob pattern (required)
`-k, --api-key`	API key (default: from SILICONFLOW_API_KEY env)
`-m, --model`	OCR model name (default: PaddlePaddle/PaddleOCR-VL-1.5)
`-p, --prompt`	Recognition prompt for custom behavior
`-j, --json`	Output results in JSON format
`-o, --output`	Save results to specified file
`--max-tokens`	Maximum tokens in response (default: 2000)

参数	描述
`images`	图片文件路径或通配符模式（必填）
`-k, --api-key`	API密钥（默认：从SILICONFLOW_API_KEY环境变量读取）
`-m, --model`	OCR模型名称（默认：PaddlePaddle/PaddleOCR-VL-1.5）
`-p, --prompt`	用于自定义识别行为的提示词
`-j, --json`	以JSON格式输出结果
`-o, --output`	将结果保存到指定文件
`--max-tokens`	响应的最大令牌数（默认：2000）

Examples

示例

Single image:

bash

python3 scripts/ocr_skill.py /path/to/image.jpg

Multiple images with glob:

bash

python3 scripts/ocr_skill.py /path/to/images/*.png

JSON output format:

bash

python3 scripts/ocr_skill.py --json /path/to/image.jpg

Custom prompt for table extraction:

bash

python3 scripts/ocr_skill.py -p "Please identify and format table content as Markdown" /path/to/table.jpg

Save to file:

bash

python3 scripts/ocr_skill.py --json --output results.json /path/to/images/*.jpg

单张图片识别：

bash

python3 scripts/ocr_skill.py /path/to/image.jpg

通配符批量处理图片：

bash

python3 scripts/ocr_skill.py /path/to/images/*.png

JSON格式输出：

bash

python3 scripts/ocr_skill.py --json /path/to/image.jpg

自定义提示词提取表格内容：

bash

python3 scripts/ocr_skill.py -p "请识别表格内容并格式化为Markdown格式" /path/to/table.jpg

保存结果到文件：

bash

python3 scripts/ocr_skill.py --json --output results.json /path/to/images/*.jpg

Output Format

输出格式

Text output (default):

--- image.jpg ---
识别到的文字内容
识别到 X 处文字区域

JSON output:

json

{
  "image.jpg": {
    "image_path": "/path/to/image.jpg",
    "image_size": [width, height],
    "texts": [
      {
        "text": "识别的文字",
        "box": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
      }
    ],
    "full_text": "所有文本的组合"
  },
  "image2.png": { ... }
}

Coordinates Explanation:

LOC values are normalized coordinates converted to pixel coordinates
Conversion: pixel = LOC × (image_size / LOC_max_value)
LOC max_value is approximately 972 (may vary by model/image)
The
```
box
```
field provides the four corner coordinates of each text region in pixel format

文本输出（默认）：

--- image.jpg ---
识别到的文字内容
识别到 X 处文字区域

JSON输出：

json

{
  "image.jpg": {
    "image_path": "/path/to/image.jpg",
    "image_size": [width, height],
    "texts": [
      {
        "text": "识别的文字",
        "box": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
      }
    ],
    "full_text": "所有文本的组合"
  },
  "image2.png": { ... }
}

坐标说明：

LOC值为归一化坐标，已转换为像素坐标
转换公式：像素坐标 = LOC × (图片尺寸 / LOC最大值)
LOC最大值约为972（可能因模型/图片而异）
```
box
```
字段提供每个文字区域的四个角点像素坐标

Supported Image Formats

支持的图片格式

JPG/JPEG
PNG
WebP
BMP
GIF

JPG/JPEG
PNG
WebP
BMP
GIF

Error Handling

错误处理

If processing fails:

Check that the image file exists
Verify the SILICONFLOW_API_KEY is valid
Ensure the API endpoint is reachable

Images that fail to process will show an error message, and other images will continue processing.

若处理失败：

检查图片文件是否存在
验证SILICONFLOW_API_KEY是否有效
确保API端点可访问

处理失败的图片会显示错误信息，其他图片将继续处理。

Additional Resources

额外资源

Reference Files

参考文件

references/api-configuration.md
- API configuration details

references/api-configuration.md
- API配置详情

Example Files

示例文件

examples/sample-usage.sh
- Example usage script

examples/sample-usage.sh
- 示例使用脚本

Scripts

脚本文件

scripts/ocr_skill.py
- The main OCR implementation

scripts/ocr_skill.py
- OCR核心实现脚本