silicon-paddle-ocr

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

OCR - Image Text Recognition

OCR - 图片文字识别

Use PaddleOCR to extract text content from images. Supports single image or batch processing.
使用PaddleOCR从图片中提取文本内容,支持单张图片或批量处理。

Overview

概述

This skill provides optical character recognition (OCR) capabilities using the PaddlePaddle/PaddleOCR-VL-1.5 model via the SiliconFlow API. Extract text from JPG, PNG, WebP, BMP, and GIF images.
本技能通过SiliconFlow API调用PaddlePaddle/PaddleOCR-VL-1.5模型,提供光学字符识别(OCR)能力。可从JPG、PNG、WebP、BMP和GIF格式的图片中提取文字。

When to Use

使用场景

Invoke this skill when:
  • User wants to extract text from an image
  • User asks to OCR a screenshot or photo
  • User needs to read text from an image file
  • User mentions text recognition from images
在以下情况调用本技能:
  • 用户需要从图片中提取文字
  • 用户要求对截图或照片进行OCR识别
  • 用户需要读取图片文件中的文字
  • 用户提及图片文字识别

How to Use

使用方法

Prerequisites

前置条件

Ensure the
SILICONFLOW_API_KEY
environment variable is set:
bash
export SILICONFLOW_API_KEY="your_api_key"
确保已设置
SILICONFLOW_API_KEY
环境变量:
bash
export SILICONFLOW_API_KEY="your_api_key"

Basic Usage

基础用法

Execute the OCR script:
bash
python3 scripts/ocr_skill.py [options] image_path
执行OCR脚本:
bash
python3 scripts/ocr_skill.py [options] image_path

Arguments

参数说明

ArgumentDescription
images
Image file path(s) or glob pattern (required)
-k, --api-key
API key (default: from SILICONFLOW_API_KEY env)
-m, --model
OCR model name (default: PaddlePaddle/PaddleOCR-VL-1.5)
-p, --prompt
Recognition prompt for custom behavior
-j, --json
Output results in JSON format
-o, --output
Save results to specified file
--max-tokens
Maximum tokens in response (default: 2000)
参数描述
images
图片文件路径或通配符模式(必填)
-k, --api-key
API密钥(默认:从SILICONFLOW_API_KEY环境变量读取)
-m, --model
OCR模型名称(默认:PaddlePaddle/PaddleOCR-VL-1.5)
-p, --prompt
用于自定义识别行为的提示词
-j, --json
以JSON格式输出结果
-o, --output
将结果保存到指定文件
--max-tokens
响应的最大令牌数(默认:2000)

Examples

示例

Single image:
bash
python3 scripts/ocr_skill.py /path/to/image.jpg
Multiple images with glob:
bash
python3 scripts/ocr_skill.py /path/to/images/*.png
JSON output format:
bash
python3 scripts/ocr_skill.py --json /path/to/image.jpg
Custom prompt for table extraction:
bash
python3 scripts/ocr_skill.py -p "Please identify and format table content as Markdown" /path/to/table.jpg
Save to file:
bash
python3 scripts/ocr_skill.py --json --output results.json /path/to/images/*.jpg
单张图片识别:
bash
python3 scripts/ocr_skill.py /path/to/image.jpg
通配符批量处理图片:
bash
python3 scripts/ocr_skill.py /path/to/images/*.png
JSON格式输出:
bash
python3 scripts/ocr_skill.py --json /path/to/image.jpg
自定义提示词提取表格内容:
bash
python3 scripts/ocr_skill.py -p "请识别表格内容并格式化为Markdown格式" /path/to/table.jpg
保存结果到文件:
bash
python3 scripts/ocr_skill.py --json --output results.json /path/to/images/*.jpg

Output Format

输出格式

Text output (default):
--- image.jpg ---
识别到的文字内容
识别到 X 处文字区域
JSON output:
json
{
  "image.jpg": {
    "image_path": "/path/to/image.jpg",
    "image_size": [width, height],
    "texts": [
      {
        "text": "识别的文字",
        "box": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
      }
    ],
    "full_text": "所有文本的组合"
  },
  "image2.png": { ... }
}
Coordinates Explanation:
  • LOC values are normalized coordinates converted to pixel coordinates
  • Conversion: pixel = LOC × (image_size / LOC_max_value)
  • LOC max_value is approximately 972 (may vary by model/image)
  • The
    box
    field provides the four corner coordinates of each text region in pixel format
文本输出(默认):
--- image.jpg ---
识别到的文字内容
识别到 X 处文字区域
JSON输出
json
{
  "image.jpg": {
    "image_path": "/path/to/image.jpg",
    "image_size": [width, height],
    "texts": [
      {
        "text": "识别的文字",
        "box": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
      }
    ],
    "full_text": "所有文本的组合"
  },
  "image2.png": { ... }
}
坐标说明
  • LOC值为归一化坐标,已转换为像素坐标
  • 转换公式:像素坐标 = LOC × (图片尺寸 / LOC最大值)
  • LOC最大值约为972(可能因模型/图片而异)
  • box
    字段提供每个文字区域的四个角点像素坐标

Supported Image Formats

支持的图片格式

  • JPG/JPEG
  • PNG
  • WebP
  • BMP
  • GIF
  • JPG/JPEG
  • PNG
  • WebP
  • BMP
  • GIF

Error Handling

错误处理

If processing fails:
  • Check that the image file exists
  • Verify the SILICONFLOW_API_KEY is valid
  • Ensure the API endpoint is reachable
Images that fail to process will show an error message, and other images will continue processing.
若处理失败:
  • 检查图片文件是否存在
  • 验证SILICONFLOW_API_KEY是否有效
  • 确保API端点可访问
处理失败的图片会显示错误信息,其他图片将继续处理。

Additional Resources

额外资源

Reference Files

参考文件

  • references/api-configuration.md
    - API configuration details
  • references/api-configuration.md
    - API配置详情

Example Files

示例文件

  • examples/sample-usage.sh
    - Example usage script
  • examples/sample-usage.sh
    - 示例使用脚本

Scripts

脚本文件

  • scripts/ocr_skill.py
    - The main OCR implementation
  • scripts/ocr_skill.py
    - OCR核心实现脚本