deepseek-ocr
Skill by ara.so — Daily 2026 Skills collection.
DeepSeek-OCR is a vision-language model for Optical Character Recognition with "Contexts Optical Compression." It supports native and dynamic resolutions, multiple prompt modes (document-to-markdown, free OCR, figure parsing, grounding), and can be run via vLLM (high-throughput) or HuggingFace Transformers. It processes images and PDFs, outputting structured text or markdown.
Installation
Prerequisites
- CUDA 11.8+, PyTorch 2.6.0
- Python 3.12.9 (via conda recommended)
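A quick interpreter check before creating the environment can save a failed build later. This stdlib sketch accepts any 3.12.x patch release; the `python_ok` helper is illustrative, not part of the repo:

```python
import sys

def python_ok(version_info=None) -> bool:
    # The repo pins Python 3.12.9; any 3.12.x patch release should match
    vi = sys.version_info if version_info is None else version_info
    return (vi[0], vi[1]) == (3, 12)

print(python_ok())
```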
Setup
```bash
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

# Install PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu118

# Download the vllm-0.8.5 wheel from
# https://github.com/vllm-project/vllm/releases/tag/v0.8.5
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
```
Alternative: upstream vLLM (nightly)
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```
Model Download
The model is available on HuggingFace as deepseek-ai/DeepSeek-OCR:
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="deepseek-ai/DeepSeek-OCR")
```
Inference: vLLM (Recommended for Production)
Single Image — Streaming
```python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)

image = Image.open("document.png").convert("RGB")
prompt = "<image>\nFree OCR."

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td> for table support
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
Batch Images
```python
from PIL import Image
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)

image_paths = ["page1.png", "page2.png", "page3.png"]
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
model_input = [
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open(p).convert("RGB")},
    }
    for p in image_paths
]

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(model_input, sampling_params)
for path, output in zip(image_paths, outputs):
    print(f"=== {path} ===")
    print(output.outputs[0].text)
```
PDF Processing (via vLLM scripts)
```bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
# Edit config.py first: set INPUT_PATH, OUTPUT_PATH, model path, etc.
python run_dpsk_ocr_pdf.py  # ~2500 tokens/s on A100-40G
```
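The quoted throughput gives a rough way to budget long jobs. A back-of-envelope sketch, where `tokens_per_page` is an assumed average (actual density varies widely by document):

```python
def eta_seconds(n_pages: int, tokens_per_page: int = 1000,
                tokens_per_sec: float = 2500.0) -> float:
    # 2500 tok/s is the A100-40G figure quoted above; tokens_per_page
    # is an assumption, not a measured number
    return n_pages * tokens_per_page / tokens_per_sec

print(eta_seconds(500))  # 500 pages -> 200.0 s under these assumptions
```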
Benchmark Evaluation
```bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
python run_dpsk_ocr_eval_batch.py
```
Inference: HuggingFace Transformers
```python
import os

import torch
from transformers import AutoModel, AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)
```
Document to markdown
```python
res = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown. ",
    image_file="document.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,
)
print(res)
```
Transformers Script
```bash
cd DeepSeek-OCR-master/DeepSeek-OCR-hf
python run_dpsk_ocr.py
```
Prompt Reference
| Use Case | Prompt |
|---|---|
| Document → Markdown | `<image>\n<\|grounding\|>Convert the document to markdown. ` |
| General OCR | `<image>\n<\|grounding\|>OCR this image. ` |
| Free OCR (no layout) | `<image>\nFree OCR. ` |
| Parse figure/chart | `<image>\nParse the figure. ` |
| General description | `<image>\nDescribe this image in detail. ` |
| Grounded REC | `<image>\nLocate <\|ref\|>{target}<\|/ref\|> in the image. ` |

```python
PROMPTS = {
    "document_markdown": "<image>\n<|grounding|>Convert the document to markdown. ",
    "ocr_image": "<image>\n<|grounding|>OCR this image. ",
    "free_ocr": "<image>\nFree OCR. ",
    "parse_figure": "<image>\nParse the figure. ",
    "describe": "<image>\nDescribe this image in detail. ",
    "rec": "<image>\nLocate <|ref|>{target}<|/ref|> in the image. ",
}
```
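Only the `rec` entry is a template; its `{target}` slot is filled with plain `str.format`. A minimal sketch (the two-entry `PROMPTS` subset is copied from the reference above; `make_rec_prompt` is a hypothetical helper, not part of the model API):

```python
PROMPTS = {
    "free_ocr": "<image>\nFree OCR. ",
    "rec": "<image>\nLocate <|ref|>{target}<|/ref|> in the image. ",
}

def make_rec_prompt(target: str) -> str:
    # Fill the {target} placeholder in the REC template
    return PROMPTS["rec"].format(target=target)

print(make_rec_prompt("Total Amount"))
```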
Supported Resolutions
| Mode | Resolution | Vision Tokens |
|---|---|---|
| Tiny | 512×512 | 64 |
| Small | 640×640 | 100 |
| Base | 1024×1024 | 256 |
| Large | 1280×1280 | 400 |
| Gundam (dynamic) | n×640×640 + 1×1024×1024 | variable |
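When budgeting context length, note that the fixed modes above are consistent with one vision token per 64×64 tile: (512/64)² = 64, (640/64)² = 100, and so on. The helpers below are illustrative estimates derived from the table, not an official formula:

```python
def vision_tokens(side: int) -> int:
    # One token per 64x64 tile, consistent with the table above
    if side % 64:
        raise ValueError("side must be a multiple of 64")
    return (side // 64) ** 2

def gundam_tokens(n_crops: int) -> int:
    # Gundam mode: n local 640x640 crops plus one global 1024x1024 view
    return n_crops * vision_tokens(640) + vision_tokens(1024)

print(vision_tokens(1280), gundam_tokens(4))  # 400 656
```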
Transformers: control resolution via infer() params
```python
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="image.jpg",
    base_size=1024,   # 512, 640, 1024, or 1280
    image_size=640,   # patch size for dynamic mode
    crop_mode=True,   # True = Gundam dynamic resolution
)
```
---
Configuration (vLLM)
Edit DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py.

Key config fields (example)
```python
MODEL_PATH = "deepseek-ai/DeepSeek-OCR"  # or local path
INPUT_PATH = "/data/input_images/"
OUTPUT_PATH = "/data/output/"
TENSOR_PARALLEL_SIZE = 1  # GPUs for tensor parallelism
MAX_TOKENS = 8192
TEMPERATURE = 0.0
NGRAM_SIZE = 30
WINDOW_SIZE = 90
```
---
Common Patterns
Process a Directory of Images
```python
from pathlib import Path

from PIL import Image
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor


def batch_ocr(image_dir: str, output_dir: str, prompt: str = "<image>\nFree OCR."):
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )
    image_files = list(Path(image_dir).glob("*.png")) + list(Path(image_dir).glob("*.jpg"))
    inputs = [
        {"prompt": prompt, "multi_modal_data": {"image": Image.open(f).convert("RGB")}}
        for f in image_files
    ]
    outputs = llm.generate(inputs, sampling_params)
    for img_path, output in zip(image_files, outputs):
        out_file = Path(output_dir) / (img_path.stem + ".txt")
        out_file.write_text(output.outputs[0].text)
        print(f"Saved: {out_file}")


batch_ocr("/data/scans/", "/data/results/")
```
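The two `glob()` calls above miss uppercase extensions such as `.JPG` and give platform-dependent ordering. A small stdlib variation that filters case-insensitively and sorts (`list_images` is an illustrative helper, not part of the repo):

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg"}

def list_images(image_dir: str) -> list[Path]:
    # Case-insensitive extension filter, sorted for deterministic batch order
    return sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_EXTS
    )
```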
Convert PDF Pages to Markdown
```python
from io import BytesIO

import fitz  # PyMuPDF
from PIL import Image
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor


def pdf_to_markdown(pdf_path: str) -> list[str]:
    doc = fitz.open(pdf_path)
    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )
    prompt = "<image>\n<|grounding|>Convert the document to markdown. "
    inputs = []
    for page in doc:
        pix = page.get_pixmap(dpi=150)
        img = Image.open(BytesIO(pix.tobytes("png"))).convert("RGB")
        inputs.append({"prompt": prompt, "multi_modal_data": {"image": img}})
    outputs = llm.generate(inputs, sampling_params)
    return [o.outputs[0].text for o in outputs]


pages = pdf_to_markdown("report.pdf")
full_markdown = "\n\n---\n\n".join(pages)
print(full_markdown)
```
Grounded Text Location (REC)
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
).eval().cuda().to(torch.bfloat16)

target = "Total Amount"
prompt = f"<image>\nLocate <|ref|>{target}<|/ref|> in the image. "
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="invoice.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=False,
    save_results=True,
)
print(res)  # returns bounding box / location info
```
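Grounded runs embed locations inline in the output text. A regex sketch for pulling them out, assuming spans shaped like `<|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>` (the tag layout is inferred from the prompts above; verify the exact format, and whether coordinates are normalized, against real output):

```python
import re

DET = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|>\s*<\|det\|>"
    r"\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]<\|/det\|>"
)

def parse_grounding(text: str):
    # Returns (label, (x1, y1, x2, y2)) pairs for each grounded span
    return [
        (m.group(1), tuple(int(g) for g in m.groups()[1:]))
        for m in DET.finditer(text)
    ]

sample = "<|ref|>Total Amount<|/ref|><|det|>[[120, 45, 300, 80]]<|/det|>"
print(parse_grounding(sample))  # [('Total Amount', (120, 45, 300, 80))]
```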
Troubleshooting
transformers version conflict with vLLM
vLLM 0.8.5 requires transformers>=4.51.1. If you run both stacks in the same environment, this dependency-conflict error is safe to ignore, per the project docs.

Flash Attention build errors
```bash
# Ensure torch is installed before building flash-attn
pip install flash-attn==2.7.3 --no-build-isolation
```
CUDA out of memory
- Use a smaller resolution: base_size=512 or base_size=640
- Set crop_mode=False to avoid multi-crop dynamic resolution
- Reduce the batch size in vLLM inputs
Model output is garbled / repetitive
Ensure NGramPerReqLogitsProcessor is passed to LLM; it is required for proper decoding:
```python
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

llm = LLM(..., logits_processors=[NGramPerReqLogitsProcessor])
```
Tables not rendering correctly
Add the table token IDs to the whitelist:
```python
whitelist_token_ids={128821, 128822}  # <td> and </td>
```
Multi-GPU inference
```python
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    tensor_parallel_size=4,  # number of GPUs
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)
```
Key Files
```
DeepSeek-OCR-master/
├── DeepSeek-OCR-vllm/
│   ├── config.py                   # vLLM configuration
│   ├── run_dpsk_ocr_image.py       # Single image inference
│   ├── run_dpsk_ocr_pdf.py         # PDF batch inference
│   └── run_dpsk_ocr_eval_batch.py  # Benchmark evaluation
└── DeepSeek-OCR-hf/
    └── run_dpsk_ocr.py             # HuggingFace Transformers inference
```