deepseek-ocr.rs 🚀

Rust implementation of the DeepSeek-OCR inference stack with a fast CLI and an OpenAI-compatible HTTP server. The workspace packages multiple OCR backends, prompt tooling, and a serving layer so you can build document understanding pipelines that run locally on CPU, Apple Metal, or (alpha) NVIDIA CUDA GPUs.

中文文档请看 README_CN.md。

Want ready-made binaries? Latest macOS (Metal-enabled) and Windows bundles live in the build-binaries workflow artifacts. Grab them from the newest green run.

Choosing a Model 🔬

Model	Memory footprint*	Best on	When to pick it
DeepSeek‑OCR	≈6.3GB FP16 weights, ≈13GB RAM/VRAM with cache & activations (512-token budget)	Apple Silicon + Metal (FP16), high-VRAM NVIDIA GPUs, 32GB+ RAM desktops	Highest accuracy, SAM+CLIP global/local context, MoE DeepSeek‑V2 decoder (3B params, ~570M active per token). Use when latency is secondary to quality.
PaddleOCR‑VL	≈4.7GB FP16 weights, ≈9GB RAM/VRAM with cache & activations	16GB laptops, CPU-only boxes, mid-range GPUs	Dense 0.9B Ernie decoder with SigLIP vision tower. Faster startup, lower memory, great for batch jobs or lightweight deployments.
DotsOCR	≈9GB FP16 weights, but expect 30–50GB RAM/VRAM for high-res docs due to huge vision tokens	Apple Silicon + Metal BF16, ≥24GB CUDA cards, or 64GB RAM CPU workstations	Unified VLM (DotsVision + Qwen2) that nails layout, reading order, grounding, and multilingual math if you can tolerate the latency and memory bill.

*Measured from the default FP16 safetensors. Runtime footprint varies with sequence length.

Guidance:

Need maximum fidelity, multi-region reasoning, or already have 16–24GB VRAM? Use DeepSeek‑OCR. The hybrid SAM+CLIP tower plus DeepSeek‑V2 MoE decoder handles complex layouts best, but expect higher memory/latency.
Deploying to CPU-only nodes, 16GB laptops, or latency-sensitive services? Choose PaddleOCR‑VL. Its dense Ernie decoder (18 layers, hidden 1024) activates fewer parameters per token and keeps memory under 10GB while staying close in quality on most docs.
Chasing reading-order accuracy, layout grounding, or multi-page multilingual PDFs on roomy hardware? Pick DotsOCR with BF16 on Metal/CUDA. Prefill runs around 40–50 tok/s on M-series GPUs but can fall to ~12 tok/s on CPU because of the heavy vision tower.

Why Rust? 💡

The original DeepSeek-OCR ships as a Python + Transformers stack—powerful, but hefty to deploy and awkward to embed. Rewriting the pipeline in Rust gives us:

Smaller deployable artifacts with zero Python runtime or conda baggage.
Memory-safe, thread-friendly infrastructure that blends into native Rust backends.
Unified tooling (CLI + server) running on Candle + Rocket without the Python GIL overhead.
Drop-in compatibility with OpenAI-style clients while tuned for single-turn OCR prompts.

Technical Stack ⚙️

Candle for tensor compute, with Metal and CUDA backends and FlashAttention support.
Rocket + async streaming for OpenAI-compatible /v1/responses and /v1/chat/completions.
tokenizers (upstream DeepSeek release) wrapped by crates/assets for deterministic caching via Hugging Face and ModelScope mirrors.
Pure Rust vision/prompt pipeline shared by CLI and server to avoid duplicated logic.

Advantages over the Python Release 🥷

Faster cold-start on Apple Silicon, lower RSS, and native binary distribution.
Deterministic dual-source (Hugging Face + ModelScope) asset download + verification built into the workspace.
Automatic single-turn chat compaction so OCR outputs stay stable even when clients send history.
Ready-to-use OpenAI compatibility for tools like Open WebUI without adapters.

Highlights ✨

One repo, two entrypoints – a batteries-included CLI for batch jobs and a Rocket-based server that speaks /v1/responses and /v1/chat/completions.
Works out of the box – pulls model weights, configs, and tokenizer from whichever of Hugging Face or ModelScope responds fastest on first run.
Optimised for Apple Silicon – optional Metal backend with FP16 execution for real-time OCR on laptops.
CUDA (alpha) – experimental support via --features cuda + --device cuda --dtype f16; expect rough edges while we finish kernel coverage.
Intel MKL (preview) – faster BLAS on x86 via --features mkl (install Intel oneMKL beforehand).
OpenAI client compatibility – drop-in replacement for popular SDKs; the server automatically collapses chat history to the latest user turn for OCR-friendly prompts.

Model Matrix 📦

The workspace exposes three base model IDs plus DSQ-quantized variants for DeepSeek‑OCR, PaddleOCR‑VL, and DotsOCR:

Model ID	Base Model	Precision	Suggested Use Case
`deepseek-ocr`	`deepseek-ocr`	FP16 (select via `--dtype`)	Full-fidelity DeepSeek‑OCR stack with SAM+CLIP + MoE decoder; use when you prioritise quality on capable Metal/CUDA/CPU hosts.
`deepseek-ocr-q4k`	`deepseek-ocr`	`Q4_K`	Tight VRAM, local deployments, and batch jobs that still want DeepSeek’s SAM+CLIP pipeline.
`deepseek-ocr-q6k`	`deepseek-ocr`	`Q6_K`	Day‑to‑day balance of quality and size on mid‑range GPUs.
`deepseek-ocr-q8k`	`deepseek-ocr`	`Q8_0`	Stay close to full‑precision quality with manageable memory savings.
`paddleocr-vl`	`paddleocr-vl`	FP16 (select via `--dtype`)	Default choice for lighter hardware; 0.9B Ernie + SigLIP tower with strong doc/table OCR and low latency.
`paddleocr-vl-q4k`	`paddleocr-vl`	`Q4_K`	Heavily compressed doc/table deployments with aggressive memory budgets.
`paddleocr-vl-q6k`	`paddleocr-vl`	`Q6_K`	Common engineering setups; blends accuracy and footprint.
`paddleocr-vl-q8k`	`paddleocr-vl`	`Q8_0`	Accuracy‑leaning deployments that still want a smaller footprint than FP16.
`dots-ocr`	`dots-ocr`	FP16 / BF16 (via `--dtype`)	DotsVision + Qwen2 VLM for high‑precision layout, reading order, grounding, and multilingual docs; expect high memory (30–50GB on large pages).
`dots-ocr-q4k`	`dots-ocr`	`Q4_K`	Sidecar DSQ snapshot over the DotsOCR baseline; reduces weight memory/compute while keeping the heavy vision token profile unchanged.
`dots-ocr-q6k`	`dots-ocr`	`Q6_K`	Recommended balance of size and quality when you already accept DotsOCR’s memory footprint but want cheaper weights.
`dots-ocr-q8k`	`dots-ocr`	`Q8_0`	Accuracy‑leaning DotsOCR deployment that stays close to FP16/BF16 quality with modest memory savings.

Quick Start 🏁

Prerequisites

Rust 1.78+ (edition 2024 support)
Git
Optional: Apple Silicon running macOS 13+ for Metal acceleration
Optional: CUDA 12.2+ toolkit + driver for experimental NVIDIA GPU acceleration on Linux/Windows
Optional: Intel oneAPI MKL for preview x86 acceleration (see below)
(Recommended) Hugging Face account with HF_TOKEN when pulling from the deepseek-ai/DeepSeek-OCR repo (ModelScope is used automatically when it’s faster/reachable).

Clone the Workspace

git clone https://github.com/TimmyOVO/deepseek-ocr.rs.git
cd deepseek-ocr.rs
cargo fetch

Model Assets

The first invocation of the CLI or server downloads the config, tokenizer, and model-00001-of-000001.safetensors (~6.3GB) into DeepSeek-OCR/. To prefetch manually:

cargo run -p deepseek-ocr-cli --release -- --help # dev profile is extremely slow; always prefer --release

Always include --release when running from source; debug builds on this model are extremely slow. Set HF_HOME/HF_TOKEN if you store Hugging Face caches elsewhere (ModelScope downloads land alongside the same asset tree). The full model package is ~6.3GB on disk and typically requires ~13GB of RAM headroom during inference (model + activations).

Configuration & Overrides 🗂️

The CLI and server share the same configuration. On first launch we create a config.toml populated with defaults; later runs reuse it so both entrypoints stay in sync.

Platform	Config file (default)	Model cache root
Linux	`~/.config/deepseek-ocr/config.toml`	`~/.cache/deepseek-ocr/models/<id>/…`
macOS	`~/Library/Application Support/deepseek-ocr/config.toml`	`~/Library/Caches/deepseek-ocr/models/<id>/…`
Windows	`%APPDATA%\deepseek-ocr\config.toml`	`%LOCALAPPDATA%\deepseek-ocr\models\<id>\…`

Override the location with --config /path/to/config.toml (available on both CLI and server). Missing files are created automatically.
Each [models.entries."<id>"] record can point to custom config, tokenizer, or weights files. When omitted we fall back to the cache directory above and download/update assets as required.
Runtime values resolve in this order: command-line flags → values stored in config.toml → built-in defaults. The HTTP API adds a final layer where request payload fields (for example max_tokens) override everything else for that call.

The generated file starts with the defaults below; adjust them to persistently change behaviour:

[models]
active = "deepseek-ocr"

[models.entries.deepseek-ocr]

[inference]
device = "cpu"
template = "plain"
base_size = 1024
image_size = 640
crop_mode = true
max_new_tokens = 512
use_cache = true

[server]
host = "0.0.0.0"
port = 8000

[models] picks the active model and lets you add more entries (each entry can point to its own config/tokenizer/weights).
[inference] controls notebook-friendly defaults shared by the CLI and server (device, template, vision sizing, decoding budget, cache usage).
[server] sets the network binding and the model identifier reported by /v1/models.

See crates/cli/README.md and crates/server/README.md for concise override tables.

Benchmark Snapshot 📊

Single-request Rust CLI (Accelerate backend on macOS) compared with the reference Python pipeline on the same prompt and image:

Stage	ref total (ms)	ref avg (ms)	python total	python/ref
Decode – Overall (`decode.generate`)	30077.840	30077.840	56554.873	1.88x
Decode – Token Loop (`decode.iterative`)	26930.216	26930.216	39227.974	1.46x
Decode – Prompt Prefill (`decode.prefill`)	3147.337	3147.337	5759.684	1.83x
Prompt – Build Tokens (`prompt.build_tokens`)	0.466	0.466	45.434	97.42x
Prompt – Render Template (`prompt.render`)	0.005	0.005	0.019	3.52x
Vision – Embed Images (`vision.compute_embeddings`)	6391.435	6391.435	3953.459	0.62x
Vision – Prepare Inputs (`vision.prepare_inputs`)	62.524	62.524	45.438	0.73x

Command-Line Interface 🖥️

Build and run directly from the workspace:

cargo run -p deepseek-ocr-cli --release -- \
  --prompt "<image>\n<|grounding|>Convert this receipt to markdown." \
  --image baselines/sample/images/test.png \
  --device cpu --max-new-tokens 512

Tip: --release is required for reasonable throughput; debug builds can be 10x slower.

macOS tip: append --features metal to the cargo run/cargo build commands to compile with Accelerate + Metal backends.

CUDA tip (Linux/Windows): append --features cuda and run with --device cuda --dtype f16 to target NVIDIA GPUs—feature is still alpha, so be ready for quirks.

Intel MKL preview: install Intel oneMKL, then build with --features mkl for faster CPU matmuls on x86.

Install the CLI as a binary:

cargo install --path crates/cli
deepseek-ocr-cli --help

Key flags:

--prompt / --prompt-file: text with <image> slots
--image: path(s) matching <image> placeholders
--device and --dtype: choose metal + f16 on Apple Silicon or cuda + f16 on NVIDIA GPUs
--max-new-tokens: decoding budget
Sampling controls: --do-sample, --temperature, --top-p, --top-k, --repetition-penalty, --no-repeat-ngram-size, --seed
- By default decoding stays deterministic (do_sample=false, temperature=0.0, no_repeat_ngram_size=20)
- To use stochastic sampling set --do-sample true --temperature 0.8 (and optionally adjust the other knobs)

Switching Models

The autogenerated config.toml now lists three entries:

deepseek-ocr (default) – the original DeepSeek vision-language stack.
paddleocr-vl – the PaddleOCR-VL 0.9B SigLIP + Ernie release.
dots-ocr – the Candle port of dots.ocr with DotsVision + Qwen2 (use BF16 on Metal/CUDA if possible; see the release matrix for memory notes).

Pick which one to load via --model:

deepseek-ocr-cli --model paddleocr-vl --prompt "<image> Summarise"

The CLI (and server) will download the matching config/tokenizer/weights from the appropriate repository (deepseek-ai/DeepSeek-OCR, PaddlePaddle/PaddleOCR-VL, or dots-ocr) into your cache on first use. You can still override paths with --model-config, --tokenizer, or --weights if you maintain local fine-tunes.

HTTP Server ☁️

Launch an OpenAI-compatible endpoint:

cargo run -p deepseek-ocr-server --release -- \
  --host 0.0.0.0 --port 8000 \
  --device cpu --max-new-tokens 512

Keep --release on the server as well; the debug profile is far too slow for inference workloads. macOS tip: add --features metal to the cargo run -p deepseek-ocr-server command when you want the server binary to link against Accelerate + Metal (and pair it with --device metal at runtime).

CUDA tip: add --features cuda and start the server with --device cuda --dtype f16 to offload inference to NVIDIA GPUs (alpha-quality support).

Intel MKL preview: install Intel oneMKL before building with --features mkl to accelerate CPU workloads on x86.

Notes:

Use data: URLs or remote http(s) links; local paths are rejected.
The server collapses multi-turn chat inputs to the latest user message to keep prompts OCR-friendly.
Works out of the box with tools such as Open WebUI or any OpenAI-compatible client—just point the base URL to your server (http://localhost:8000/v1) and select either the deepseek-ocr or paddleocr-vl model ID exposed in /v1/models.
Adjust the request body limit with Rocket config if you routinely send large images.

GPU Acceleration ⚡

Metal (macOS 13+ Apple Silicon) – pass --device metal --dtype f16 and build binaries with --features metal so Candle links against Accelerate + Metal.
CUDA (alpha, NVIDIA GPUs) – install CUDA 12.2+ toolkits, build with --features cuda, and launch the CLI/server with --device cuda --dtype f16; still experimental.
Intel MKL (preview) – install Intel oneMKL and build with --features mkl to speed up CPU workloads on x86.
For either backend, prefer release builds (e.g. cargo build --release -p deepseek-ocr-cli --features metal|cuda) to maximise throughput.
Combine GPU runs with --max-new-tokens and crop tuning flags to balance latency vs. quality.

Repository Layout 🗂️

crates/core – shared inference pipeline, model loaders, conversation templates.
crates/cli – command-line frontend (deepseek-ocr-cli).
crates/server – Rocket server exposing OpenAI-compatible endpoints.
crates/assets – asset management (configuration, tokenizer, Hugging Face + ModelScope download helpers).
baselines/ – reference inputs and outputs for regression testing.

Detailed CLI usage lives in crates/cli/README.md. The server’s OpenAI-compatible interface is covered in crates/server/README.md.

Troubleshooting 🛠️

Where do assets come from? – downloads automatically pick between Hugging Face and ModelScope based on latency; the CLI prints the chosen source for each file.
Slow first response – model load and GPU warm-up (Metal/CUDA alpha) happen on the initial request; later runs are faster.
Large image rejection – increase Rocket JSON limits in crates/server/src/main.rs or downscale the input.

Roadmap 🗺️

✅ Apple Metal backend with FP16 support and CLI/server parity on macOS.
✅ NVIDIA CUDA backend (alpha) – build with --features cuda, run with --device cuda --dtype f16 for Linux/Windows GPUs; polishing in progress.
🔄 Parity polish – finish projector normalisation + crop tiling alignment; extend intermediate-tensor diff suite beyond the current sample baseline.
🔄 Grounding & streaming – port the Python post-processing helpers (box extraction, markdown polish) and refine SSE streaming ergonomics.
🔄 Cross-platform acceleration – continue tuning CUDA kernels, add automatic device detection across CPU/Metal/CUDA, and publish opt-in GPU benchmarks.
🔄 Packaging & Ops – ship binary releases with deterministic asset checksums, richer logging/metrics, and Helm/docker references for server deploys.
🔜 Structured outputs – optional JSON schema tools for downstream automation once parity gaps close.

License 📄

This repository inherits the licenses of its dependencies and the upstream DeepSeek-OCR model. Refer to DeepSeek-OCR/LICENSE for model terms and apply the same restrictions to downstream use.

Name		Name	Last commit message	Last commit date
Latest commit History 116 Commits
.github		.github
assets		assets
baselines		baselines
crates		crates
docker		docker
scripts		scripts
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
README_CN.md		README_CN.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

deepseek-ocr.rs 🚀

Choosing a Model 🔬

Why Rust? 💡

Technical Stack ⚙️

Advantages over the Python Release 🥷

Highlights ✨

Model Matrix 📦

Quick Start 🏁

Prerequisites

Clone the Workspace

Model Assets

Configuration & Overrides 🗂️

Benchmark Snapshot 📊

Command-Line Interface 🖥️

Switching Models

HTTP Server ☁️

GPU Acceleration ⚡

Repository Layout 🗂️

Troubleshooting 🛠️

Roadmap 🗺️

License 📄

About

Uh oh!

Releases

Packages

Languages

License

0xSojalSec/deepseek-ocr.rs

Folders and files

Latest commit

History

Repository files navigation

deepseek-ocr.rs 🚀

Choosing a Model 🔬

Why Rust? 💡

Technical Stack ⚙️

Advantages over the Python Release 🥷

Highlights ✨

Model Matrix 📦

Quick Start 🏁

Prerequisites

Clone the Workspace

Model Assets

Configuration & Overrides 🗂️

Benchmark Snapshot 📊

Command-Line Interface 🖥️

Switching Models

HTTP Server ☁️

GPU Acceleration ⚡

Repository Layout 🗂️

Troubleshooting 🛠️

Roadmap 🗺️

License 📄

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages