Rust implementation of the DeepSeek-OCR inference stack with a fast CLI and an OpenAI-compatible HTTP server. The workspace packages multiple OCR backends, prompt tooling, and a serving layer so you can build document understanding pipelines that run locally on CPU, Apple Metal, or (alpha) NVIDIA CUDA GPUs.
中文文档请看 README_CN.md。
Want ready-made binaries? Latest macOS (Metal-enabled) and Windows bundles live in the build-binaries workflow artifacts. Grab them from the newest green run.
| Model | Memory footprint* | Best on | When to pick it |
|---|---|---|---|
| DeepSeek‑OCR | ≈6.3GB FP16 weights, ≈13GB RAM/VRAM with cache & activations (512-token budget) | Apple Silicon + Metal (FP16), high-VRAM NVIDIA GPUs, 32GB+ RAM desktops | Highest accuracy, SAM+CLIP global/local context, MoE DeepSeek‑V2 decoder (3B params, ~570M active per token). Use when latency is secondary to quality. |
| PaddleOCR‑VL | ≈4.7GB FP16 weights, ≈9GB RAM/VRAM with cache & activations | 16GB laptops, CPU-only boxes, mid-range GPUs | Dense 0.9B Ernie decoder with SigLIP vision tower. Faster startup, lower memory, great for batch jobs or lightweight deployments. |
| DotsOCR | ≈9GB FP16 weights, but expect 30–50GB RAM/VRAM for high-res docs due to huge vision tokens | Apple Silicon + Metal BF16, ≥24GB CUDA cards, or 64GB RAM CPU workstations | Unified VLM (DotsVision + Qwen2) that nails layout, reading order, grounding, and multilingual math if you can tolerate the latency and memory bill. |
*Measured from the default FP16 safetensors. Runtime footprint varies with sequence length.
Guidance:
- Need maximum fidelity, multi-region reasoning, or already have 16–24GB VRAM? Use DeepSeek‑OCR. The hybrid SAM+CLIP tower plus DeepSeek‑V2 MoE decoder handles complex layouts best, but expect higher memory/latency.
- Deploying to CPU-only nodes, 16GB laptops, or latency-sensitive services? Choose PaddleOCR‑VL. Its dense Ernie decoder (18 layers, hidden 1024) activates fewer parameters per token and keeps memory under 10GB while staying close in quality on most docs.
- Chasing reading-order accuracy, layout grounding, or multi-page multilingual PDFs on roomy hardware? Pick DotsOCR with BF16 on Metal/CUDA. Prefill runs around 40–50 tok/s on M-series GPUs but can fall to ~12 tok/s on CPU because of the heavy vision tower.
The original DeepSeek-OCR ships as a Python + Transformers stack—powerful, but hefty to deploy and awkward to embed. Rewriting the pipeline in Rust gives us:
- Smaller deployable artifacts with zero Python runtime or conda baggage.
- Memory-safe, thread-friendly infrastructure that blends into native Rust backends.
- Unified tooling (CLI + server) running on Candle + Rocket without the Python GIL overhead.
- Drop-in compatibility with OpenAI-style clients while tuned for single-turn OCR prompts.
- Candle for tensor compute, with Metal and CUDA backends and FlashAttention support.
- Rocket + async streaming for OpenAI-compatible
/v1/responsesand/v1/chat/completions. - tokenizers (upstream DeepSeek release) wrapped by
crates/assetsfor deterministic caching via Hugging Face and ModelScope mirrors. - Pure Rust vision/prompt pipeline shared by CLI and server to avoid duplicated logic.
- Faster cold-start on Apple Silicon, lower RSS, and native binary distribution.
- Deterministic dual-source (Hugging Face + ModelScope) asset download + verification built into the workspace.
- Automatic single-turn chat compaction so OCR outputs stay stable even when clients send history.
- Ready-to-use OpenAI compatibility for tools like Open WebUI without adapters.
- One repo, two entrypoints – a batteries-included CLI for batch jobs and a Rocket-based server that speaks
/v1/responsesand/v1/chat/completions. - Works out of the box – pulls model weights, configs, and tokenizer from whichever of Hugging Face or ModelScope responds fastest on first run.
- Optimised for Apple Silicon – optional Metal backend with FP16 execution for real-time OCR on laptops.
- CUDA (alpha) – experimental support via
--features cuda+--device cuda --dtype f16; expect rough edges while we finish kernel coverage. - Intel MKL (preview) – faster BLAS on x86 via
--features mkl(install Intel oneMKL beforehand). - OpenAI client compatibility – drop-in replacement for popular SDKs; the server automatically collapses chat history to the latest user turn for OCR-friendly prompts.
The workspace exposes three base model IDs plus DSQ-quantized variants for DeepSeek‑OCR, PaddleOCR‑VL, and DotsOCR:
| Model ID | Base Model | Precision | Suggested Use Case |
|---|---|---|---|
deepseek-ocr |
deepseek-ocr |
FP16 (select via --dtype) |
Full-fidelity DeepSeek‑OCR stack with SAM+CLIP + MoE decoder; use when you prioritise quality on capable Metal/CUDA/CPU hosts. |
deepseek-ocr-q4k |
deepseek-ocr |
Q4_K |
Tight VRAM, local deployments, and batch jobs that still want DeepSeek’s SAM+CLIP pipeline. |
deepseek-ocr-q6k |
deepseek-ocr |
Q6_K |
Day‑to‑day balance of quality and size on mid‑range GPUs. |
deepseek-ocr-q8k |
deepseek-ocr |
Q8_0 |
Stay close to full‑precision quality with manageable memory savings. |
paddleocr-vl |
paddleocr-vl |
FP16 (select via --dtype) |
Default choice for lighter hardware; 0.9B Ernie + SigLIP tower with strong doc/table OCR and low latency. |
paddleocr-vl-q4k |
paddleocr-vl |
Q4_K |
Heavily compressed doc/table deployments with aggressive memory budgets. |
paddleocr-vl-q6k |
paddleocr-vl |
Q6_K |
Common engineering setups; blends accuracy and footprint. |
paddleocr-vl-q8k |
paddleocr-vl |
Q8_0 |
Accuracy‑leaning deployments that still want a smaller footprint than FP16. |
dots-ocr |
dots-ocr |
FP16 / BF16 (via --dtype) |
DotsVision + Qwen2 VLM for high‑precision layout, reading order, grounding, and multilingual docs; expect high memory (30–50GB on large pages). |
dots-ocr-q4k |
dots-ocr |
Q4_K |
Sidecar DSQ snapshot over the DotsOCR baseline; reduces weight memory/compute while keeping the heavy vision token profile unchanged. |
dots-ocr-q6k |
dots-ocr |
Q6_K |
Recommended balance of size and quality when you already accept DotsOCR’s memory footprint but want cheaper weights. |
dots-ocr-q8k |
dots-ocr |
Q8_0 |
Accuracy‑leaning DotsOCR deployment that stays close to FP16/BF16 quality with modest memory savings. |
- Rust 1.78+ (edition 2024 support)
- Git
- Optional: Apple Silicon running macOS 13+ for Metal acceleration
- Optional: CUDA 12.2+ toolkit + driver for experimental NVIDIA GPU acceleration on Linux/Windows
- Optional: Intel oneAPI MKL for preview x86 acceleration (see below)
- (Recommended) Hugging Face account with
HF_TOKENwhen pulling from thedeepseek-ai/DeepSeek-OCRrepo (ModelScope is used automatically when it’s faster/reachable).
git clone https://github.com/TimmyOVO/deepseek-ocr.rs.git
cd deepseek-ocr.rs
cargo fetchThe first invocation of the CLI or server downloads the config, tokenizer, and model-00001-of-000001.safetensors (~6.3GB) into DeepSeek-OCR/. To prefetch manually:
cargo run -p deepseek-ocr-cli --release -- --help # dev profile is extremely slow; always prefer --releaseAlways include
--releasewhen running from source; debug builds on this model are extremely slow. SetHF_HOME/HF_TOKENif you store Hugging Face caches elsewhere (ModelScope downloads land alongside the same asset tree). The full model package is ~6.3GB on disk and typically requires ~13GB of RAM headroom during inference (model + activations).
The CLI and server share the same configuration. On first launch we create a config.toml populated with defaults; later runs reuse it so both entrypoints stay in sync.
| Platform | Config file (default) | Model cache root |
|---|---|---|
| Linux | ~/.config/deepseek-ocr/config.toml |
~/.cache/deepseek-ocr/models/<id>/… |
| macOS | ~/Library/Application Support/deepseek-ocr/config.toml |
~/Library/Caches/deepseek-ocr/models/<id>/… |
| Windows | %APPDATA%\deepseek-ocr\config.toml |
%LOCALAPPDATA%\deepseek-ocr\models\<id>\… |
- Override the location with
--config /path/to/config.toml(available on both CLI and server). Missing files are created automatically. - Each
[models.entries."<id>"]record can point to customconfig,tokenizer, orweightsfiles. When omitted we fall back to the cache directory above and download/update assets as required. - Runtime values resolve in this order: command-line flags → values stored in
config.toml→ built-in defaults. The HTTP API adds a final layer where request payload fields (for examplemax_tokens) override everything else for that call.
The generated file starts with the defaults below; adjust them to persistently change behaviour:
[models]
active = "deepseek-ocr"
[models.entries.deepseek-ocr]
[inference]
device = "cpu"
template = "plain"
base_size = 1024
image_size = 640
crop_mode = true
max_new_tokens = 512
use_cache = true
[server]
host = "0.0.0.0"
port = 8000[models]picks the active model and lets you add more entries (each entry can point to its own config/tokenizer/weights).[inference]controls notebook-friendly defaults shared by the CLI and server (device, template, vision sizing, decoding budget, cache usage).[server]sets the network binding and the model identifier reported by/v1/models.
See crates/cli/README.md and crates/server/README.md for concise override tables.
Single-request Rust CLI (Accelerate backend on macOS) compared with the reference Python pipeline on the same prompt and image:
| Stage | ref total (ms) | ref avg (ms) | python total | python/ref |
|---|---|---|---|---|
Decode – Overall (decode.generate) |
30077.840 | 30077.840 | 56554.873 | 1.88x |
Decode – Token Loop (decode.iterative) |
26930.216 | 26930.216 | 39227.974 | 1.46x |
Decode – Prompt Prefill (decode.prefill) |
3147.337 | 3147.337 | 5759.684 | 1.83x |
Prompt – Build Tokens (prompt.build_tokens) |
0.466 | 0.466 | 45.434 | 97.42x |
Prompt – Render Template (prompt.render) |
0.005 | 0.005 | 0.019 | 3.52x |
Vision – Embed Images (vision.compute_embeddings) |
6391.435 | 6391.435 | 3953.459 | 0.62x |
Vision – Prepare Inputs (vision.prepare_inputs) |
62.524 | 62.524 | 45.438 | 0.73x |
Build and run directly from the workspace:
cargo run -p deepseek-ocr-cli --release -- \
--prompt "<image>\n<|grounding|>Convert this receipt to markdown." \
--image baselines/sample/images/test.png \
--device cpu --max-new-tokens 512Tip:
--releaseis required for reasonable throughput; debug builds can be 10x slower.
macOS tip: append
--features metalto thecargo run/cargo buildcommands to compile with Accelerate + Metal backends.CUDA tip (Linux/Windows): append
--features cudaand run with--device cuda --dtype f16to target NVIDIA GPUs—feature is still alpha, so be ready for quirks.Intel MKL preview: install Intel oneMKL, then build with
--features mklfor faster CPU matmuls on x86.
Install the CLI as a binary:
cargo install --path crates/cli
deepseek-ocr-cli --helpKey flags:
--prompt/--prompt-file: text with<image>slots--image: path(s) matching<image>placeholders--deviceand--dtype: choosemetal+f16on Apple Silicon orcuda+f16on NVIDIA GPUs--max-new-tokens: decoding budget- Sampling controls:
--do-sample,--temperature,--top-p,--top-k,--repetition-penalty,--no-repeat-ngram-size,--seed- By default decoding stays deterministic (
do_sample=false,temperature=0.0,no_repeat_ngram_size=20) - To use stochastic sampling set
--do-sample true --temperature 0.8(and optionally adjust the other knobs)
- By default decoding stays deterministic (
The autogenerated config.toml now lists three entries:
deepseek-ocr(default) – the original DeepSeek vision-language stack.paddleocr-vl– the PaddleOCR-VL 0.9B SigLIP + Ernie release.dots-ocr– the Candle port of dots.ocr with DotsVision + Qwen2 (use BF16 on Metal/CUDA if possible; see the release matrix for memory notes).
Pick which one to load via --model:
deepseek-ocr-cli --model paddleocr-vl --prompt "<image> Summarise"The CLI (and server) will download the matching config/tokenizer/weights from the appropriate repository (deepseek-ai/DeepSeek-OCR, PaddlePaddle/PaddleOCR-VL, or dots-ocr) into your cache on first use. You can still override paths with --model-config, --tokenizer, or --weights if you maintain local fine-tunes.
Launch an OpenAI-compatible endpoint:
cargo run -p deepseek-ocr-server --release -- \
--host 0.0.0.0 --port 8000 \
--device cpu --max-new-tokens 512Keep
--releaseon the server as well; the debug profile is far too slow for inference workloads. macOS tip: add--features metalto thecargo run -p deepseek-ocr-servercommand when you want the server binary to link against Accelerate + Metal (and pair it with--device metalat runtime).CUDA tip: add
--features cudaand start the server with--device cuda --dtype f16to offload inference to NVIDIA GPUs (alpha-quality support).Intel MKL preview: install Intel oneMKL before building with
--features mklto accelerate CPU workloads on x86.
Notes:
- Use
data:URLs or remotehttp(s)links; local paths are rejected. - The server collapses multi-turn chat inputs to the latest user message to keep prompts OCR-friendly.
- Works out of the box with tools such as Open WebUI or any OpenAI-compatible client—just point the base URL to your server (
http://localhost:8000/v1) and select either thedeepseek-ocrorpaddleocr-vlmodel ID exposed in/v1/models. - Adjust the request body limit with Rocket config if you routinely send large images.
- Metal (macOS 13+ Apple Silicon) – pass
--device metal --dtype f16and build binaries with--features metalso Candle links against Accelerate + Metal. - CUDA (alpha, NVIDIA GPUs) – install CUDA 12.2+ toolkits, build with
--features cuda, and launch the CLI/server with--device cuda --dtype f16; still experimental. - Intel MKL (preview) – install Intel oneMKL and build with
--features mklto speed up CPU workloads on x86. - For either backend, prefer release builds (e.g.
cargo build --release -p deepseek-ocr-cli --features metal|cuda) to maximise throughput. - Combine GPU runs with
--max-new-tokensand crop tuning flags to balance latency vs. quality.
crates/core– shared inference pipeline, model loaders, conversation templates.crates/cli– command-line frontend (deepseek-ocr-cli).crates/server– Rocket server exposing OpenAI-compatible endpoints.crates/assets– asset management (configuration, tokenizer, Hugging Face + ModelScope download helpers).baselines/– reference inputs and outputs for regression testing.
Detailed CLI usage lives in crates/cli/README.md. The server’s OpenAI-compatible interface is covered in crates/server/README.md.
- Where do assets come from? – downloads automatically pick between Hugging Face and ModelScope based on latency; the CLI prints the chosen source for each file.
- Slow first response – model load and GPU warm-up (Metal/CUDA alpha) happen on the initial request; later runs are faster.
- Large image rejection – increase Rocket JSON limits in
crates/server/src/main.rsor downscale the input.
- ✅ Apple Metal backend with FP16 support and CLI/server parity on macOS.
- ✅ NVIDIA CUDA backend (alpha) – build with
--features cuda, run with--device cuda --dtype f16for Linux/Windows GPUs; polishing in progress. - 🔄 Parity polish – finish projector normalisation + crop tiling alignment; extend intermediate-tensor diff suite beyond the current sample baseline.
- 🔄 Grounding & streaming – port the Python post-processing helpers (box extraction, markdown polish) and refine SSE streaming ergonomics.
- 🔄 Cross-platform acceleration – continue tuning CUDA kernels, add automatic device detection across CPU/Metal/CUDA, and publish opt-in GPU benchmarks.
- 🔄 Packaging & Ops – ship binary releases with deterministic asset checksums, richer logging/metrics, and Helm/docker references for server deploys.
- 🔜 Structured outputs – optional JSON schema tools for downstream automation once parity gaps close.
This repository inherits the licenses of its dependencies and the upstream DeepSeek-OCR model. Refer to DeepSeek-OCR/LICENSE for model terms and apply the same restrictions to downstream use.
