OpenDataLoader PDF

PDF Parsing for RAG — Convert to Markdown & JSON, Fast, Local, No GPU

Convert PDFs into LLM-ready Markdown and JSON with accurate reading order, table extraction, and bounding boxes — all running locally on your machine.

Why developers choose OpenDataLoader:

Deterministic — Same input always produces same output (no LLM hallucinations)
Fast — Process 100+ pages per second on CPU
Private — 100% local, zero data transmission
Accurate — Bounding boxes for every element, correct multi-column reading order

pip install -U opendataloader-pdf

import opendataloader_pdf

# PDF to Markdown for RAG
opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="output/",
    format="markdown,json"
)

Why OpenDataLoader?

Building RAG pipelines? You've probably hit these problems:

Problem	How We Solve It
Multi-column text reads left-to-right incorrectly	XY-Cut++ algorithm preserves correct reading order
Tables lose structure	Border + cluster detection keeps rows/columns intact
Headers/footers pollute context	Auto-filtered before output
No coordinates for citations	Bounding box for every element
Cloud APIs = privacy concerns	100% local, no data leaves your machine
GPU required	Pure CPU, rule-based — runs anywhere

Key Features

For RAG & LLM Pipelines

Structured Output — JSON with semantic types (heading, paragraph, table, list, caption)
Bounding Boxes — Every element includes [x1, y1, x2, y2] coordinates for citations
Reading Order — XY-Cut++ algorithm handles multi-column layouts correctly
Noise Filtering — Headers, footers, hidden text, watermarks auto-removed
LangChain Integration — Official document loader
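
To make the reading-order idea concrete, here is a minimal sketch of the classic recursive XY-cut, which splits a page along its widest whitespace gap and recurses. It is not the library's XY-Cut++ implementation. The function names, the `min_gap` threshold, and the sample boxes are assumptions made for illustration. Boxes follow the `[left, bottom, right, top]` convention in PDF points.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top) in PDF points


def widest_gap(intervals: List[Tuple[float, float]]) -> Tuple[float, Optional[float]]:
    """Return (gap_size, cut_position) for the widest empty gap between 1-D intervals."""
    best: Tuple[float, Optional[float]] = (0.0, None)
    ordered = sorted(intervals)
    reach = ordered[0][1]  # furthest coordinate covered so far
    for start, end in ordered[1:]:
        if start - reach > best[0]:
            best = (start - reach, (start + reach) / 2)
        reach = max(reach, end)
    return best


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Order boxes by recursively cutting along the widest whitespace gap."""
    if len(boxes) <= 1:
        return list(boxes)
    x_gap, x_cut = widest_gap([(b[0], b[2]) for b in boxes])
    y_gap, y_cut = widest_gap([(b[1], b[3]) for b in boxes])
    gap = max(x_gap, y_gap)
    if gap <= 0 or gap < min_gap:
        # No clear whitespace: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (-b[3], b[0]))
    if y_gap >= x_gap:
        # Horizontal cut: everything above the cut is read first.
        upper = [b for b in boxes if b[1] >= y_cut]
        lower = [b for b in boxes if b[1] < y_cut]
        return xy_cut(upper, min_gap) + xy_cut(lower, min_gap)
    # Vertical cut: the left column is read before the right column.
    left = [b for b in boxes if b[0] < x_cut]
    right = [b for b in boxes if b[0] >= x_cut]
    return xy_cut(left, min_gap) + xy_cut(right, min_gap)


# Hypothetical page: a full-width title above two columns of two paragraphs each.
title = (72.0, 700.0, 540.0, 730.0)
left1 = (72.0, 500.0, 290.0, 680.0)
left2 = (72.0, 300.0, 290.0, 490.0)
right1 = (320.0, 520.0, 540.0, 680.0)
right2 = (320.0, 300.0, 540.0, 510.0)

ordered = xy_cut([right2, left1, title, right1, left2])
naive = sorted([right2, left1, title, right1, left2], key=lambda b: (-b[3], b[0]))
print(ordered == [title, left1, left2, right1, right2])
print(naive == [title, left1, right1, right2, left2])
```

On this sample, a plain top-to-bottom sort interleaves the two columns, while the recursive cut finishes the left column before starting the right one.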

Performance & Privacy

No GPU — Fast, rule-based heuristics
Local-First — Your documents never leave your machine
High Throughput — Process thousands of PDFs efficiently
Multi-Language SDK — Python, Node.js, Java, Docker

Document Understanding

Tables — Detects borders, handles merged cells
Lists — Numbered, bulleted, nested
Headings — Auto-detects hierarchy levels
Images — Extracts with captions linked
Tagged PDF Support — Uses native PDF structure when available
AI Safety — Auto-filters prompt injection content

Output Formats

Format	Use Case
JSON	Structured data with bounding boxes, semantic types
Markdown	Clean text for LLM context, RAG chunks
HTML	Web display with styling
Annotated PDF	Visual debugging: see detected structures

JSON Output Example

{
  "type": "heading",
  "id": 42,
  "level": "Title",
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "font": "Helvetica-Bold",
  "font size": 24.0,
  "text color": "[0.0]",
  "content": "Introduction"
}

Field	Description
`type`	Element type: heading, paragraph, table, list, image, caption
`id`	Unique identifier for cross-referencing
`page number`	1-indexed page reference
`bounding box`	`[left, bottom, right, top]` in PDF points
`heading level`	Heading depth (1+)
`font`, `font size`	Typography info
`content`	Extracted text

Full JSON Schema →
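
As a small illustration of how these fields might be consumed, the sketch below parses one element shaped like the example above and builds a citation-style summary from its page number and bounding box. The `describe` helper is hypothetical and only uses fields shown in the example.

```python
import json

# One element, shaped like the JSON output example (subset of fields).
element_json = """
{
  "type": "heading",
  "id": 42,
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "content": "Introduction"
}
"""


def describe(element: dict) -> str:
    """Summarize an element with its page and size (hypothetical helper)."""
    left, bottom, right, top = element["bounding box"]
    width, height = right - left, top - bottom
    return (
        f'{element["type"]} #{element["id"]} on page {element["page number"]}: '
        f'"{element["content"]}" ({width:.0f} x {height:.0f} pt)'
    )


element = json.loads(element_json)
summary = describe(element)
print(summary)  # heading #42 on page 1: "Introduction" (468 x 30 pt)
```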

Quick Start

Advanced Options

opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="output/",
    format="json,markdown,pdf",

    # Image output mode: "off", "embedded" (Base64), or "external" (default)
    image_output="embedded",

    # Image format: "png" or "jpeg"
    image_format="jpeg",

    # Tagged PDF
    use_struct_tree=True,            # Use native PDF structure
)

Full CLI Options Reference →

AI Safety

PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:

Hidden text (transparent, zero-size)
Off-page content
Suspicious invisible layers

This is enabled by default. Learn more →
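
Two of the checks listed above are purely geometric and can be sketched from bounding boxes alone. The snippet below is only an illustration of the idea, not the library's actual filter. The `is_visible` helper, the US Letter page size, and the sample elements are assumptions.

```python
# Assumption: a US Letter page measured in PDF points.
PAGE_WIDTH, PAGE_HEIGHT = 612.0, 792.0


def is_visible(bbox, page_width=PAGE_WIDTH, page_height=PAGE_HEIGHT):
    """Return True if a [left, bottom, right, top] box has area and overlaps the page."""
    left, bottom, right, top = bbox
    has_area = right > left and top > bottom
    overlaps_page = (
        left < page_width and right > 0 and bottom < page_height and top > 0
    )
    return has_area and overlaps_page


# Hypothetical elements: one normal, one zero-size, one placed off the page.
elements = [
    {"content": "Introduction", "bounding box": [72.0, 700.0, 540.0, 730.0]},
    {"content": "zero-size text", "bounding box": [100.0, 100.0, 100.0, 100.0]},
    {"content": "off-page text", "bounding box": [700.0, 100.0, 900.0, 120.0]},
]

kept = [e["content"] for e in elements if is_visible(e["bounding box"])]
print(kept)  # ['Introduction']
```

Transparent text and invisible layers cannot be detected from geometry alone, so they are outside the scope of this sketch.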

Tagged PDF Support

Why it matters: The European Accessibility Act (EAA) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.

OpenDataLoader leverages this:

When a PDF has structure tags, we extract the layout the author defined
Headings, lists, tables, and reading order are all preserved from the source
No guessing and no heuristics needed: the semantic structure is read directly from the tags

opendataloader_pdf.convert(
    input_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure tags
)

Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.

Learn more about Tagged PDF →

LangChain Integration

OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.

pip install -U langchain-opendataloader-pdf

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["document.pdf"],
    format="text"
)
documents = loader.load()

# Use with any LangChain pipeline
for doc in documents:
    print(doc.page_content[:100])

Benchmarks

We continuously benchmark against real-world documents.

View full benchmark results →

Quick Comparison

Engine	Speed (s/page)	Reading Order	Table	Heading
opendataloader	0.05	0.91	0.49	0.65
docling	0.73	0.90	0.89	0.80
pymupdf4llm	0.09	0.89	0.40	0.41
markitdown	0.04	0.88	0.00	0.00

Accuracy scores (Reading Order, Table, Heading) are normalized to [0, 1], and higher is better. Speed is measured in seconds per page, and lower is better.
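
The speed column can be turned into throughput and relative slowdown with a few lines of arithmetic. This sketch uses the rounded figures from the table, so the ratios may differ slightly from values computed on unrounded measurements.

```python
# Seconds per page, copied from the Quick Comparison table.
seconds_per_page = {
    "opendataloader": 0.05,
    "docling": 0.73,
    "pymupdf4llm": 0.09,
    "markitdown": 0.04,
}

baseline = seconds_per_page["opendataloader"]

# Throughput is the reciprocal of the per-page time.
pages_per_second = {name: 1 / s for name, s in seconds_per_page.items()}

# Per-page time relative to opendataloader (values above 1 are slower).
relative_time = {name: s / baseline for name, s in seconds_per_page.items()}

for name in seconds_per_page:
    print(f"{name:15s} {pages_per_second[name]:6.1f} pages/s  {relative_time[name]:5.1f}x")
```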

When to Use Each Engine

Use Case	Recommended Engine	Why
Best overall balance	opendataloader	Fast with high reading order accuracy
Maximum accuracy	docling	Highest scores for tables and headings, but 16x slower
Speed-critical pipelines	markitdown	Fastest, but no table/heading extraction
PyMuPDF ecosystem	pymupdf4llm	Good balance if already using PyMuPDF

Visual Comparison

Roadmap

See our upcoming features and priorities →

Documentation

Frequently Asked Questions

What is the best PDF parser for RAG?

For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.
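
As a rough illustration of why structure helps, the sketch below groups elements into chunks by heading and records the pages each chunk spans, so that a retrieved chunk can be cited. It assumes the elements arrive as a flat list in reading order. The real JSON may be nested differently, and the `chunk_by_heading` helper and sample data are hypothetical.

```python
def chunk_by_heading(elements):
    """Group elements into chunks, starting a new chunk at each heading."""
    chunks = []
    for el in elements:
        if el["type"] == "heading" or not chunks:
            title = el["content"] if el["type"] == "heading" else ""
            chunks.append({"title": title, "text": [], "pages": []})
        chunk = chunks[-1]
        if el["type"] != "heading":
            chunk["text"].append(el["content"])
        if el["page number"] not in chunk["pages"]:
            chunk["pages"].append(el["page number"])
    return chunks


# Hypothetical elements using the field names from the JSON output example.
elements = [
    {"type": "heading", "content": "Introduction", "page number": 1},
    {"type": "paragraph", "content": "PDFs are hard to parse.", "page number": 1},
    {"type": "paragraph", "content": "Layout matters for RAG.", "page number": 2},
    {"type": "heading", "content": "Methods", "page number": 2},
    {"type": "paragraph", "content": "We use rule-based parsing.", "page number": 2},
]

chunks = chunk_by_heading(elements)
for chunk in chunks:
    print(chunk["title"], chunk["pages"], " ".join(chunk["text"]))
```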

How do I extract tables from PDF for LLM?

OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.
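
To show what "formatted Markdown tables" means in practice, here is a minimal sketch that renders rows of cells as a Markdown table. It is not the library's exporter. The `to_markdown_table` helper and the row layout (a list of string lists with the header first) are assumptions.

```python
def to_markdown_table(rows):
    """Render rows of string cells as a Markdown table; the first row is the header."""

    def render(cells):
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    lines = [render(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(render(row) for row in body)
    return "\n".join(lines)


# Hypothetical extracted table, reusing rows from the Output Formats section.
rows = [
    ["Format", "Use Case"],
    ["JSON", "Structured data with bounding boxes, semantic types"],
    ["Markdown", "Clean text for LLM context, RAG chunks"],
]

markdown = to_markdown_table(rows)
print(markdown)
```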

Can I use this without sending data to the cloud?

Yes. OpenDataLoader runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for sensitive documents in legal, healthcare, and financial industries.

What makes OpenDataLoader unique?

OpenDataLoader takes a different approach from many PDF parsers:

Rule-based extraction — Deterministic output without GPU requirements
Bounding boxes for all elements — Essential for citation systems
XY-Cut++ reading order — Handles multi-column layouts correctly
Built-in AI safety filters — Protects against prompt injection
Native Tagged PDF support — Leverages accessibility metadata

This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

Mozilla Public License 2.0

Found this useful? Give us a star to help others discover OpenDataLoader.
