PDF Parsing for RAG — Convert to Markdown & JSON, Fast, Local, No GPU
Convert PDFs into LLM-ready Markdown and JSON with accurate reading order, table extraction, and bounding boxes — all running locally on your machine.
Why developers choose OpenDataLoader:
- Deterministic — Same input always produces same output (no LLM hallucinations)
- Fast — Process 100+ pages per second on CPU
- Private — 100% local, zero data transmission
- Accurate — Bounding boxes for every element, correct multi-column reading order
pip install -U opendataloader-pdf

import opendataloader_pdf
# PDF to Markdown for RAG
opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="output/",
    format="markdown,json"
)

Building RAG pipelines? You've probably hit these problems:
| Problem | How We Solve It |
|---|---|
| Multi-column text reads left-to-right incorrectly | XY-Cut++ algorithm preserves correct reading order |
| Tables lose structure | Border + cluster detection keeps rows/columns intact |
| Headers/footers pollute context | Auto-filtered before output |
| No coordinates for citations | Bounding box for every element |
| Cloud APIs = privacy concerns | 100% local, no data leaves your machine |
| GPU required | Pure CPU, rule-based — runs anywhere |
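To make the reading-order row above concrete, here is a minimal Python sketch of the classic recursive XY-cut idea. It is a simplified illustration, not OpenDataLoader's XY-Cut++ implementation: the function names (`xy_cut`, `_split_on_gaps`) and the sample boxes are assumptions made up for this example. Boxes are `(left, bottom, right, top, label)` tuples in PDF points, with the y axis pointing up.

```python
def _split_on_gaps(boxes, axis):
    """Group boxes separated by empty space along one axis (0 = x, 1 = y)."""
    lo, hi = axis, axis + 2  # indices of the min and max coordinate on this axis
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest extent covered by the current group
    for box in ordered[1:]:
        if box[lo] > reach:
            groups.append([box])  # whitespace gap found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes):
    """Return boxes in an approximate reading order using recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try horizontal cuts (gaps along y) first, then vertical cuts (gaps along x).
    for axis in (1, 0):
        groups = _split_on_gaps(boxes, axis)
        if len(groups) > 1:
            if axis == 1:
                groups.reverse()  # y grows upward, so read the top group first
            return [b for group in groups for b in xy_cut(group)]
    # No clean cut exists: fall back to top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (-b[3], b[0]))


# Hypothetical page: a full-width title above two columns of paragraphs.
page = [
    (320, 100, 540, 330, "R2"),
    (72, 420, 290, 680, "L1"),
    (72, 700, 540, 730, "title"),
    (320, 350, 540, 680, "R1"),
    (72, 100, 290, 400, "L2"),
]
order = [box[4] for box in xy_cut(page)]
print(order)
```

A naive top-to-bottom, left-to-right sort would interleave the two columns here; the recursive cut keeps each column together. The classic algorithm still fails when paragraph gaps line up across columns, which is one reason production parsers use more refined variants.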
- Structured Output — JSON with semantic types (heading, paragraph, table, list, caption)
- Bounding Boxes — Every element includes [x1, y1, x2, y2] coordinates for citations
- Reading Order — XY-Cut++ algorithm handles multi-column layouts correctly
- Noise Filtering — Headers, footers, hidden text, watermarks auto-removed
- LangChain Integration — Official document loader
- No GPU — Fast, rule-based heuristics
- Local-First — Your documents never leave your machine
- High Throughput — Process thousands of PDFs efficiently
- Multi-Language SDK — Python, Node.js, Java, Docker
- Tables — Detects borders, handles merged cells
- Lists — Numbered, bulleted, nested
- Headings — Auto-detects hierarchy levels
- Images — Extracts with captions linked
- Tagged PDF Support — Uses native PDF structure when available
- AI Safety — Auto-filters prompt injection content
| Format | Use Case |
|---|---|
| JSON | Structured data with bounding boxes, semantic types |
| Markdown | Clean text for LLM context, RAG chunks |
| HTML | Web display with styling |
| Annotated PDF | Visual debugging — see detected structures (sample) |
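As a rough illustration of the "RAG chunks" use case for Markdown output, the sketch below splits a Markdown string into one chunk per heading section using only the standard library. It assumes the Markdown uses ATX-style headings (`#`, `##`, ...); the function name `split_markdown_by_heading` and the sample text are hypothetical and not part of OpenDataLoader.

```python
import re


def split_markdown_by_heading(markdown: str) -> list:
    """Split Markdown text into chunks, one per heading section."""
    chunks = []
    current = {"heading": None, "lines": []}

    def flush():
        # Keep a chunk if it has a heading or any non-blank body text.
        if current["heading"] is not None or any(l.strip() for l in current["lines"]):
            chunks.append({
                "heading": current["heading"],
                "text": "\n".join(current["lines"]).strip(),
            })

    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s+\S", line):
            flush()
            current = {"heading": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return chunks


sample = "# Introduction\nFirst paragraph.\n\n## Methods\nSecond paragraph.\n"
chunks = split_markdown_by_heading(sample)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

This simple version does not special-case fenced code blocks, so a line starting with `#` inside a code sample would be treated as a heading.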
{
"type": "heading",
"id": 42,
"level": "Title",
"page number": 1,
"bounding box": [72.0, 700.0, 540.0, 730.0],
"heading level": 1,
"font": "Helvetica-Bold",
"font size": 24.0,
"text color": "[0.0]",
"content": "Introduction"
}
| Field | Description |
|---|---|
| type | Element type: heading, paragraph, table, list, image, caption |
| id | Unique identifier for cross-referencing |
| page number | 1-indexed page reference |
| bounding box | [left, bottom, right, top] in PDF points |
| heading level | Heading depth (1+) |
| font, font size | Typography info |
| content | Extracted text |
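The fields above are enough to build a source citation for a retrieved passage. The following sketch parses one element shaped like the sample and formats a citation string. The helper `format_citation` is a hypothetical name for this example, and the element is a trimmed copy of the sample above rather than real parser output.

```python
import json

# Trimmed copy of the sample element shown above.
element_json = """
{
  "type": "heading",
  "id": 42,
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "content": "Introduction"
}
"""


def format_citation(element: dict, source: str) -> str:
    """Build a human-readable citation from page number and bounding box."""
    left, bottom, right, top = element["bounding box"]
    page = element["page number"]
    return f"{source}, p. {page}, bbox=({left:.0f}, {bottom:.0f}, {right:.0f}, {top:.0f})"


element = json.loads(element_json)
citation = format_citation(element, "document.pdf")
print(citation)  # document.pdf, p. 1, bbox=(72, 700, 540, 730)
```

Storing the `id` next to each chunk in a vector store makes it possible to look the element up again and highlight its box in the source page.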
opendataloader_pdf.convert(
input_path="document.pdf",
output_dir="output/",
format="json,markdown,pdf",
# Image output mode: "off", "embedded" (Base64), or "external" (default)
image_output="embedded",
# Image format: "png" or "jpeg"
image_format="jpeg",
# Tagged PDF
use_struct_tree=True, # Use native PDF structure
)

PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
- Hidden text (transparent, zero-size)
- Off-page content
- Suspicious invisible layers
This is enabled by default.
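For intuition only, two of the categories above (zero-size and off-page content) can be approximated from a bounding box alone. The sketch below is a hypothetical downstream check, not OpenDataLoader's filter: the function name `looks_hidden` is made up, the default page size assumes US Letter (612 x 792 points), and transparency or invisible layers cannot be detected this way because they are not visible in the coordinates.

```python
def looks_hidden(bbox, page_width=612.0, page_height=792.0):
    """Heuristic: flag boxes with no visible area or lying entirely off the page."""
    left, bottom, right, top = bbox
    zero_size = (right - left) <= 0 or (top - bottom) <= 0
    off_page = (
        right <= 0 or top <= 0 or left >= page_width or bottom >= page_height
    )
    return zero_size or off_page


# Illustrative boxes in [left, bottom, right, top] order (PDF points).
visible = [72.0, 700.0, 540.0, 730.0]
zero_area = [100.0, 100.0, 100.0, 100.0]
outside = [700.0, 100.0, 900.0, 120.0]

print(looks_hidden(visible), looks_hidden(zero_area), looks_hidden(outside))
```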
Why it matters: The European Accessibility Act (EAA) took effect on June 28, 2025, and sets accessibility requirements for many digital products and services across the EU. As a result, more PDFs are likely to be properly tagged with semantic structure.
OpenDataLoader leverages this:
- When a PDF has structure tags, we extract the exact layout the author intended
- Headings, lists, tables, reading order — all preserved from the source
- No layout guessing is needed: structure comes directly from the tags rather than from heuristics
opendataloader_pdf.convert(
input_path="accessible_document.pdf",
use_struct_tree=True # Use native PDF structure tags
)

Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.
OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.
pip install -U langchain-opendataloader-pdf

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
loader = OpenDataLoaderPDFLoader(
    file_path=["document.pdf"],
    format="text"
)
documents = loader.load()
# Use with any LangChain pipeline
for doc in documents:
    print(doc.page_content[:100])

We continuously benchmark against real-world documents.
| Engine | Speed (s/page) | Reading Order | Table | Heading |
|---|---|---|---|---|
| opendataloader | 0.05 | **0.91** | 0.49 | 0.65 |
| docling | 0.73 | 0.90 | **0.89** | **0.80** |
| pymupdf4llm | 0.09 | 0.89 | 0.40 | 0.41 |
| markitdown | **0.04** | 0.88 | 0.00 | 0.00 |
Scores are normalized to [0, 1]. Higher is better for accuracy metrics; lower is better for speed. Bold indicates best performance.
| Use Case | Recommended Engine | Why |
|---|---|---|
| Best overall balance | opendataloader | Fast with high reading order accuracy |
| Maximum accuracy | docling | Highest scores for tables and headings, but roughly 15x slower than opendataloader (0.73 vs 0.05 s/page) |
| Speed-critical pipelines | markitdown | Fastest, but no table/heading extraction |
| PyMuPDF ecosystem | pymupdf4llm | Good balance if already using PyMuPDF |
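The speed column can be easier to compare as throughput and as a ratio against one engine. This short sketch does that arithmetic using only the seconds-per-page figures from the benchmark table above; the variable names are illustrative, and the results apply only to whatever hardware and documents the benchmark used.

```python
# Seconds per page, copied from the benchmark table above.
seconds_per_page = {
    "opendataloader": 0.05,
    "docling": 0.73,
    "pymupdf4llm": 0.09,
    "markitdown": 0.04,
}

baseline = seconds_per_page["opendataloader"]

for engine, spp in seconds_per_page.items():
    pages_per_second = 1.0 / spp  # throughput is the inverse of time per page
    relative_time = spp / baseline  # >1 means slower than opendataloader
    print(f"{engine:15s} {pages_per_second:6.1f} pages/s  {relative_time:5.1f}x")
```

By these figures, opendataloader handles about 20 pages per second, and docling takes about 14.6 times as long per page.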
See our upcoming features and priorities →
For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.
OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.
OpenDataLoader runs 100% locally on your machine, so it can be used without sending documents to a cloud service. No API calls, no data transmission: your documents never leave your environment. This makes it well suited to sensitive documents in legal, healthcare, and financial industries.
OpenDataLoader takes a different approach from many PDF parsers:
- Rule-based extraction — Deterministic output without GPU requirements
- Bounding boxes for all elements — Essential for citation systems
- XY-Cut++ reading order — Handles multi-column layouts correctly
- Built-in AI safety filters — Protects against prompt injection
- Native Tagged PDF support — Leverages accessibility metadata
This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.
We welcome contributions! See CONTRIBUTING.md for guidelines.
Found this useful? Give us a star to help others discover OpenDataLoader.
