[MS] Add OCR layer service for embedded images and PDF scans by lesyk · Pull Request #1541 · microsoft/markitdown

lesyk · 2026-01-26T18:45:18Z

This pull request introduces the new markitdown-ocr plugin, which adds LLM Vision-based OCR capabilities to MarkItDown. The plugin enables extraction of text from images embedded in PDF, DOCX, PPTX, and XLSX files using any OpenAI-compatible client, without requiring additional ML libraries or binaries.

#1344

Output for testing files:

---

## docx_complex_layout.docx

```markdown
Complex Document

|  |  |
| --- | --- |
| Feature | Status |
| Authentication | Active |
| Encryption | Enabled |

Security notice:

*[Image OCR]
NOTICE: SSL Certificate Expires 2025-12-31
[End OCR]*

docx_image_end.docx

Report

Main findings of the report.

Details and analysis.

Recommendations.

*[Image OCR]
FOOTER: Document ID: DOC-2024-001
[End OCR]*

docx_image_middle.docx

# Introduction

This is the introduction section.

We will see an image below.

*[Image OCR]
FIGURE 1: System Architecture
[End OCR]*

# Analysis

This section comes after the image.

docx_image_start.docx

Document with Image at Start

*[Image OCR]
HEADER: Company Logo - ACME Corp
[End OCR]*

This is the main content after the header image.

More text content here.

docx_multipage.docx

# Page 1 - Mixed Content

This is the first paragraph on page 1.

BEFORE IMAGE: Important content appears here.

*[Image OCR]
DOCX PAGE 1: Section Title
[End OCR]*

AFTER IMAGE: This content follows the image.

More text on page 1.

# Page 2 - Image at End

Content on page 2.

Multiple paragraphs of text.

Building up to the image...

Final paragraph before image.

*[Image OCR]
DOCX PAGE 2: Footer Note
[End OCR]*

# Page 3 - Image at Start

*[Image OCR]
DOCX PAGE 3: Header Image
[End OCR]*

Content that follows the header image.

AFTER IMAGE: This text is after the image.

docx_multiple_images.docx

Multi-Image Document

First section

*[Image OCR]
Chart 1: Revenue Growth
[End OCR]*

Second section with another image

*[Image OCR]
Chart 2: Customer Satisfaction
[End OCR]*

Conclusion

pdf_complex_layout.pdf

## Page 1

Complex Layout Document

Table:

ItemQuantity

*[Image OCR]
WARNING: Handle with care
[End OCR]*

Widget A5

pdf_image_end.pdf

## Page 1

Main Content

This is the main text content.

The image will appear at the end.

Keep reading...

*[Image OCR]
END: Contact: support@example.com
[End OCR]*

pdf_image_middle.pdf

## Page 1

Section 1: Introduction

This document contains an image in the middle.

Here is some introductory text.

*[Image OCR]
MIDDLE: Product Code: ABC-12345
[End OCR]*

Section 2: Details

This text appears AFTER the image.

pdf_image_start.pdf

## Page 1

*[Image OCR]
START: This is the first image in PDF
[End OCR]*

This is text BEFORE the image.

The image should appear above this text.

This is more content after the image.

pdf_multiple_images.pdf

## Page 1

Document with Multiple Images

*[Image OCR]
Image 1: Serial Number SN-001
[End OCR]*

Text between first and second image.

*[Image OCR]
Image 2: Model Number M-2024
[End OCR]*

Final text after all images.

pdf_scanned_invoice.pdf

## Page 1

*[Image OCR]
# INVOICE

Company: TechCorp Industries

Invoice Number: INV-2024-001

Date: January 15, 2024

BILL TO:

Acme Corporation

123 Main Street

New York, NY 10001

DESCRIPTION:

Software Development Services

Professional Consulting

Technical Support

TOTAL AMOUNT DUE: $5,000.00
[End OCR]*

pdf_scanned_meeting_minutes.pdf

## Page 1

*[Image OCR]
# MEETING MINUTES

Date: March 10, 2024

Attendees: John Smith, Jane Doe, Bob Johnson

AGENDA ITEMS

1. Project Status Update

- Phase 1 completed successfully

- Phase 2 on track for Q2 delivery

2. Budget Review

- Current spend: 75% of allocated budget

- Forecast: Within budget

3. Action Items

- John: Finalize requirements document
[End OCR]*

pdf_scanned_minimal.pdf

## Page 1

*[Image OCR]
# NOTICE

This is a minimal test document

with just a few lines of text.

It should still be processed correctly.
[End OCR]*

pdf_scanned_report.pdf

## Page 1

*[Image OCR]
# TECHNICAL REPORT

# Page 1

EXECUTIVE SUMMARY

This document presents the findings of our

technical analysis conducted in Q1 2024.

Key highlights include:

- System performance improvements

- Security enhancements

- User experience updates

The following pages detail our methodology

and recommendations.
[End OCR]*

## Page 2

*[Image OCR]
# TECHNICAL REPORT

# Page 2

METHODOLOGY

Our analysis involved three phases:

1. Data Collection

    Gathered metrics from production systems

    over a 90-day period.

2. Performance Analysis

    Identified bottlenecks and optimization

    opportunities.

3. Security Review

    Conducted vulnerability assessment and
[End OCR]*

## Page 3

*[Image OCR]
# TECHNICAL REPORT

# Page 3

RECOMMENDATIONS

Based on our findings, we recommend:

1. Implement caching layer to improve

    response times by 40%.

2. Upgrade authentication system to

    support multi-factor authentication.

3. Optimize database queries to reduce

    server load by 30%.

CONCLUSION
[End OCR]*

pdf_scanned_sales_report.pdf

## Page 1

*[Image OCR]
# QUARTERLY SALES REPORT

Q1 2024 Performance Summary

REGIONAL BREAKDOWN

Region        Revenue        Growth
North America  $2.5M         +15%
Europe        $1.8M         +22%
Asia Pacific  $3.2M         +35%
Latin America $0.9M         +12%

TOTAL         $8.4M         +23%

Top performing products:

- Product A: $3.1M

- Product B: $2.7M
[End OCR]*

pptx_complex_layout.pptx

\n\n<!-- Slide number: 1 -->\n# Product Comparison\n\nOur products lead the market\n
*[Image OCR]
Market Share: 35%
[End OCR]*

pptx_image_end.pptx

\n\n<!-- Slide number: 1 -->\n# Presentation\n\n\n\n<!-- Slide number: 2 -->\n# Thank You\n\n
*[Image OCR]
Contact: info@techcorp.com
[End OCR]*

pptx_image_middle.pptx

\n\n<!-- Slide number: 1 -->\n# Introduction\n\n\n\n<!-- Slide number: 2 -->\n# Architecture\n\n
*[Image OCR]
Diagram: System Components
[End OCR]*\n\n<!-- Slide number: 3 -->\n# Conclusion\n\n

pptx_image_start.pptx

\n\n<!-- Slide number: 1 -->\n# Welcome\n\n
*[Image OCR]
Company: TechCorp Inc.
[End OCR]*

pptx_multiple_images.pptx

\n\n<!-- Slide number: 1 -->\n# \n
*[Image OCR]
Before: 50% Efficiency
[End OCR]*

*[Image OCR]
After: 95% Efficiency
[End OCR]*

xlsx_complex_layout.xlsx

## Complex Report

| Annual Report 2024 | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Month | Sales |
| Jan | 1000 |
| Feb | 1200 |
| NaN | NaN |
| Total | 2200 |

### Images in this sheet:

*[Image OCR]
Figure 1: Monthly Trend
[End OCR]*

*[Image OCR]
Figure 2: Year Overview
[End OCR]*

## Customers

| Customer Metrics | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| New Customers | 250 |
| Retention Rate | 92% |

### Images in this sheet:

*[Image OCR]
Customer Growth: +25% Year-over-Year
[End OCR]*

## Regions

| Regional Breakdown | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Region | Revenue |
| North | $800K |
| South | $600K |

### Images in this sheet:

*[Image OCR]
Regional Map: Top Perform
[End OCR]*

xlsx_image_end.xlsx

## Sheet

| Financial Summary | Unnamed: 1 |
| --- | --- |
| Total Revenue | $500,000 |
| Total Expenses | $300,000 |
| Net Profit | $200,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Signature: | NaN |

### Images in this sheet:

*[Image OCR]
Approved by: John Doe, CFO
[End OCR]*

## Budget

| Budget Allocation | Unnamed: 1 |
| --- | --- |
| Marketing | $100,000 |
| R&D | $150,000 |
| Operations | $50,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Approved: | NaN |

### Images in this sheet:

*[Image OCR]
viewed by: Jane Smith, CTO
[End OCR]*

xlsx_image_middle.xlsx

## Revenue

| Q1 Report | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Revenue | $50,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Profit Margin | 40% |

### Images in this sheet:

*[Image OCR]
Growth Trend: +15%
[End OCR]*

## Expenses

| Expense Breakdown | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Expenses | $30,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Savings | $5,000 |

### Images in this sheet:

*[Image OCR]
Cost Analysis: Optimized
[End OCR]*

xlsx_image_start.xlsx

## Sales Q1

| Product | Sales |
| --- | --- |
| Widget A | 100 |
| Widget B | 150 |

### Images in this sheet:

*[Image OCR]
Q1 Sales Chart
[End OCR]*

## Forecast Q2

| Projected Sales | Unnamed: 1 |
| --- | --- |
| Widget A | 120 |
| Widget B | 180 |

### Images in this sheet:

*[Image OCR]
Q2 Forecast: +20% Growth
[End OCR]*

xlsx_multiple_images.xlsx

## Overview

| Dashboard |
| --- |
| Status: Active |
| NaN |
| NaN |
| NaN |
| NaN |
| Performance Summary |

### Images in this sheet:

*[Image OCR]
KPI: 95% Success Rate
[End OCR]*

*[Image OCR]
Uptime: 99.9%
[End OCR]*

## Details

| Detailed Metrics |
| --- |
| System Health |

### Images in this sheet:

*[Image OCR]
Metric: Response Time 50ms
[End OCR]*

## Summary

| Quarter Summary |
| --- |
| Overall Performance |

### Images in this sheet:

*[Image OCR]
Q1 Results: Exceeded Goals
[End OCR]*

</details>

- Created HTML file with multiple images for testing OCR extraction. - Added several PDF files with different layouts and image placements to validate OCR functionality. - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing. - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction. - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.

- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency. - Implement detailed validation for OCR text positioning relative to surrounding text in test cases. - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present. - Improve error handling and logging for better debugging during OCR extraction.

…t tests

…accordingly

…cument types

…nctionality across DOCX, PDF, PPTX, and XLSX converters

…ile URI handling

- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end. - Implemented tests for complex layouts, multi-page documents, and documents with multiple images. - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction. - Added expected OCR results for validation against ground truth. - Included tests for scanned documents to verify OCR fallback mechanisms.

…kitdown into u/vilesyk/inline_image

… and OCR format consistency - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed. - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage. - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.

… for multipage PDFs

…converter and test files

…kitdown into u/vilesyk/inline_image

lesyk and others added 4 commits January 26, 2026 19:44

Merge branch 'main' into u/vilesyk/inline_image

2e83594

Add support for scanned PDFs with full-page OCR fallback and implemen…

f4fab9b

…t tests

lesyk marked this pull request as ready for review January 27, 2026 10:21

zashed approved these changes Jan 27, 2026

View reviewed changes

lesyk changed the title ~~Add OCR test data and implement tests for various document formats~~ Jan 27, 2026

lesyk changed the title ~~Add OCR service for embedded images and PDF scans~~ Jan 27, 2026

lesyk changed the title ~~Add OCR layer service for embedded images and PDF scans~~ Jan 27, 2026

lesyk and others added 19 commits February 12, 2026 09:55

Bump version to 0.1.6b1 in __about__.py

40e0be5

Refactor OCR services to support LLM Vision, update README and tests …

9daaeff

…accordingly

Add OCR-enabled converters and ensure consistent OCR format across do…

bd9c98d

…cument types

Refactor converters to improve import organization and enhance OCR fu…

6732692

…nctionality across DOCX, PDF, PPTX, and XLSX converters

Refactor exception imports for consistency across converters and tests

678ea75

Fix OCR tests to match MockOCRService output and fix cross-platform f…

dfd57e0

…ile URI handling

Merge origin/main into u/vilesyk/inline_image

550243a

Bump version to 0.1.6b1 in __about__.py

222ec95

Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing

ce21005

Merge branch 'u/vilesyk/inline_image' of https://github.com/lesyk/mar…

0816de8

…kitdown into u/vilesyk/inline_image

Refactor OCR processing in PdfConverterWithOCR and enhance unit tests…

f7ee5ef

… for multipage PDFs

Revert

fefc3b6

Revert

1ef0d50

Update REDMEs

9d485bd

Merge branch 'main' into u/vilesyk/inline_image

207e58c

Refactor import statements for consistency and improve formatting in …

b8e28c0

…converter and test files

Merge branch 'u/vilesyk/inline_image' of https://github.com/lesyk/mar…

aff82a3

…kitdown into u/vilesyk/inline_image

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MS] Add OCR layer service for embedded images and PDF scans#1541

[MS] Add OCR layer service for embedded images and PDF scans#1541
lesyk wants to merge 23 commits intomicrosoft:mainfrom
lesyk:u/vilesyk/inline_image

lesyk commented Jan 26, 2026 •

edited

Loading

Labels

2 participants

Conversation

lesyk commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!