Skip to content

[MS] Add OCR layer service for embedded images and PDF scans#1541

Open
lesyk wants to merge 23 commits intomicrosoft:mainfrom
lesyk:u/vilesyk/inline_image
Open

[MS] Add OCR layer service for embedded images and PDF scans#1541
lesyk wants to merge 23 commits intomicrosoft:mainfrom
lesyk:u/vilesyk/inline_image

Conversation

@lesyk
Copy link
Contributor

@lesyk lesyk commented Jan 26, 2026

This pull request introduces the new markitdown-ocr plugin, which adds LLM Vision-based OCR capabilities to MarkItDown. The plugin enables extraction of text from images embedded in PDF, DOCX, PPTX, and XLSX files using any OpenAI-compatible client, without requiring additional ML libraries or binaries.

#1344

Output for testing files:
---

## docx_complex_layout.docx

```markdown
Complex Document

|  |  |
| --- | --- |
| Feature | Status |
| Authentication | Active |
| Encryption | Enabled |

Security notice:

*[Image OCR]
NOTICE: SSL Certificate Expires 2025-12-31
[End OCR]*

docx_image_end.docx

Report

Main findings of the report.

Details and analysis.

Recommendations.

*[Image OCR]
FOOTER: Document ID: DOC-2024-001
[End OCR]*

docx_image_middle.docx

# Introduction

This is the introduction section.

We will see an image below.

*[Image OCR]
FIGURE 1: System Architecture
[End OCR]*

# Analysis

This section comes after the image.

docx_image_start.docx

Document with Image at Start

*[Image OCR]
HEADER: Company Logo - ACME Corp
[End OCR]*

This is the main content after the header image.

More text content here.

docx_multipage.docx

# Page 1 - Mixed Content

This is the first paragraph on page 1.

BEFORE IMAGE: Important content appears here.

*[Image OCR]
DOCX PAGE 1: Section Title
[End OCR]*

AFTER IMAGE: This content follows the image.

More text on page 1.

# Page 2 - Image at End

Content on page 2.

Multiple paragraphs of text.

Building up to the image...

Final paragraph before image.

*[Image OCR]
DOCX PAGE 2: Footer Note
[End OCR]*

# Page 3 - Image at Start

*[Image OCR]
DOCX PAGE 3: Header Image
[End OCR]*

Content that follows the header image.

AFTER IMAGE: This text is after the image.

docx_multiple_images.docx

Multi-Image Document

First section

*[Image OCR]
Chart 1: Revenue Growth
[End OCR]*

Second section with another image

*[Image OCR]
Chart 2: Customer Satisfaction
[End OCR]*

Conclusion

pdf_complex_layout.pdf

## Page 1

Complex Layout Document

Table:

ItemQuantity

*[Image OCR]
WARNING: Handle with care
[End OCR]*

Widget A5

pdf_image_end.pdf

## Page 1

Main Content

This is the main text content.

The image will appear at the end.

Keep reading...

*[Image OCR]
END: Contact: support@example.com
[End OCR]*

pdf_image_middle.pdf

## Page 1

Section 1: Introduction

This document contains an image in the middle.

Here is some introductory text.

*[Image OCR]
MIDDLE: Product Code: ABC-12345
[End OCR]*

Section 2: Details

This text appears AFTER the image.

pdf_image_start.pdf

## Page 1

*[Image OCR]
START: This is the first image in PDF
[End OCR]*

This is text BEFORE the image.

The image should appear above this text.

This is more content after the image.

pdf_multiple_images.pdf

## Page 1

Document with Multiple Images

*[Image OCR]
Image 1: Serial Number SN-001
[End OCR]*

Text between first and second image.

*[Image OCR]
Image 2: Model Number M-2024
[End OCR]*

Final text after all images.

pdf_scanned_invoice.pdf

## Page 1

*[Image OCR]
# INVOICE

Company: TechCorp Industries

Invoice Number: INV-2024-001

Date: January 15, 2024

BILL TO:

Acme Corporation

123 Main Street

New York, NY 10001

DESCRIPTION:

Software Development Services

Professional Consulting

Technical Support

TOTAL AMOUNT DUE: $5,000.00
[End OCR]*

pdf_scanned_meeting_minutes.pdf

## Page 1

*[Image OCR]
# MEETING MINUTES

Date: March 10, 2024

Attendees: John Smith, Jane Doe, Bob Johnson

AGENDA ITEMS

1. Project Status Update

- Phase 1 completed successfully

- Phase 2 on track for Q2 delivery

2. Budget Review

- Current spend: 75% of allocated budget

- Forecast: Within budget

3. Action Items

- John: Finalize requirements document
[End OCR]*

pdf_scanned_minimal.pdf

## Page 1

*[Image OCR]
# NOTICE

This is a minimal test document

with just a few lines of text.

It should still be processed correctly.
[End OCR]*

pdf_scanned_report.pdf

## Page 1

*[Image OCR]
# TECHNICAL REPORT

# Page 1

EXECUTIVE SUMMARY

This document presents the findings of our

technical analysis conducted in Q1 2024.

Key highlights include:

- System performance improvements

- Security enhancements

- User experience updates

The following pages detail our methodology

and recommendations.
[End OCR]*

## Page 2

*[Image OCR]
# TECHNICAL REPORT

# Page 2

METHODOLOGY

Our analysis involved three phases:

1. Data Collection

    Gathered metrics from production systems

    over a 90-day period.

2. Performance Analysis

    Identified bottlenecks and optimization

    opportunities.

3. Security Review

    Conducted vulnerability assessment and
[End OCR]*

## Page 3

*[Image OCR]
# TECHNICAL REPORT

# Page 3

RECOMMENDATIONS

Based on our findings, we recommend:

1. Implement caching layer to improve

    response times by 40%.

2. Upgrade authentication system to

    support multi-factor authentication.

3. Optimize database queries to reduce

    server load by 30%.

CONCLUSION
[End OCR]*

pdf_scanned_sales_report.pdf

## Page 1

*[Image OCR]
# QUARTERLY SALES REPORT

Q1 2024 Performance Summary

REGIONAL BREAKDOWN

Region        Revenue        Growth
North America  $2.5M         +15%
Europe        $1.8M         +22%
Asia Pacific  $3.2M         +35%
Latin America $0.9M         +12%

TOTAL         $8.4M         +23%

Top performing products:

- Product A: $3.1M

- Product B: $2.7M
[End OCR]*

pptx_complex_layout.pptx

\n\n<!-- Slide number: 1 -->\n# Product Comparison\n\nOur products lead the market\n
*[Image OCR]
Market Share: 35%
[End OCR]*

pptx_image_end.pptx

\n\n<!-- Slide number: 1 -->\n# Presentation\n\n\n\n<!-- Slide number: 2 -->\n# Thank You\n\n
*[Image OCR]
Contact: info@techcorp.com
[End OCR]*

pptx_image_middle.pptx

\n\n<!-- Slide number: 1 -->\n# Introduction\n\n\n\n<!-- Slide number: 2 -->\n# Architecture\n\n
*[Image OCR]
Diagram: System Components
[End OCR]*\n\n<!-- Slide number: 3 -->\n# Conclusion\n\n

pptx_image_start.pptx

\n\n<!-- Slide number: 1 -->\n# Welcome\n\n
*[Image OCR]
Company: TechCorp Inc.
[End OCR]*

pptx_multiple_images.pptx

\n\n<!-- Slide number: 1 -->\n# \n
*[Image OCR]
Before: 50% Efficiency
[End OCR]*

*[Image OCR]
After: 95% Efficiency
[End OCR]*

xlsx_complex_layout.xlsx

## Complex Report

| Annual Report 2024 | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Month | Sales |
| Jan | 1000 |
| Feb | 1200 |
| NaN | NaN |
| Total | 2200 |

### Images in this sheet:

*[Image OCR]
Figure 1: Monthly Trend
[End OCR]*

*[Image OCR]
Figure 2: Year Overview
[End OCR]*

## Customers

| Customer Metrics | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| New Customers | 250 |
| Retention Rate | 92% |

### Images in this sheet:

*[Image OCR]
Customer Growth: +25% Year-over-Year
[End OCR]*

## Regions

| Regional Breakdown | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Region | Revenue |
| North | $800K |
| South | $600K |

### Images in this sheet:

*[Image OCR]
Regional Map: Top Perform
[End OCR]*

xlsx_image_end.xlsx

## Sheet

| Financial Summary | Unnamed: 1 |
| --- | --- |
| Total Revenue | $500,000 |
| Total Expenses | $300,000 |
| Net Profit | $200,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Signature: | NaN |

### Images in this sheet:

*[Image OCR]
Approved by: John Doe, CFO
[End OCR]*

## Budget

| Budget Allocation | Unnamed: 1 |
| --- | --- |
| Marketing | $100,000 |
| R&D | $150,000 |
| Operations | $50,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Approved: | NaN |

### Images in this sheet:

*[Image OCR]
viewed by: Jane Smith, CTO
[End OCR]*

xlsx_image_middle.xlsx

## Revenue

| Q1 Report | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Revenue | $50,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Profit Margin | 40% |

### Images in this sheet:

*[Image OCR]
Growth Trend: +15%
[End OCR]*

## Expenses

| Expense Breakdown | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Expenses | $30,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Savings | $5,000 |

### Images in this sheet:

*[Image OCR]
Cost Analysis: Optimized
[End OCR]*

xlsx_image_start.xlsx

## Sales Q1

| Product | Sales |
| --- | --- |
| Widget A | 100 |
| Widget B | 150 |

### Images in this sheet:

*[Image OCR]
Q1 Sales Chart
[End OCR]*

## Forecast Q2

| Projected Sales | Unnamed: 1 |
| --- | --- |
| Widget A | 120 |
| Widget B | 180 |

### Images in this sheet:

*[Image OCR]
Q2 Forecast: +20% Growth
[End OCR]*

xlsx_multiple_images.xlsx

## Overview

| Dashboard |
| --- |
| Status: Active |
| NaN |
| NaN |
| NaN |
| NaN |
| Performance Summary |

### Images in this sheet:

*[Image OCR]
KPI: 95% Success Rate
[End OCR]*

*[Image OCR]
Uptime: 99.9%
[End OCR]*

## Details

| Detailed Metrics |
| --- |
| System Health |

### Images in this sheet:

*[Image OCR]
Metric: Response Time 50ms
[End OCR]*

## Summary

| Quarter Summary |
| --- |
| Overall Performance |

### Images in this sheet:

*[Image OCR]
Q1 Results: Exceeded Goals
[End OCR]*

</details>
lesyk and others added 4 commits January 26, 2026 19:44
- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.
@lesyk lesyk marked this pull request as ready for review January 27, 2026 10:21
@lesyk lesyk changed the title Add OCR test data and implement tests for various document formats Jan 27, 2026
@lesyk lesyk changed the title Add OCR service for embedded images and PDF scans Jan 27, 2026
@lesyk lesyk changed the title Add OCR layer service for embedded images and PDF scans Jan 27, 2026
lesyk and others added 19 commits February 12, 2026 09:55
…nctionality across DOCX, PDF, PPTX, and XLSX converters
- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.
… and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants