PDF OCR Pipeline

PDF OCR Pipeline is a command-line and programmatic tool for extracting text from PDF documents using OCR (Optical Character Recognition), with optional AI-powered analysis and summarization.

Features

- Process single or multiple PDF files in one command
- Configurable OCR resolution (DPI)
- Support for multiple languages via Tesseract
- JSON output for easy integration with other tools
- AI-powered text analysis and summarization using OpenAI's GPT-4o

Requirements

Python 3.6+
External dependencies:
- pdftoppm (typically from poppler-utils)
- tesseract (Tesseract OCR engine)

Installation

From Source

Clone this repository:

git clone https://github.com/domfahey/pdf-ocr-pipeline.git
cd pdf-ocr-pipeline

Install the project environment and dependencies using uv:
```
# Install uv if not already available
pip install uv
# Sync the project (creates .venv and installs dependencies)
uv sync
```
This will set up a virtual environment, install the package in editable mode, and install all runtime dependencies.

Using pip (when published)

# Install the published package with pip
pip install pdf-ocr-pipeline

# Or, in a uv-managed project, add it as a dependency (uv add also syncs the environment)
uv add pdf-ocr-pipeline

External Dependencies

Ensure the required external tools are installed:

Ubuntu/Debian:

sudo apt-get install poppler-utils tesseract-ocr

macOS:

brew install poppler tesseract

Windows:

Install Poppler for Windows and Tesseract for Windows. Ensure both are added to your PATH.
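
Before running the pipeline, it can help to confirm that both tools are actually reachable on your PATH. The following is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical and not part of this package.

```python
import shutil

# Executables the pipeline depends on (names taken from the Requirements section).
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Missing external tools:", ", ".join(missing))
else:
    print("All external tools found.")
```

If a tool is reported missing right after installation, opening a new terminal usually picks up the updated PATH.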

Usage

Command Line

After installation, you can run the pdf-ocr command directly or via uv:

Basic usage:

# Using the `pdf-ocr` command
pdf-ocr path/to/document.pdf > result.json

# Using uv to run the CLI
uv run pdf-ocr path/to/document.pdf > result.json

Process multiple files:

# Using the `pdf-ocr` command
pdf-ocr file1.pdf file2.pdf file3.pdf > results.json

# Using uv to run the CLI
uv run pdf-ocr file1.pdf file2.pdf file3.pdf > results.json

Options

- --dpi DPI: Set the resolution for OCR processing (default: 300)
- -l, --lang LANGUAGE: Set the language for Tesseract (default: eng)
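
As a rough guide to what the DPI value means, the sketch below computes the pixel dimensions of a rendered page. It is plain arithmetic, not code from this package, and `page_pixels` is a hypothetical helper name.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size (in inches) rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter is 8.5 x 11 inches.
print(page_pixels(8.5, 11, 300))  # (2550, 3300)
print(page_pixels(8.5, 11, 600))  # (5100, 6600)
```

Doubling the DPI quadruples the number of pixels per page, so higher values will generally cost more time and memory in exchange for sharper input to the OCR engine.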

Example with options:

pdf-ocr --dpi 600 -l fra path/to/french_document.pdf > result.json
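
If you want to drive the CLI from a Python script rather than a shell, one possible approach is to build the same argument list and parse the JSON written to stdout. This is a sketch that assumes `pdf-ocr` is on your PATH; `build_ocr_command` and `run_ocr` are hypothetical helper names, not part of the package API.

```python
import json
import subprocess


def build_ocr_command(pdf_paths, dpi=300, lang="eng"):
    """Build the argument list for the pdf-ocr CLI using the documented flags."""
    return ["pdf-ocr", "--dpi", str(dpi), "-l", lang, *map(str, pdf_paths)]


def run_ocr(pdf_paths, dpi=300, lang="eng"):
    """Run pdf-ocr and parse the JSON it writes to stdout."""
    completed = subprocess.run(
        build_ocr_command(pdf_paths, dpi, lang),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(completed.stdout)


print(build_ocr_command(["path/to/french_document.pdf"], dpi=600, lang="fra"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.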

AI-Powered Text Analysis

The package includes the pdf-ocr-summarize command, which analyzes OCR text using OpenAI's GPT-4o model. It reads the JSON produced by pdf-ocr from standard input:

# Process a PDF file and analyze the text
pdf-ocr document.pdf | pdf-ocr-summarize --pretty > analysis.json

# Use a custom prompt for the analysis
pdf-ocr document.pdf | pdf-ocr-summarize --prompt "Extract all dates and names mentioned in the text" > analysis.json

Customizing AI Analysis

You can tailor the AI analysis by providing custom prompts for different types of documents:

- Legal documents: --prompt "Extract all legal entities, contract provisions, and obligations"
- Academic papers: --prompt "Summarize the methodology, findings and conclusions"
- Financial reports: --prompt "Extract financial figures, percentages, and trends"
- Medical documents: --prompt "Extract diagnoses, treatments, and medications"
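
The prompts above can be kept as named presets in a script. The sketch below is one way to do that; it assumes both `pdf-ocr` and `pdf-ocr-summarize` are on your PATH, and the names `PROMPTS`, `summarize_command`, and `analyze` are hypothetical.

```python
import json
import subprocess

# Prompt presets copied from the list above; the dictionary keys are arbitrary labels.
PROMPTS = {
    "legal": "Extract all legal entities, contract provisions, and obligations",
    "academic": "Summarize the methodology, findings and conclusions",
    "financial": "Extract financial figures, percentages, and trends",
    "medical": "Extract diagnoses, treatments, and medications",
}


def summarize_command(doc_type):
    """Build the pdf-ocr-summarize argument list for a preset document type."""
    return ["pdf-ocr-summarize", "--prompt", PROMPTS[doc_type]]


def analyze(pdf_path, doc_type):
    """Equivalent of `pdf-ocr FILE | pdf-ocr-summarize --prompt ...`."""
    ocr = subprocess.run(["pdf-ocr", str(pdf_path)], stdout=subprocess.PIPE, check=True)
    summary = subprocess.run(
        summarize_command(doc_type),
        input=ocr.stdout,  # feed the OCR JSON to the summarizer's stdin
        stdout=subprocess.PIPE,
        check=True,
    )
    return json.loads(summary.stdout)
```

An unknown `doc_type` raises `KeyError`, which surfaces typos in preset names early.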

AI Configuration

- API Key: Set the OPENAI_API_KEY environment variable to your OpenAI API key
- Model: Uses GPT-4o by default
- Output Format: Returns structured JSON with the analysis organized into sections
- Verbose Mode: Use -v for detailed processing information
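
When wrapping the analysis step in your own script, a missing API key is easier to diagnose if it is checked up front. A minimal sketch, where `require_api_key` is a hypothetical helper and not part of the package:

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise with a clear message."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running pdf-ocr-summarize."
        )
    return key
```

The optional `env` argument exists only so the check can be exercised with a plain dictionary in tests.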

See the API reference (docs/api.md) for details on library functions and Project Organization (docs/project_organization.md) for an overview of the code structure.

Programmatic Usage

You can also use the pipeline directly from Python:

from pdf_ocr_pipeline import process_pdf
from pdf_ocr_pipeline.types import ProcessSettings

# Pure OCR
ocr = process_pdf("invoice.pdf", settings=ProcessSettings())

# OCR + segmentation via GPT
segments = process_pdf(
    "closing_package.pdf",
    settings=ProcessSettings(analyze=True),
)

The DPI and language can also be set explicitly through ProcessSettings:

from pathlib import Path
from pdf_ocr_pipeline import process_pdf
from pdf_ocr_pipeline.types import ProcessSettings

# Pure OCR with explicit DPI and language settings
pdf_path = Path('document.pdf')
settings = ProcessSettings(dpi=300, lang='eng')
ocr_result = process_pdf(pdf_path, settings=settings)
print(ocr_result)

See the examples/ directory for more in-depth examples.

Example Scripts

The repository includes example scripts to demonstrate common workflows:

Batch Processing with AI Analysis
```
./examples/ocr_and_analyze.sh document.pdf "Extract key points and entities from this document" output.json
```
This script performs OCR on a PDF, sends the extracted text to OpenAI's GPT-4o for analysis, and saves the structured results to a JSON file. It is suited to automating document processing workflows.

Directory Processing
```
./examples/process_dir.sh /path/to/pdf/directory [dpi] [language]
```
This script processes all PDF files in a directory and saves individual JSON output files to an ocr_output subdirectory. Ideal for batch processing large collections of documents.
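
For readers who prefer Python to shell, the file discovery step can be sketched as follows. This is not the script's actual implementation: the one-JSON-per-PDF naming scheme and the helper name `plan_outputs` are assumptions made for illustration.

```python
from pathlib import Path


def plan_outputs(directory, output_name="ocr_output"):
    """Pair each PDF in `directory` with a JSON path inside the output subdirectory."""
    directory = Path(directory)
    out_dir = directory / output_name
    return [
        (pdf, out_dir / (pdf.stem + ".json"))
        for pdf in sorted(directory.glob("*.pdf"))
    ]


for pdf, target in plan_outputs("."):
    print(pdf.name, "->", target)
```

Each pair can then be handed to the CLI or to process_pdf, writing the result to the target path.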

Programmatic Integration

# Adapted from examples/programmatic_usage.py
from pathlib import Path
from pdf_ocr_pipeline import process_pdf
from pdf_ocr_pipeline.types import ProcessSettings

pdf_path = Path('document.pdf')

# Pure OCR
ocr_result = process_pdf(pdf_path)
print(ocr_result)

# OCR + segmentation via GPT (settings passed as in Programmatic Usage above)
segmentation_result = process_pdf(pdf_path, settings=ProcessSettings(analyze=True))
print(segmentation_result)

This example shows how to integrate the PDF OCR Pipeline into your own Python applications, using the high-level process_pdf function for both OCR and optional AI analysis.

Output Format

The tool outputs JSON to stdout with the following structure:

For a single file:

[
  {
    "file": "document.pdf",
    "ocr_text": "The extracted text content..."
  }
]

For multiple files:

[
  {
    "file": "file1.pdf",
    "ocr_text": "The extracted text from file1..."
  },
  {
    "file": "file2.pdf",
    "ocr_text": "The extracted text from file2..."
  }
]
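
Because the output is always a JSON array, downstream code can treat the single-file and multi-file cases the same way. A small sketch that indexes the text by file name (`text_by_file` is a hypothetical helper):

```python
import json


def text_by_file(raw_json):
    """Map each "file" value to its "ocr_text" from pdf-ocr's JSON output."""
    return {entry["file"]: entry["ocr_text"] for entry in json.loads(raw_json)}


sample = (
    '[{"file": "file1.pdf", "ocr_text": "Hello"},'
    ' {"file": "file2.pdf", "ocr_text": "World"}]'
)
print(text_by_file(sample)["file2.pdf"])  # World
```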

AI Analysis Output Format

When using the GPT-4o analysis feature, the output format is:

[
  {
    "file": "document.pdf",
    "analysis": {
      "summary": "Brief summary of the document content",
      "entities": [
        {"name": "John Smith", "type": "person"},
        {"date": "2023-04-15", "type": "date"}
      ],
      "key_points": [
        "First important point from the document",
        "Second important point from the document"
      ],
      "tables": [
        {
          "header": ["Column1", "Column2"],
          "rows": [
            ["Value1", "Value2"],
            ["Value3", "Value4"]
          ]
        }
      ]
    }
  }
]

Note: The exact structure of the analysis field may vary depending on the prompt used and the content of the document.
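
Since the analysis structure can vary, consumers may want to read it defensively instead of assuming every key is present. A minimal sketch, using the field names from the example above and a hypothetical helper name:

```python
def summarize_entry(entry):
    """Pull common fields out of one result, tolerating missing keys."""
    analysis = entry.get("analysis") or {}
    return {
        "file": entry.get("file", "<unknown>"),
        "summary": analysis.get("summary", ""),
        "key_points": list(analysis.get("key_points", [])),
        "entity_count": len(analysis.get("entities", [])),
    }


result = {"file": "document.pdf", "analysis": {"summary": "Brief summary"}}
print(summarize_entry(result))
```

Missing sections fall back to empty values, so a custom prompt that returns only a summary does not raise `KeyError`.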

Testing

To run the test suite:

python -m unittest discover tests

The tests use mock objects to avoid dependencies on external tools, so you can run them even without installing pdftoppm or tesseract.
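
To illustrate the mocking approach, here is a sketch of a test that replaces `subprocess.run` so no external binary is executed. It is not taken from the project's test suite; `tesseract_version` is a hypothetical helper written for this example.

```python
import subprocess
import unittest
from unittest import mock


def tesseract_version():
    """Hypothetical helper: return the first line printed by `tesseract --version`."""
    completed = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        check=True,
    )
    return completed.stdout.decode().splitlines()[0]


class TesseractVersionTest(unittest.TestCase):
    def test_reads_first_line_without_running_tesseract(self):
        # The fake result stands in for a real CompletedProcess.
        fake = mock.Mock(stdout=b"tesseract 5.0.0\n leptonica-1.82.0\n")
        with mock.patch("subprocess.run", return_value=fake) as run:
            self.assertEqual(tesseract_version(), "tesseract 5.0.0")
        run.assert_called_once()
```

Saved in a file under tests/, a test like this is picked up by the unittest discovery command shown above.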

Contributing

Contributions are welcome! To submit a Pull Request:

- Fork the repository
- Create your feature branch: git checkout -b feature/amazing-feature
- Format your code: black .
- Run linters: flake8 and mypy
- Commit your changes: git commit -m 'Add amazing feature'
- Push to the branch: git push origin feature/amazing-feature
- Open a Pull Request

Documentation

- API Reference (docs/api.md): Library and CLI reference
- Project Organization (docs/project_organization.md): Code structure and design
- Changelog (CHANGELOG.md): History of changes
- Contributing (CONTRIBUTING.md): Guidelines for contributing

Project Structure

pdf-ocr-pipeline/
├── CHANGELOG.md           # History of changes to the project
├── CLAUDE.md              # Guidelines for Claude AI when working with this code
├── CONTRIBUTING.md        # Guidelines for contributing to the project
├── LICENSE                # MIT License
├── Makefile               # Development task automation
├── README.md              # This documentation
├── bin/                   # Executable scripts
│   ├── pdf-ocr            # OCR command-line script
│   └── summarize_text.py  # Text analysis command-line script
├── docs/                  # Documentation
│   ├── api.md             # API reference
│   └── project_organization.md # Project structure and design
├── examples/              # Example scripts and usage patterns
│   ├── __init__.py        # Package indicator
│   ├── ocr_and_analyze.sh # Combined OCR and analysis script
│   ├── process_dir.sh     # Directory processing script
│   └── programmatic_usage.py # Example of programmatic usage
├── pdf-ocr                # CLI entry point
├── pyproject.toml         # Modern Python project configuration
├── requirements-dev.txt   # Development dependencies
├── requirements.lock      # Locked dependencies
├── setup.cfg              # Configuration for development tools
├── setup.py               # Package installation configuration
├── src/                   # Source code
│   └── pdf_ocr_pipeline/  # Main package
│       ├── __init__.py    # Package initialization
│       ├── __main__.py    # Entry point for running as a module
│       ├── cli.py         # OCR command-line interface
│       ├── ocr.py         # Core OCR functionality
│       └── summarize.py   # AI text analysis functionality
├── tests/                 # Unit tests
│   ├── __init__.py        # Package indicator
│   ├── test_cli.py        # Tests for CLI functionality
│   ├── test_ocr.py        # Tests for OCR functionality
│   ├── test_pipeline.py   # Integration tests for the full pipeline
│   └── test_summarize.py  # Tests for AI summarization
└── tox.ini                # Test automation configuration

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

- Poppler for PDF rendering
- Tesseract OCR for text recognition
