PDF OCR Pipeline is a command-line and programmatic tool to extract text from PDF documents using OCR (Optical Character Recognition), with optional AI‑powered analysis and summarization.
- Process single or multiple PDF files in one command
- Configurable OCR resolution (DPI)
- Support for multiple languages via Tesseract
- JSON output for easy integration with other tools
- AI-powered text analysis and summarization using OpenAI's GPT-4o
- Features
- Requirements
- Installation
- Usage
- AI‑Powered Analysis
- Programmatic Usage
- Output Format
- Testing
- Contributing
- Documentation
- Project Structure
- License
- Acknowledgements
- Python 3.6+
- External dependencies:
  - pdftoppm (typically from poppler-utils)
  - tesseract (Tesseract OCR engine)
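As a rough illustration of why both tools are needed, the sketch below chains them the way OCR pipelines of this kind commonly do: pdftoppm renders each page to an image at a given DPI, and tesseract reads text from each image. The helper names `build_ocr_commands` and `ocr_pdf` are hypothetical and not part of this package, and the package's actual internals may differ.

```python
import subprocess
import tempfile
from pathlib import Path


def build_ocr_commands(pdf_path, work_dir, dpi=300, lang="eng"):
    """Build the pdftoppm command and a factory for per-page tesseract commands."""
    pdf_path = Path(pdf_path)
    prefix = Path(work_dir) / pdf_path.stem
    # pdftoppm -r <dpi> -png <pdf> <prefix> writes one PNG per page, named from the prefix.
    render_cmd = ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]

    def recognize_cmd(image_path):
        # The "stdout" output base makes tesseract print the text instead of writing a file.
        return ["tesseract", str(image_path), "stdout", "-l", lang]

    return render_cmd, recognize_cmd


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Render a PDF to page images, OCR each page, and return the joined text."""
    with tempfile.TemporaryDirectory() as tmp:
        render_cmd, recognize_cmd = build_ocr_commands(pdf_path, tmp, dpi, lang)
        subprocess.run(render_cmd, check=True)
        pages = []
        for image in sorted(Path(tmp).glob("*.png")):
            result = subprocess.run(
                recognize_cmd(image), check=True, capture_output=True, text=True
            )
            pages.append(result.stdout)
    return "\n".join(pages)
```

Calling `ocr_pdf("document.pdf")` requires both tools on your PATH; `build_ocr_commands` can be inspected without them.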
- Clone this repository:
  git clone https://github.com/yourusername/pdf-ocr-pipeline.git
  cd pdf-ocr-pipeline
- Install the project environment and dependencies using uv:
  # Install uv if not already available
  pip install uv
  # Sync the project (creates .venv and installs dependencies)
  uv sync
This will set up a virtual environment, install the package in editable mode, and install all runtime dependencies.
# Add the published package as a dependency and sync the environment
uv add pdf-ocr-pipeline
uv sync
Ensure the required external tools are installed:
Ubuntu/Debian:
sudo apt-get install poppler-utils tesseract-ocr
macOS:
brew install poppler tesseract
Windows:
Install Poppler for Windows and Tesseract for Windows. Ensure both are added to your PATH.
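To confirm the external tools are reachable before running the pipeline, a small check like the following may help. It is a sketch using only the standard library; the function name `missing_tools` is hypothetical and not part of this package.

```python
import shutil

# Executables named in the requirements above.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def report_missing_tools():
    """Print a short status message about the required tools."""
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required external tools were found.")
```

Run `report_missing_tools()` in a Python shell after installing Poppler and Tesseract.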
After installation, you can run the pdf-ocr command directly or via uv:
Basic usage:
# Using the `pdf-ocr` command
pdf-ocr path/to/document.pdf > result.json
# Using uv to run the CLI
uv run pdf-ocr path/to/document.pdf > result.json
Process multiple files:
# Using the `pdf-ocr` command
pdf-ocr file1.pdf file2.pdf file3.pdf > results.json
# Using uv to run the CLI
uv run pdf-ocr file1.pdf file2.pdf file3.pdf > results.json
- --dpi DPI: Set the resolution for OCR processing (default: 300)
- -l, --lang LANGUAGE: Set the language for Tesseract (default: eng)
Example with options:
pdf-ocr --dpi 600 -l fra path/to/french_document.pdf > result.json
The package includes a powerful tool to analyze OCR text using OpenAI's GPT-4o model:
# Process a PDF file and analyze the text
pdf-ocr document.pdf | pdf-ocr-summarize --pretty > analysis.json
# Use a custom prompt for the analysis
pdf-ocr document.pdf | pdf-ocr-summarize --prompt "Extract all dates and names mentioned in the text" > analysis.json
You can tailor the AI analysis by providing custom prompts for different types of documents:
- Legal documents: --prompt "Extract all legal entities, contract provisions, and obligations"
- Academic papers: --prompt "Summarize the methodology, findings and conclusions"
- Financial reports: --prompt "Extract financial figures, percentages, and trends"
- Medical documents: --prompt "Extract diagnoses, treatments, and medications"
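If you script these variations, one option is to keep the prompts in a lookup table and build the shell pipeline from it. The sketch below only assembles the command string, using the `--prompt` flag shown above; the names `PROMPTS` and `summarize_pipeline` are hypothetical and not part of this package.

```python
import shlex

# Prompt presets copied from the list above, keyed by document type.
PROMPTS = {
    "legal": "Extract all legal entities, contract provisions, and obligations",
    "academic": "Summarize the methodology, findings and conclusions",
    "financial": "Extract financial figures, percentages, and trends",
    "medical": "Extract diagnoses, treatments, and medications",
}


def summarize_pipeline(pdf, doc_type, output="analysis.json"):
    """Return a shell pipeline that OCRs a PDF and analyzes it with a preset prompt."""
    prompt = PROMPTS[doc_type]
    # shlex.quote keeps paths and prompts with spaces or quotes safe for the shell.
    return "pdf-ocr {} | pdf-ocr-summarize --prompt {} > {}".format(
        shlex.quote(pdf), shlex.quote(prompt), shlex.quote(output)
    )


print(summarize_pipeline("contract.pdf", "legal"))
```

The printed line can be pasted into a shell or passed to `subprocess.run(..., shell=True)`.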
- API Key: Set the OPENAI_API_KEY environment variable with your OpenAI API key
- Model: Uses GPT-4o by default for optimal accuracy and structured output
- Output Format: Returns structured JSON with analysis organized into relevant sections
- Verbose Mode: Use -v for detailed processing information
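When calling the analysis step from your own scripts, it can be useful to fail early if the key is missing rather than partway through a batch. A minimal sketch, assuming only that the key is read from the OPENAI_API_KEY environment variable as described above; `require_api_key` is a hypothetical helper, not part of this package.

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise if it is absent."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the analysis step."
        )
    return key
```

Passing a plain dictionary as `env` makes the check easy to exercise in tests without touching the real environment.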
See the API reference for details on library functions and the Project Organization for an overview of the code structure.
You can also use the pipeline directly from Python:
from pdf_ocr_pipeline import process_pdf
from pdf_ocr_pipeline.types import ProcessSettings
# Pure OCR
ocr = process_pdf("invoice.pdf", settings=ProcessSettings())
# OCR + segmentation via GPT
segments = process_pdf(
    "closing_package.pdf",
    settings=ProcessSettings(analyze=True),
)
See the examples/ directory for more in‑depth examples.
from pathlib import Path
from pdf_ocr_pipeline import process_pdf
from pdf_ocr_pipeline.types import ProcessSettings
# Pure OCR with DPI and language override
pdf_path = Path('document.pdf')
settings = ProcessSettings(dpi=300, lang='eng')
ocr_result = process_pdf(pdf_path, settings=settings)
print(ocr_result)
The repository includes example scripts to demonstrate common workflows:
- Batch Processing with AI Analysis
  ./examples/ocr_and_analyze.sh document.pdf "Extract key points and entities from this document" output.json
  This script performs OCR on a PDF, sends the extracted text to OpenAI's GPT-4o for analysis, and saves the structured results to a JSON file. Perfect for automating document processing workflows.
- Directory Processing
  ./examples/process_dir.sh /path/to/pdf/directory [dpi] [language]
  This script processes all PDF files in a directory and saves individual JSON output files to an ocr_output subdirectory. Ideal for batch processing large collections of documents.
- Programmatic Integration
  # From examples/programmatic_usage.py
  from pathlib import Path
  from pdf_ocr_pipeline import process_pdf

  pdf_path = Path('document.pdf')
  # Pure OCR
  ocr_result = process_pdf(pdf_path)
  print(ocr_result)
  # OCR + segmentation via GPT
  segmentation_result = process_pdf(pdf_path, analyze=True)
  print(segmentation_result)
  Demonstrates how to integrate the PDF OCR Pipeline into your own Python applications, using the high-level process_pdf function for both OCR and optional AI analysis.
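The directory workflow can also be approximated in Python. The sketch below mirrors what the directory script is described as doing (one JSON file per PDF in an ocr_output subdirectory) but takes the OCR step as a callable so it runs without the package installed. `process_directory` is a hypothetical helper, and the return type of `process_pdf` is not documented here, so `default=str` is used as a fallback for values that are not JSON-serializable.

```python
import json
from pathlib import Path


def process_directory(pdf_dir, ocr_func, output_name="ocr_output"):
    """Run ocr_func on every PDF in pdf_dir and write one JSON file per document."""
    pdf_dir = Path(pdf_dir)
    out_dir = pdf_dir / output_name
    out_dir.mkdir(exist_ok=True)
    written = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        result = ocr_func(pdf_path)
        out_path = out_dir / (pdf_path.stem + ".json")
        # default=str is a fallback for result objects json cannot encode directly.
        out_path.write_text(
            json.dumps(result, indent=2, default=str), encoding="utf-8"
        )
        written.append(out_path)
    return written
```

With the package installed, the callable could be something like `lambda p: process_pdf(p, settings=ProcessSettings(dpi=300, lang="eng"))`, following the programmatic usage shown earlier.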
The tool outputs JSON to stdout with the following structure:
For a single file:
[
  {
    "file": "document.pdf",
    "ocr_text": "The extracted text content..."
  }
]
For multiple files:
[
  {
    "file": "file1.pdf",
    "ocr_text": "The extracted text from file1..."
  },
  {
    "file": "file2.pdf",
    "ocr_text": "The extracted text from file2..."
  }
]
When using the GPT-4o analysis feature, the output format is:
[
  {
    "file": "document.pdf",
    "analysis": {
      "summary": "Brief summary of the document content",
      "entities": [
        {"name": "John Smith", "type": "person"},
        {"date": "2023-04-15", "type": "date"}
      ],
      "key_points": [
        "First important point from the document",
        "Second important point from the document"
      ],
      "tables": [
        {
          "header": ["Column1", "Column2"],
          "rows": [
            ["Value1", "Value2"],
            ["Value3", "Value4"]
          ]
        }
      ]
    }
  }
]
Note: The exact structure of the analysis field may vary depending on the prompt used and the content of the document.
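Because the analysis structure can vary, downstream code is safer when it reads fields defensively. The sketch below consumes either output shape shown above; `describe_records` is a hypothetical helper, and the only keys it relies on are the ones that appear in the examples ("file", "ocr_text", "analysis", "summary").

```python
import json


def describe_records(raw_json):
    """Return one short description per record in the tool's JSON output."""
    lines = []
    for record in json.loads(raw_json):
        name = record.get("file", "<unknown>")
        if "analysis" in record:
            analysis = record["analysis"]
            # The analysis layout depends on the prompt, so do not assume a dict.
            if isinstance(analysis, dict):
                summary = analysis.get("summary", "")
            else:
                summary = str(analysis)
            lines.append(f"{name}: analysis: {summary}")
        else:
            text = record.get("ocr_text", "")
            lines.append(f"{name}: {len(text)} characters of OCR text")
    return lines
```

For example, `describe_records(open("result.json").read())` summarizes a file produced by redirecting the tool's stdout.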
To run the test suite:
python -m unittest discover tests
The tests use mock objects to avoid dependencies on external tools, so you can run them even without installing pdftoppm or tesseract.
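To illustrate the mocking approach in general terms, the sketch below patches `subprocess.run` so a test never launches tesseract. It is not taken from the project's test suite: `run_tesseract` is a hypothetical helper written only for this example.

```python
import subprocess
import unittest
from unittest import mock


def run_tesseract(image_path, lang="eng"):
    """Hypothetical helper: OCR one image by shelling out to tesseract."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", "-l", lang],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


class RunTesseractTest(unittest.TestCase):
    @mock.patch("subprocess.run")
    def test_returns_stdout_without_calling_tesseract(self, fake_run):
        # The fake stands in for the real process, so tesseract need not be installed.
        fake_run.return_value = subprocess.CompletedProcess(
            args=[], returncode=0, stdout="hello\n"
        )
        self.assertEqual(run_tesseract("page.png", lang="fra"), "hello\n")
        fake_run.assert_called_once_with(
            ["tesseract", "page.png", "stdout", "-l", "fra"],
            check=True,
            capture_output=True,
            text=True,
        )
```

Saved under tests/ with a name starting with `test`, a file like this would be picked up by the discovery command above.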
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch: git checkout -b feature/amazing-feature
- Format your code: black .
- Run linters: flake8 and mypy
- Commit your changes: git commit -m 'Add amazing feature'
- Push to the branch: git push origin feature/amazing-feature
- Open a Pull Request
- API Reference – Library and CLI reference
- Project Organization – Code structure and design
- Changelog – History of changes
- Contributing – Guidelines for contributing
pdf-ocr-pipeline/
├── CHANGELOG.md  # History of changes to the project
├── CLAUDE.md  # Guidelines for Claude AI when working with this code
├── CONTRIBUTING.md  # Guidelines for contributing to the project
├── LICENSE  # MIT License
├── Makefile  # Development task automation
├── README.md  # This documentation
├── bin/  # Executable scripts
│   ├── pdf-ocr  # OCR command-line script
│   └── summarize_text.py  # Text analysis command-line script
├── docs/  # Documentation
│   ├── api.md  # API reference
│   └── project_organization.md  # Project structure and design
├── examples/  # Example scripts and usage patterns
│   ├── __init__.py  # Package indicator
│   ├── ocr_and_analyze.sh  # Combined OCR and analysis script
│   ├── process_dir.sh  # Directory processing script
│   └── programmatic_usage.py  # Example of programmatic usage
├── pdf-ocr  # CLI entry point
├── pyproject.toml  # Modern Python project configuration
├── requirements-dev.txt  # Development dependencies
├── requirements.lock  # Locked dependencies
├── setup.cfg  # Configuration for development tools
├── setup.py  # Package installation configuration
├── src/  # Source code
│   └── pdf_ocr_pipeline/  # Main package
│       ├── __init__.py  # Package initialization
│       ├── __main__.py  # Entry point for running as a module
│       ├── cli.py  # OCR command-line interface
│       ├── ocr.py  # Core OCR functionality
│       └── summarize.py  # AI text analysis functionality
├── tests/  # Unit tests
│   ├── __init__.py  # Package indicator
│   ├── test_cli.py  # Tests for CLI functionality
│   ├── test_ocr.py  # Tests for OCR functionality
│   ├── test_pipeline.py  # Integration tests for the full pipeline
│   └── test_summarize.py  # Tests for AI summarization
└── tox.ini  # Test automation configuration
This project is licensed under the MIT License - see the LICENSE file for details.
- Poppler for PDF rendering
- Tesseract OCR for text recognition