Memory corruption crash with ThreadPoolExecutor during batch processing #2559

@adonig

Description

When processing a large dataset of PDFs (~100k documents) using DocumentConverter with ThreadPoolExecutor, the process crashes randomly with memory corruption errors after processing several hundred documents.

Error Messages

Two variants observed:

  1. malloc(): unsorted double linked list corrupted
  2. corrupted double-linked list

Environment

  • Docling version: 2.57.0
  • Python version: 3.13.9
  • OS: Linux 6.17.4-200.fc42.x86_64 (Fedora 42)
  • CPU: 20 cores
  • Architecture: x86_64

Configuration

from concurrent.futures import ThreadPoolExecutor, as_completed
from docling.datamodel.accelerator_options import AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configuration
workers = 4  # CPU threads per document
parallel = 5  # Documents processed in parallel (auto-detected from cpu_count // workers)

# Setup
accelerator_options = AcceleratorOptions(
    num_threads=workers,
    device="auto",
)

pipeline_options = PdfPipelineOptions(
    artifacts_path="/home/user/.cache/docling/models",
    do_ocr=False,  # OCR disabled for speed
)
pipeline_options.accelerator_options = accelerator_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Process documents in parallel (pdf_paths is the list of input PDF files)
with ThreadPoolExecutor(max_workers=parallel) as executor:
    futures = {
        executor.submit(converter.convert, str(pdf_path)): pdf_path
        for pdf_path in pdf_paths
    }
    for future in as_completed(futures):
        result = future.result()  # the crash aborts the process before this can raise
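
A mitigation I have not verified yet: the crash may be related to sharing a single DocumentConverter instance across threads. If any of its native components are not thread-safe, giving each worker thread its own converter via threading.local might avoid the corruption. A minimal sketch reusing the pipeline_options defined above:

import threading

_tls = threading.local()

def get_converter() -> DocumentConverter:
    # Lazily build one DocumentConverter per worker thread instead of
    # sharing a single instance across the whole pool.
    if not hasattr(_tls, "converter"):
        _tls.converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
            }
        )
    return _tls.converter

Jobs would then call get_converter().convert(str(pdf_path)) instead of touching the shared converter.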

Expected Behavior

The converter should process all documents without crashing, or at minimum raise a Python exception that can be caught and handled.

Actual Behavior

The process crashes with a C-level memory-corruption error that cannot be caught by Python exception handling. The error appears to originate in native libraries (likely the PDF-parsing or image-processing libraries).
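
Because the abort happens below the Python layer, no try/except in the worker ever fires. Enabling the standard-library faulthandler at process start does not prevent the crash, but it dumps every thread's Python traceback when glibc aborts, which helps narrow down which native call was running:

import faulthandler
import sys

# faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
# SIGILL; glibc raises SIGABRT on heap corruption, so this prints the
# Python traceback of every thread just before the process dies.
faulthandler.enable(file=sys.stderr, all_threads=True)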

Workaround

Currently I restart the process manually after each crash. The batch command automatically skips already-processed documents, so the dataset completes gradually.
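
A more robust variant of this workaround is process isolation: run each conversion in a short-lived child process so that a native crash kills only that child while the batch driver keeps running. A minimal sketch, assuming a hypothetical per-document script convert_one.py that builds its own DocumentConverter and writes its output to disk:

import subprocess
import sys
from pathlib import Path

def convert_in_subprocess(pdf_path: Path) -> bool:
    # convert_one.py is a hypothetical entry point that converts exactly
    # one document; a SIGABRT/SIGSEGV inside it kills the child, not us.
    proc = subprocess.run(
        [sys.executable, "convert_one.py", str(pdf_path)],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # A negative return code means the child died from a signal.
        print(f"FAILED {pdf_path} (rc={proc.returncode})", file=sys.stderr)
        return False
    return True

for path in sorted(Path("pdfs").glob("*.pdf")):
    convert_in_subprocess(path)

Note that ProcessPoolExecutor is not a drop-in replacement here: a worker killed by a signal marks the whole pool as broken (BrokenProcessPool), so per-document subprocesses or pool recreation are needed.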
