Memory corruption crash with ThreadPoolExecutor during batch processing #2559

@adonig

Description

When processing a large dataset of PDFs (~100k documents) using DocumentConverter with ThreadPoolExecutor, the process crashes randomly with memory corruption errors after processing several hundred documents.

Error Messages

Two variants observed:

  1. malloc(): unsorted double linked list corrupted
  2. corrupted double-linked list

Environment

  • Docling version: 2.57.0
  • Python version: 3.13.9
  • OS: Linux 6.17.4-200.fc42.x86_64 (Fedora 42)
  • CPU: 20 cores
  • Architecture: x86_64

Configuration

from concurrent.futures import ThreadPoolExecutor, as_completed
from docling.datamodel.accelerator_options import AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configuration
workers = 4  # CPU threads per document
parallel = 5  # Documents processed in parallel (auto-detected from cpu_count // workers)

# Setup
accelerator_options = AcceleratorOptions(
    num_threads=workers,
    device="auto",
)

pipeline_options = PdfPipelineOptions(
    artifacts_path="/home/user/.cache/docling/models",
    do_ocr=False,  # OCR disabled for speed
)
pipeline_options.accelerator_options = accelerator_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Process documents in parallel (pdf_paths is the list of input PDF files)
with ThreadPoolExecutor(max_workers=parallel) as executor:
    futures = {
        executor.submit(converter.convert, str(pdf_path)): pdf_path
        for pdf_path in pdf_paths
    }
    for future in as_completed(futures):
        result = future.result()  # the crash aborts the process before this can raise
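
A mitigation I have not verified yet: the crash may be related to sharing a single DocumentConverter instance across threads. If any of its native components are not thread-safe, giving each worker thread its own converter via threading.local might avoid the corruption. A minimal sketch reusing the pipeline_options defined above:

import threading

_tls = threading.local()

def get_converter() -> DocumentConverter:
    # Lazily build one DocumentConverter per worker thread instead of
    # sharing a single instance across the whole pool.
    if not hasattr(_tls, "converter"):
        _tls.converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
            }
        )
    return _tls.converter

Jobs would then call get_converter().convert(str(pdf_path)) instead of touching the shared converter.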

Expected Behavior

The converter should process all documents without crashing, or at minimum raise a Python exception that can be caught and handled.

Actual Behavior

The process crashes with a C-level memory-corruption error that cannot be caught by Python exception handling. The error appears to originate in native libraries (likely the PDF-parsing or image-processing libraries).
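
Because the abort happens below the Python layer, no try/except in the worker ever fires. Enabling the standard-library faulthandler at process start does not prevent the crash, but it dumps every thread's Python traceback when glibc aborts, which helps narrow down which native call was running:

import faulthandler
import sys

# faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
# SIGILL; glibc raises SIGABRT on heap corruption, so this prints the
# Python traceback of every thread just before the process dies.
faulthandler.enable(file=sys.stderr, all_threads=True)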

Workaround

Currently I restart the process manually after each crash. The batch command automatically skips already-processed documents, so the dataset completes gradually.
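
A more robust variant of this workaround is process isolation: run each conversion in a short-lived child process so that a native crash kills only that child while the batch driver keeps running. A minimal sketch, assuming a hypothetical per-document script convert_one.py that builds its own DocumentConverter and writes its output to disk:

import subprocess
import sys
from pathlib import Path

def convert_in_subprocess(pdf_path: Path) -> bool:
    # convert_one.py is a hypothetical entry point that converts exactly
    # one document; a SIGABRT/SIGSEGV inside it kills the child, not us.
    proc = subprocess.run(
        [sys.executable, "convert_one.py", str(pdf_path)],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # A negative return code means the child died from a signal.
        print(f"FAILED {pdf_path} (rc={proc.returncode})", file=sys.stderr)
        return False
    return True

for path in sorted(Path("pdfs").glob("*.pdf")):
    convert_in_subprocess(path)

Note that ProcessPoolExecutor is not a drop-in replacement here: a worker killed by a signal marks the whole pool as broken (BrokenProcessPool), so per-document subprocesses or pool recreation are needed.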
