Memory corruption crash with ThreadPoolExecutor during batch processing
Description
When processing a large dataset of PDFs (~100k documents) using DocumentConverter with ThreadPoolExecutor, the process crashes randomly with memory corruption errors after processing several hundred documents.
Error Messages
Two variants observed:
malloc(): unsorted double linked list corrupted
corrupted double-linked list
Environment
- Docling version: 2.57.0
- Python version: 3.13.9
- OS: Linux 6.17.4-200.fc42.x86_64 (Fedora 42)
- CPU: 20 cores
- Architecture: x86_64
Configuration
from concurrent.futures import ThreadPoolExecutor, as_completed
from docling.datamodel.accelerator_options import AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configuration
workers = 4   # CPU threads per document
parallel = 5  # Documents processed in parallel (auto-detected as cpu_count // workers)

# Setup
accelerator_options = AcceleratorOptions(
    num_threads=workers,
    device="auto",
)
pipeline_options = PdfPipelineOptions(
    artifacts_path="/home/user/.cache/docling/models",
    do_ocr=False,  # OCR disabled for speed
)
pipeline_options.accelerator_options = accelerator_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Process documents in parallel (pdf_paths: the input files, elided here)
with ThreadPoolExecutor(max_workers=parallel) as executor:
    # Submit conversion jobs and collect results
    futures = [executor.submit(converter.convert, str(pdf_path)) for pdf_path in pdf_paths]
    for future in as_completed(futures):
        result = future.result()
Expected Behavior
The converter should process all documents without crashing, or at minimum raise a Python exception that can be caught and handled.
Actual Behavior
The process crashes with a C-level memory corruption error that cannot be caught by Python exception handling. The error appears to originate from native libraries (likely PDF parsing or image processing libraries).
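Because glibc aborts the process from native code, try/except never runs, but the standard-library faulthandler module can at least record which Python frames were active when the fatal signal arrived. A minimal sketch (the log path is hypothetical):

import faulthandler

# Keep a reference to the file object: faulthandler writes to its raw
# file descriptor at crash time, so it must stay open for the process lifetime.
crash_log = open("/tmp/docling_crash.log", "w")  # hypothetical path

# Dump the Python stack of every thread when the process receives
# SIGSEGV, SIGFPE, SIGABRT, SIGBUS, or SIGILL (malloc corruption ends in SIGABRT).
faulthandler.enable(file=crash_log, all_threads=True)

The resulting per-thread tracebacks usually point at the extension module (PDF parser, image codec, model runtime) that was executing when the corruption was detected.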
Workaround
Currently the process is restarted manually after each crash. The command automatically skips already-processed documents, so the dataset completes gradually across restarts.
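A less manual variant of the same idea is to run each conversion in a worker process instead of a thread: when a worker dies from a native fault, the parent gets a catchable BrokenProcessPool instead of dying itself. A sketch under the assumption that docling conversion works in subprocesses; build_converter is a hypothetical stand-in for the setup shown under Configuration:

from concurrent.futures import ProcessPoolExecutor, as_completed
from concurrent.futures.process import BrokenProcessPool

_converter = None

def _convert_one(pdf_path):
    # Each worker process builds its own converter once, so native state
    # is never shared across threads or processes.
    global _converter
    if _converter is None:
        _converter = build_converter()  # hypothetical: the Configuration setup
    return _converter.convert(pdf_path)

def convert_all(pdf_paths, max_workers=5):
    pending = set(pdf_paths)
    while pending:
        with ProcessPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(_convert_one, p): p for p in pending}
            try:
                for future in as_completed(futures):
                    future.result()
                    pending.discard(futures[future])
            except BrokenProcessPool:
                # A worker was killed by the native fault; the pool is now
                # unusable, so rebuild it and retry the remaining files.
                # (A real version would cap retries per file so one
                # reliably-crashing document cannot loop forever.)
                continue

Compared with restarting the whole run, this loses only the document that the dying worker was processing, and the already-completed results survive in the parent.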