PDFs failing to convert to markdown with invalid start byte (0x81 or 0x82) #2491
Unanswered
shanefay422
asked this question in
Q&A
Replies: 0 comments
I have a number of PDFs that fail when converting to Markdown with docling. I have tried to clean the files up with Ghostscript and PyMuPDF, to no avail. These PDFs all render fine in a PDF reader.
Below is a stack trace of the error I get converting them, as well as the PDF that caused the failure.
Any help would be greatly appreciated!
Traceback (most recent call last):
  File "/Librarian/tasks/process_document.py", line 302, in get_file_text
    text, page_num, total_pages = docling_convert_to_markdown(file_path)
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/Librarian/tasks/process_document.py", line 160, in docling_convert_to_markdown
    result = converter.convert(source=doc_source)
  File "/usr/local/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py", line 39, in wrapper_function
    return wrapper(*args, **kwargs)
  File "/usr/local/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py", line 136, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
  File "/usr/local/lib/python3.13/site-packages/docling/document_converter.py", line 245, in convert
    return next(all_res)
  File "/usr/local/lib/python3.13/site-packages/docling/document_converter.py", line 268, in convert_all
    for conv_res in conv_res_iter:
                    ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/docling/document_converter.py", line 340, in _convert
    for item in map(
                ~~~^
        process_func,
        ^^^^^^^^^^^^^
        input_batch,
        ^^^^^^^^^^^^
    ):
    ^
  File "/usr/local/lib/python3.13/site-packages/docling/document_converter.py", line 387, in _process_document
    conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
  File "/usr/local/lib/python3.13/site-packages/docling/document_converter.py", line 410, in _execute_pipeline
    conv_res = pipeline.execute(in_doc, raises_on_error=raises_on_error)
  File "/usr/local/lib/python3.13/site-packages/docling/pipeline/base_pipeline.py", line 80, in execute
    raise e
  File "/usr/local/lib/python3.13/site-packages/docling/pipeline/base_pipeline.py", line 72, in execute
    conv_res = self._build_document(conv_res)
  File "/usr/local/lib/python3.13/site-packages/docling/pipeline/base_pipeline.py", line 270, in _build_document
    raise e
  File "/usr/local/lib/python3.13/site-packages/docling/pipeline/base_pipeline.py", line 230, in _build_document
    for p in pipeline_pages:  # Must exhaust!
             ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/docling/pipeline/base_pipeline.py", line 195, in _apply_on_pages
    yield from page_batch
  File "/usr/local/lib/python3.13/site-packages/docling/models/page_assemble_model.py", line 70, in __call__
    for page in page_batch:
                ^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/docling/models/table_structure_model.py", line 177, in __call__
    for page in page_batch:
                ^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/docling/models/layout_model.py", line 152, in __call__
    pages = list(page_batch)
  File "/usr/local/lib/python3.13/site-packages/docling/models/auto_ocr_model.py", line 126, in __call__
    yield from page_batch
  File "/usr/local/lib/python3.13/site-packages/docling/models/page_preprocessing_model.py", line 48, in __call__
    page = self._parse_page_cells(conv_res, page)
  File "/usr/local/lib/python3.13/site-packages/docling/models/page_preprocessing_model.py", line 72, in _parse_page_cells
    page.parsed_page = page._backend.get_segmented_page()
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/lib/python3.13/site-packages/docling/backend/docling_parse_v4_backend.py", line 108, in get_segmented_page
    self._ensure_parsed()
    ~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/lib/python3.13/site-packages/docling/backend/docling_parse_v4_backend.py", line 56, in _ensure_parsed
    seg_page = self._dp_doc.get_page(
        self._page_no + 1,
        ...<5 lines>...
        enforce_same_font=True,
    )
  File "/usr/local/lib/python3.13/site-packages/docling_parse/pdf_parser.py", line 162, in get_page
    doc_dict = self._parser.parse_pdf_from_key_on_page(
        key=self._key,
        ...<7 lines>...
        create_line_cells=create_textlines,
    )
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 1: invalid start byte

18KenaAndOtherUpanishads.pdf
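For anyone reading the trace, the final message can be reproduced in plain Python without docling. This is only a minimal sketch of what the error text means: the sample bytes and the helper name `describe_utf8_failure` are made up for illustration and are not taken from the PDF or from docling-parse. Bytes 0x81 and 0x82 are UTF-8 continuation bytes, so a string that has one of them where a new character should begin cannot be decoded as UTF-8. Which string inside the PDF triggers this (a font name, a text cell, or something else) is not known from the trace.

```python
def describe_utf8_failure(data: bytes) -> str:
    """Return the UnicodeDecodeError message for `data`, or "ok" if it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)
    return "ok"


# Hypothetical sample: a valid ASCII byte followed by a lone continuation byte.
# 0x81 has the bit pattern 10xxxxxx, which may only follow a UTF-8 lead byte.
sample = b"A\x81"

message = describe_utf8_failure(sample)
print(message)

# Decoding with errors="replace" does not raise; the bad byte becomes U+FFFD.
lenient = sample.decode("utf-8", errors="replace")
print(repr(lenient))
```

The position reported in the message is the byte offset inside whatever string was being decoded, so "position 1" here means the second byte of that string, not an offset into the PDF file.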