fix: Handle UnicodeDecodeError in PlainTextConverter (#1505) #1540

Open
s3ich4n wants to merge 3 commits into microsoft:main from s3ich4n:fix/issue-1505-unicode-decode-error

Conversation

s3ich4n commented Jan 21, 2026

When charset detection samples only the first 4096 bytes and detects ascii, but the file contains UTF-8 characters beyond that point, decoding fails with UnicodeDecodeError.

Added a fallback to charset_normalizer when UnicodeDecodeError occurs, allowing proper handling of files with non-ASCII characters (Spanish, Korean, Japanese, Chinese, etc.) that appear after the 4096-byte sample.
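
For illustration, the failure and the fallback can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the actual diff in this PR: `guess_charset_from_sample`, `detect_with_charset_normalizer`, and `decode_with_fallback` are hypothetical names, and the sampled detection step is replaced by a simple stand-in. Only `charset_normalizer.from_bytes(...).best()` is a real library call, and it is imported lazily so the rest of the sketch runs without the package installed.

```python
from typing import Callable, Optional

SAMPLE_SIZE = 4096  # sample size mentioned in the description above


def guess_charset_from_sample(data: bytes) -> str:
    """Hypothetical stand-in for detection that only inspects the first bytes."""
    try:
        data[:SAMPLE_SIZE].decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        return "utf-8"


def detect_with_charset_normalizer(data: bytes) -> Optional[str]:
    """Detect the encoding from the full payload; returns None if no match."""
    from charset_normalizer import from_bytes  # third-party, imported lazily

    best = from_bytes(data).best()
    return best.encoding if best is not None else None


def decode_with_fallback(
    data: bytes,
    guessed: str,
    detect: Callable[[bytes], Optional[str]] = detect_with_charset_normalizer,
) -> str:
    """Decode with the sampled guess; re-detect on the full payload on failure."""
    try:
        return data.decode(guessed)
    except UnicodeDecodeError:
        encoding = detect(data)
        if encoding is None:
            raise  # nothing better found: surface the original error
        return data.decode(encoding)
```

With a payload whose first 4096 bytes are pure ASCII, `guess_charset_from_sample` returns `"ascii"`, the strict decode raises, and the second pass decodes using whatever encoding the detector reports for the whole payload. The `detect` parameter is an assumption added here so the fallback can be swapped out in tests.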

fixes #1505


s3ich4n commented Jan 21, 2026

@microsoft-github-policy-service agree

mikeumus added a commit to Divinci-AI/markitdown that referenced this pull request Feb 2, 2026
When charset detection samples only the first 4096 bytes and detects 'ascii',
but the file contains UTF-8 characters beyond that point,
decoding fails with UnicodeDecodeError.

Added fallback to charset_normalizer when UnicodeDecodeError occurs,
allowing proper handling of files with non-ASCII characters
(Spanish, Korean, Japanese, Chinese, etc.)
that appear after the 4096-byte sample.

Cherry-picked from microsoft/markitdown PR microsoft#1540