fix: Handle UnicodeDecodeError in PlainTextConverter (#1505) by s3ich4n · Pull Request #1540 · microsoft/markitdown

s3ich4n · 2026-01-21T20:11:18Z

When charset detection samples only the first 4096 bytes and detects ascii, but the file contains UTF-8 characters beyond that point, decoding fails with UnicodeDecodeError.

Added a fallback to charset_normalizer when UnicodeDecodeError occurs, allowing proper handling of files with non-ASCII characters (Spanish, Korean, Japanese, Chinese, etc.)
that appear after the 4096-byte sample.
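
For illustration, here is a minimal sketch of the failure mode and the fallback described above. It is not the code from this PR: `guess_charset_from_sample` and `decode_with_fallback` are hypothetical names, and the sampled detector is a simplified stand-in for the real charset detection. The only third-party call used is `charset_normalizer.from_bytes(...).best()`; the import is optional so the sketch still runs (falling back to UTF-8 with replacement) if the package is not installed.

```python
from typing import Optional

try:
    # Third-party detector used as the fallback; optional in this sketch.
    from charset_normalizer import from_bytes
except ImportError:
    from_bytes = None

SAMPLE_SIZE = 4096  # size of the detection sample mentioned in the description


def guess_charset_from_sample(data: bytes) -> str:
    """Hypothetical stand-in for sampled detection.

    Looks only at the first SAMPLE_SIZE bytes, so it reports 'ascii'
    even when non-ASCII bytes appear later in the file.
    """
    sample = data[:SAMPLE_SIZE]
    return "ascii" if sample.isascii() else "utf-8"


def decode_with_fallback(data: bytes, charset: Optional[str] = None) -> str:
    """Decode with the guessed charset; re-detect on the full payload on failure."""
    charset = charset or guess_charset_from_sample(data)
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        # The sample-based guess was wrong: let charset_normalizer
        # inspect the whole payload instead of only the first bytes.
        if from_bytes is not None:
            match = from_bytes(data).best()
            if match is not None:
                return str(match)
        # Last resort (an assumption of this sketch, not of the PR).
        return data.decode("utf-8", errors="replace")


# ASCII-only prefix longer than the sample, followed by UTF-8 text.
payload = b"a" * 5000 + "한국어 español".encode("utf-8")
text = decode_with_fallback(payload)
```

Here the sampled guess is `ascii`, the first decode raises `UnicodeDecodeError`, and the fallback recovers the trailing Korean and Spanish text.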

fixes #1505


s3ich4n · 2026-01-21T20:14:34Z

@microsoft-github-policy-service agree

…1505)

When charset detection samples only the first 4096 bytes and detects 'ascii', but the file contains UTF-8 characters beyond that point, decoding fails with UnicodeDecodeError. Added fallback to charset_normalizer when UnicodeDecodeError occurs, allowing proper handling of files with non-ASCII characters (Spanish, Korean, Japanese, Chinese, etc.) that appear after the 4096-byte sample. Cherry-picked from microsoft/markitdown PR microsoft#1540

s3ich4n added 2 commits January 22, 2026 05:19

style: fix linting (microsoft#1505)

fb6596d

chore: Include charset fallback test in __main__ runner (microsoft#…

a3adc91

…1505)

bpeloquinm-glitch approved these changes Jan 29, 2026

s3ich4n wants to merge 3 commits into microsoft:main from s3ich4n:fix/issue-1505-unicode-decode-error
