fix(mcp): return valid UTF-8 snippets by jstar0 · Pull Request #526 · DeusData/codebase-memory-mcp

jstar0 · 2026-06-19T18:21:18Z

Summary

Sanitize get_code_snippet source text to valid UTF-8 before serializing the MCP response.
Preserve valid UTF-8 bytes and replace only invalid source bytes with U+FFFD.
Add a regression test for non-UTF-8 source bytes in the snippet file.

Fixes #511.

Changes

Adds a small lossy UTF-8 sanitizer for snippet source strings.
Uses the sanitized copy for the source field in build_snippet_response().
Verifies both the raw MCP response envelope and the embedded snippet JSON parse as strict JSON when the source contains CP949/EUC-KR-like bytes.

Verification

make -f Makefile.cbm test
make -f Makefile.cbm lint-format
make -f Makefile.cbm lint-no-suppress
git diff --check

I also attempted:

make -f Makefile.cbm lint-ci

That local run stopped at cppcheck because cppcheck is not installed in my environment.

Signed-off-by: King Star <mcxin.y@gmail.com>

DeusData · 2026-06-22T20:27:07Z

Thanks @jstar0 — your reproduce-first test here (write known-bad CP949 bytes, assert the response is valid JSON with U+FFFD replacements) is excellent and we want to keep it.

There's a parallel PR #541 fixing the same issue (#511) with a cleaner sanitizer (canonical (c & 0xF0) == 0xE0 lead-byte checks). Our plan is to land one combined fix: #541's sanitizer + your test + an explicit overflow guard, crediting both of you.

One change needed here regardless: the overflow guard len > (((size_t)-1) - SKIP_ONE) / UTF8_REPLACEMENT_LEN hinges on the opaque SKIP_ONE macro — please make the + 1 explicit (with a comment) so the bound is auditable without chasing the macro.

Would you be open to combining your test with #541's sanitizer? 🙏

…f8 (DeusData#511) Per maintainer review on DeusData#541, two additions to the snippet UTF-8 sanitizer: 1. **Reproduce-first test** (`snippet_source_invalid_utf8`) Writes a Go source file containing the CP949 sequence 0xC0 0xD4 0xB7 0xC2 inside a comment, then calls `get_code_snippet` and asserts: - the outer MCP envelope parses as strict JSON, - the embedded snippet JSON parses as strict JSON, - the original invalid bytes (0xC0 0xD4) do not appear verbatim, - surrounding valid bytes (HandleRequest, return nil) survive, - U+FFFD (EF BF BD) marks where invalid bytes were. Without `sanitize_utf8`, the response contains raw non-UTF-8 bytes and step 1 fails (yyjson rejects it) — i.e. the test fails on pre-fix `main`. Byte pattern and structure borrowed from DeusData#526 (jstar0) so both contributions converge on the same regression. 2. **Allocation overflow guard** `malloc(src_len * 3 + 1)` is technically UB if `src_len > (SIZE_MAX - 1) / 3`. Snippet sources can't reach that in practice, but the guard makes the safety property local to the function rather than relying on call-site bounds. The same bound is enforced in DeusData#526. Refs DeusData#511. Co-authored-by: King Star <54024410+jstar0@users.noreply.github.com> Signed-off-by: piiiico <pico@amdal.dev>

fix(mcp): return valid UTF-8 snippets

935027a

Signed-off-by: King Star <mcxin.y@gmail.com>

DeusData mentioned this pull request Jun 22, 2026

fix: sanitize non-UTF-8 source in get_code_snippet #541

Closed

DeusData merged commit f82c6ae into DeusData:main Jun 23, 2026
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mcp): return valid UTF-8 snippets#526

fix(mcp): return valid UTF-8 snippets#526
DeusData merged 1 commit into
DeusData:mainfrom
jstar0:fix/snippet-valid-utf8

jstar0 commented Jun 19, 2026

DeusData commented Jun 22, 2026

Uh oh!

Labels

2 participants

Conversation