Skip to content

fix(mcp): return valid UTF-8 snippets#526

Merged
DeusData merged 1 commit into
DeusData:mainfrom
jstar0:fix/snippet-valid-utf8
Jun 23, 2026
Merged

fix(mcp): return valid UTF-8 snippets#526
DeusData merged 1 commit into
DeusData:mainfrom
jstar0:fix/snippet-valid-utf8

Conversation

@jstar0

@jstar0 jstar0 commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Sanitize get_code_snippet source text to valid UTF-8 before serializing the MCP response.
  • Preserve valid UTF-8 bytes and replace only invalid source bytes with U+FFFD.
  • Add a regression test for non-UTF-8 source bytes in the snippet file.

Fixes #511.

Changes

  • Adds a small lossy UTF-8 sanitizer for snippet source strings.
  • Uses the sanitized copy for the source field in build_snippet_response().
  • Verifies both the raw MCP response envelope and the embedded snippet JSON parse as strict JSON when the source contains CP949/EUC-KR-like bytes.

Verification

make -f Makefile.cbm test
make -f Makefile.cbm lint-format
make -f Makefile.cbm lint-no-suppress
git diff --check

I also attempted:

make -f Makefile.cbm lint-ci

That local run stopped at cppcheck because cppcheck is not installed in my environment.

Signed-off-by: King Star <mcxin.y@gmail.com>
@DeusData

Copy link
Copy Markdown
Owner

Thanks @jstar0 — your reproduce-first test here (write known-bad CP949 bytes, assert the response is valid JSON with U+FFFD replacements) is excellent and we want to keep it.

There's a parallel PR #541 fixing the same issue (#511) with a cleaner sanitizer (canonical (c & 0xF0) == 0xE0 lead-byte checks). Our plan is to land one combined fix: #541's sanitizer + your test + an explicit overflow guard, crediting both of you.

One change needed here regardless: the overflow guard len > (((size_t)-1) - SKIP_ONE) / UTF8_REPLACEMENT_LEN hinges on the opaque SKIP_ONE macro — please make the + 1 explicit (with a comment) so the bound is auditable without chasing the macro.

Would you be open to combining your test with #541's sanitizer? 🙏

piiiico added a commit to piiiico/codebase-memory-mcp that referenced this pull request Jun 22, 2026
…f8 (DeusData#511)

Per maintainer review on DeusData#541, two additions to the snippet UTF-8
sanitizer:

1. **Reproduce-first test** (`snippet_source_invalid_utf8`)
   Writes a Go source file containing the CP949 sequence 0xC0 0xD4
   0xB7 0xC2 inside a comment, then calls `get_code_snippet` and
   asserts:
   - the outer MCP envelope parses as strict JSON,
   - the embedded snippet JSON parses as strict JSON,
   - the original invalid bytes (0xC0 0xD4) do not appear verbatim,
   - surrounding valid bytes (HandleRequest, return nil) survive,
   - U+FFFD (EF BF BD) marks where invalid bytes were.

   Without `sanitize_utf8`, the response contains raw non-UTF-8 bytes
   and step 1 fails (yyjson rejects it) — i.e. the test fails on
   pre-fix `main`. Byte pattern and structure borrowed from DeusData#526
   (jstar0) so both contributions converge on the same regression.

2. **Allocation overflow guard**
   `malloc(src_len * 3 + 1)` is technically UB if
   `src_len > (SIZE_MAX - 1) / 3`. Snippet sources can't reach that
   in practice, but the guard makes the safety property local to
   the function rather than relying on call-site bounds. The same
   bound is enforced in DeusData#526.

Refs DeusData#511.

Co-authored-by: King Star <54024410+jstar0@users.noreply.github.com>
Signed-off-by: piiiico <pico@amdal.dev>
@DeusData DeusData merged commit f82c6ae into DeusData:main Jun 23, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants