UTF-16 surrogate pairs mishandled when processing non-BMP characters (e.g., 𣄃) in string-to-byte conversions #2932

Closed
@yoyo837

Description

Bug description

Summary

When a string containing a non-BMP character such as '𣄃' (U+23103) is passed into an AssemblyScript function, the character is represented as two UTF-16 code units (a surrogate pair), which is expected. However, this becomes a problem when encoding the string into a UTF-8 byte stream for hashing (e.g., MD5), because AssemblyScript does not appear to provide a built-in way to treat a surrogate pair as a single Unicode code point.

This leads to incorrect behavior, especially in cryptographic applications (e.g., generating an MD5 hash of a string containing such characters), where string encoding must exactly match UTF-8 as produced by JavaScript's TextEncoder.
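To make the surrogate-pair arithmetic concrete, here is a minimal sketch in plain TypeScript (not AssemblyScript). `combineSurrogates` is a hypothetical helper name used only for illustration; it is not part of any standard library.

```typescript
// Hypothetical helper: combine a UTF-16 surrogate pair into the single
// Unicode code point it represents.
function combineSurrogates(high: number, low: number): number {
  return ((high - 0xd800) << 10) + (low - 0xdc00) + 0x10000;
}

const sample = "\u{23103}"; // '𣄃'

// The string holds two UTF-16 code units, so its length is 2.
const highUnit = sample.charCodeAt(0); // 0xD84C
const lowUnit = sample.charCodeAt(1); // 0xDD03

// Recombining the pair recovers the original code point.
const codePoint = combineSurrogates(highUnit, lowUnit); // 0x23103
```

For comparison, the UTF-8 encoding of U+23103 is the four bytes F0 A3 84 83. Encoding each surrogate separately as if it were its own code point would give six bytes instead, and therefore a different hash.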

Steps to Reproduce

export function debugUtf16(str: string): void {
  // Trace every UTF-16 code unit; '𣄃' produces two (a surrogate pair).
  for (let i = 0; i < str.length; i++) {
    trace("charCodeAt", 1, str.charCodeAt(i));
  }
}

Would it be possible to:

Provide a built-in String.codePointAt(i: i32): i32 in the standard library, to make proper iteration over Unicode strings easier?

Offer a utility that converts a string to a properly UTF-8 encoded Uint8Array, respecting surrogate pairs?

Possibly integrate this into String.UTF8.encode(), or add a safer variant?
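As a rough illustration of the kind of utility requested above, here is a sketch of a surrogate-aware UTF-8 encoder in plain TypeScript. `encodeUtf8` is a hypothetical name, and the code is not taken from AssemblyScript's standard library. An AssemblyScript port would presumably use its integer types and a Uint8Array rather than `number[]`; that is an assumption, not something tested here.

```typescript
// Hypothetical helper: encode a UTF-16 string as UTF-8 bytes, treating each
// surrogate pair as one code point. Unpaired surrogates are replaced with
// U+FFFD, matching what JavaScript's TextEncoder does.
function encodeUtf8(str: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    // High surrogate followed by a low surrogate: combine into one code point.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < str.length) {
      const low = str.charCodeAt(i + 1);
      if (low >= 0xdc00 && low <= 0xdfff) {
        cp = ((cp - 0xd800) << 10) + (low - 0xdc00) + 0x10000;
        i++; // the low surrogate has been consumed
      }
    }
    // Anything still in the surrogate range is unpaired.
    if (cp >= 0xd800 && cp <= 0xdfff) cp = 0xfffd;

    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}

// '𣄃' (U+23103) encodes to the four bytes F0 A3 84 83.
const encoded = encodeUtf8("\u{23103}");
```

Feeding these bytes to the hash function, rather than the raw UTF-16 code units, should give the same digest as hashing the output of TextEncoder on the JavaScript side.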

Steps to reproduce

None

AssemblyScript version

0.28.2
