UTF-16 surrogate pairs mishandled when processing non-BMP characters (e.g., 𣄃) in string-to-byte conversions

Bug description

Summary

When a non-BMP character such as '𣄃' (U+23103) is passed into an AssemblyScript function, the string holds it as two UTF-16 code units (a surrogate pair), which is expected. The problem appears when encoding the string into a UTF-8 byte stream for hashing (e.g., MD5): AssemblyScript provides no built-in way to treat a surrogate pair as a single Unicode code point.

This leads to incorrect results, especially in hashing and cryptographic code (e.g., generating an MD5 hash of a string containing such characters), where the encoded bytes must exactly match the UTF-8 output of JavaScript's TextEncoder.
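To make the mismatch concrete, here is a minimal sketch in plain TypeScript (not AssemblyScript) of what happens when each UTF-16 code unit is encoded on its own. The helper name `naiveEncodeUnits` is hypothetical and exists only for illustration; the expected bytes are the standard UTF-8 encoding of U+23103.

```typescript
// Hypothetical helper (illustration only): encodes every UTF-16 code unit
// independently and ignores surrogate pairs. This is the INCORRECT approach.
function naiveEncodeUnits(str: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < str.length; i++) {
    const u = str.charCodeAt(i);
    if (u < 0x80) {
      out.push(u);
    } else if (u < 0x800) {
      out.push(0xc0 | (u >> 6), 0x80 | (u & 0x3f));
    } else {
      // Surrogates (0xD800..0xDFFF) land here and become 3 bytes each.
      out.push(0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f));
    }
  }
  return out;
}

// U+23103 written explicitly as its surrogate pair.
const sample = "\uD84C\uDD03";

// Six bytes: ED A1 8C ED B4 83 (one 3-byte sequence per surrogate).
const naiveBytes = naiveEncodeUnits(sample);

// Well-formed UTF-8 for U+23103 is four bytes: F0 A3 84 83.
const expectedUtf8: number[] = [0xf0, 0xa3, 0x84, 0x83];
```

Any hash computed over the six-byte form will differ from one computed over the four-byte form, which is why the encoding step has to combine the pair first.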

Steps to Reproduce

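// Logs every UTF-16 code unit of the input string.
// For '𣄃' (U+23103) this traces two values, one per surrogate.
export function debugUtf16(str: string): void {
  for (let i = 0; i < str.length; i++) {
    trace("charCodeAt", 1, str.charCodeAt(i));
  }
}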
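For comparison, a code-point-aware version of the same loop might look like the sketch below. It is written in plain TypeScript rather than AssemblyScript, and both `codePointAtIndex` and `listCodePoints` are hypothetical names introduced here, not existing stdlib functions.

```typescript
// Hypothetical helper: returns the code point that starts at index i,
// combining a valid high/low surrogate pair into a single value.
function codePointAtIndex(str: string, i: number): number {
  const hi = str.charCodeAt(i);
  if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < str.length) {
    const lo = str.charCodeAt(i + 1);
    if (lo >= 0xdc00 && lo <= 0xdfff) {
      // Standard UTF-16 decoding: 10 bits from each surrogate plus 0x10000.
      return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
    }
  }
  return hi; // BMP character or unpaired surrogate, returned as-is
}

// Hypothetical helper: walks the string one code point at a time.
function listCodePoints(str: string): number[] {
  const points: number[] = [];
  for (let i = 0; i < str.length; i++) {
    const cp = codePointAtIndex(str, i);
    points.push(cp);
    if (cp > 0xffff) i++; // skip the low surrogate that was just consumed
  }
  return points;
}

// "a", U+23103 (as a surrogate pair), "b"
const pts = listCodePoints("a\uD84C\uDD03b");
// pts holds three entries: 0x61, 0x23103, 0x62
```

For the pair D84C DD03 the arithmetic is 0x10000 + (0x4C << 10) + 0x103 = 0x23103.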

Would it be possible to:

Provide a built-in String.codePointAt(i: i32): i32 in the stdlib, to make proper iteration over Unicode strings easier?

Offer a utility that converts a string to a properly UTF-8 encoded Uint8Array, respecting surrogate pairs?

Possibly integrate this into String.UTF8.encode(), or add a safer variant?
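As a rough idea of the behavior being requested, the following sketch encodes a string to UTF-8 while combining surrogate pairs. It is plain TypeScript using `number[]` for portability; an AssemblyScript port would presumably use `i32` values and a `Uint8Array`. The name `encodeUtf8Sketch` is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption chosen to mirror what JavaScript's TextEncoder does.

```typescript
// Hypothetical sketch of a surrogate-aware UTF-8 encoder (not a stdlib API).
function encodeUtf8Sketch(str: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);

    // Combine a valid high/low surrogate pair into one code point.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < str.length) {
      const lo = str.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
        i++; // the low surrogate has been consumed
      }
    }

    // Assumption: unpaired surrogates become U+FFFD, as TextEncoder does.
    if (cp >= 0xd800 && cp <= 0xdfff) cp = 0xfffd;

    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return out;
}

// U+23103 encodes to the four bytes F0 A3 84 83.
const encoded = encodeUtf8Sketch("\uD84C\uDD03");
```

Feeding `encoded` to the hash function instead of per-code-unit bytes should give the same digest as hashing the output of TextEncoder on the JavaScript side.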

Steps to reproduce

None

AssemblyScript version

0.28.2

Issue #2932