Bug description
Summary
When passing a non-BMP character such as '𣄃' (U+23103) into an AssemblyScript function, the string is split into two UTF-16 code units (a surrogate pair), which is expected. However, this becomes problematic when encoding the string into a UTF-8 byte stream for hashing (e.g., MD5), as AssemblyScript provides no built-in way to handle a surrogate pair as a single Unicode code point.
This leads to incorrect behavior, especially in cryptographic applications (e.g., generating an MD5 hash of a string containing such characters), where the string encoding must exactly match the UTF-8 produced by JavaScript's TextEncoder.
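To illustrate why the hash ends up different, here is a minimal sketch in plain TypeScript (not AssemblyScript). It assumes a hand-written encoder that treats each UTF-16 code unit as its own code point; `encodePerCodeUnit` is a hypothetical name, not something from the stdlib or from my actual code.

```typescript
// Hypothetical naive encoder: treats every UTF-16 code unit as a separate
// code point, so a surrogate pair is never recombined.
function encodePerCodeUnit(str: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < str.length; i++) {
    const u = str.charCodeAt(i);
    if (u < 0x80) {
      out.push(u);
    } else if (u < 0x800) {
      out.push(0xc0 | (u >> 6), 0x80 | (u & 0x3f));
    } else {
      // Surrogates (0xD800..0xDFFF) fall through here and get 3 bytes each.
      out.push(0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f));
    }
  }
  return out;
}

// U+23103 as its surrogate pair: 0xD84C 0xDD03.
const naive = encodePerCodeUnit("\uD84C\uDD03");
// naive is ED A1 8C ED B4 83: six bytes, which is not valid UTF-8.
// The correct UTF-8 encoding of U+23103 is the four bytes F0 A3 84 83.
```

Any hash computed over the six-byte sequence will not match one computed over the four-byte sequence.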
Steps to Reproduce
// Logs each UTF-16 code unit of the input string.
export function debugUtf16(str: string): void {
  for (let i = 0; i < str.length; i++) {
    trace("charCodeAt", 1, str.charCodeAt(i));
  }
}
Would it be possible to:
Provide a built-in String.codePointAt(i: i32): i32 in the stdlib to make proper iteration over Unicode strings easier?
Offer a utility to convert a string to a properly UTF-8 encoded Uint8Array, respecting surrogate pairs?
Possibly integrate this into String.UTF8.encode() or add a safer variant?
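For reference, the behavior I have in mind could look roughly like the sketch below. It is plain TypeScript rather than AssemblyScript (a port would presumably use `i32`/`u8` types), and `codePointAtIndex` and `encodeUtf8` are hypothetical names, not existing stdlib functions. One assumption worth flagging: unpaired surrogates are replaced with U+FFFD, which is my understanding of what TextEncoder does.

```typescript
// Hypothetical helper: read the code point that starts at UTF-16 index i.
// Assumption: an unpaired surrogate is reported as U+FFFD.
function codePointAtIndex(str: string, i: number): number {
  const unit = str.charCodeAt(i);
  if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < str.length) {
    const next = str.charCodeAt(i + 1);
    if (next >= 0xdc00 && next <= 0xdfff) {
      // Valid pair: recombine into a single supplementary code point.
      return 0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00);
    }
  }
  if (unit >= 0xd800 && unit <= 0xdfff) return 0xfffd; // lone surrogate
  return unit;
}

// Hypothetical helper: encode a string as UTF-8, one code point at a time.
function encodeUtf8(str: string): Uint8Array {
  const bytes: number[] = [];
  let i = 0;
  while (i < str.length) {
    const cp = codePointAtIndex(str, i);
    // Only a recombined pair is above 0xFFFF, and it consumed two code units.
    i += cp > 0xffff ? 2 : 1;
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return Uint8Array.from(bytes);
}

// U+23103 (surrogate pair 0xD84C 0xDD03) encodes to F0 A3 84 83.
const utf8Bytes = encodeUtf8("\uD84C\uDD03");
```

Feeding `utf8Bytes` into the hash should then give the same digest as hashing the output of TextEncoder on the JavaScript side.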
Steps to reproduce
None
AssemblyScript version
0.28.2