Fast codePointCount implementation for BytesRef #140388

Merged
parkertimmins merged 12 commits into elastic:main from parkertimmins:parker/fast-code-point-count
Jan 12, 2026

@parkertimmins
Contributor

@parkertimmins parkertimmins commented Jan 8, 2026

Lucene's UnicodeUtil.codePointCount is used to count the number of code points in a UTF-8 encoded Unicode string. It processes a single byte at a time. We can improve upon this by loading 8 bytes into a long and processing them at once.
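
For illustration, here is a minimal sketch of the "8 bytes in a long" idea. It is not the code merged in this PR. It assumes the input is valid UTF-8, in which case the code point count equals the byte length minus the number of continuation bytes (those of the form `10xxxxxx`). The class name `SwarCodePointCount`, the `VarHandle`-based load, and the exact bit trick are assumptions chosen for the example, and unlike Lucene's method this sketch does no validation of malformed input.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only; assumes valid UTF-8 input.
final class SwarCodePointCount {
    // Reads 8 bytes of a byte[] as one long. Byte order does not matter here
    // because we only count flagged bytes, never look at their positions.
    private static final VarHandle LONG_VIEW =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // The top bit of each of the 8 byte lanes.
    private static final long HIGH_BITS = 0x8080808080808080L;

    static int codePointCount(byte[] bytes, int offset, int length) {
        int end = offset + length;
        int continuationBytes = 0;
        int i = offset;

        // Main loop: 8 bytes per iteration.
        for (; i <= end - Long.BYTES; i += Long.BYTES) {
            long word = (long) LONG_VIEW.get(bytes, i);
            // A continuation byte has bit 7 set and bit 6 clear. Shifting left
            // by one moves each lane's bit 6 into its bit 7 position, so
            // word & ~(word << 1) has bit 7 set exactly for continuation bytes.
            long flags = word & ~(word << 1) & HIGH_BITS;
            continuationBytes += Long.bitCount(flags);
        }

        // Tail: the remaining 0..7 bytes, one at a time.
        for (; i < end; i++) {
            if ((bytes[i] & 0xC0) == 0x80) {
                continuationBytes++;
            }
        }

        // Every byte that is not a continuation byte starts a code point.
        return length - continuationBytes;
    }
}
```

With a `BytesRef ref`, such a helper would be called as `SwarCodePointCount.codePointCount(ref.bytes, ref.offset, ref.length)`.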

@parkertimmins parkertimmins requested review from martijnvg and romseygeek and removed request for martijnvg January 8, 2026 18:06
@parkertimmins parkertimmins marked this pull request as ready for review January 8, 2026 18:07
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Collaborator

Hi @parkertimmins, I've created a changelog YAML for you.

@parkertimmins parkertimmins self-assigned this Jan 8, 2026
@parkertimmins parkertimmins added the test-release Trigger CI checks against release build label Jan 8, 2026
Member

@martijnvg martijnvg left a comment

One minor comment, LGTM otherwise.

Contributor

@romseygeek romseygeek left a comment

LGTM!

@parkertimmins
Contributor Author

The failing tests are:

Since this is unrelated to this PR, I'll go ahead with the merge.

@parkertimmins parkertimmins merged commit 6c1e866 into elastic:main Jan 12, 2026
36 of 39 checks passed
@ChrisHegarty
Contributor

The failing tests are:

Since this is unrelated to this PR, I'll go ahead with the merge.

For reference, PR to fix the unrelated test failures - #140557

@nik9000
Member

nik9000 commented Jan 13, 2026

Neat! Did you get any performance numbers on this one?

I imagine this could be plugged into a bunch of other places too.

@parkertimmins
Contributor Author

@nik9000
Good question! I ran some Rally tracks at the time, but didn't think to add a microbenchmark. Here are some microbenchmark results: #140591 (comment)

@parkertimmins parkertimmins deleted the parker/fast-code-point-count branch January 13, 2026 16:37
eranweiss-elastic pushed a commit to eranweiss-elastic/elasticsearch that referenced this pull request Jan 15, 2026
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
parkertimmins added a commit that referenced this pull request Jan 21, 2026
Add a Panama SIMD implementation of codePointCount. Keep the SWAR version from #140388 as a fallback if SIMD is not available. This results in a very large speedup on long strings, for example those over 100 bytes. Lucene's UnicodeUtil.codePointCount remains faster for small strings, so continue to use it if the byte length is below a threshold.