Fast codePointCount implementation for BytesRef by parkertimmins · Pull Request #140388 · elastic/elasticsearch

parkertimmins · 2026-01-08T17:30:55Z

Lucene's UnicodeUtil.codePointCount is used to count the number of code points in a Unicode string. It processes a single byte at a time. We can improve upon this by loading 8 bytes into a long and processing them at once.
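To illustrate the idea of processing 8 bytes per step, here is a minimal sketch, not necessarily the code in this PR. It assumes valid UTF-8 input and relies on the fact that the number of code points equals the number of bytes that are not continuation bytes (`10xxxxxx`). The class name `SwarCodePointCount` and the method `count` are hypothetical names chosen for this example.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of Elasticsearch or Lucene.
final class SwarCodePointCount {
    // Reads 8 consecutive bytes of a byte[] as one long.
    // Byte order is irrelevant here because we only count per-byte bit patterns.
    private static final VarHandle LONG_VIEW =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // Bit 7 of every byte in a 64-bit word.
    private static final long HIGH_BITS = 0x8080808080808080L;

    /** Counts code points in bytes[offset, offset + length), assuming valid UTF-8. */
    static int count(byte[] bytes, int offset, int length) {
        int pos = offset;
        final int limit = offset + length;
        int continuations = 0;

        // Bulk loop: examine 8 bytes per iteration.
        for (; pos + Long.BYTES <= limit; pos += Long.BYTES) {
            long word = (long) LONG_VIEW.get(bytes, pos);
            // A continuation byte has bit 7 set and bit 6 clear.
            // (word << 1) moves each byte's bit 6 into its bit 7 position,
            // so the masked result has bit 7 set exactly for continuation bytes.
            long isContinuation = word & ~(word << 1) & HIGH_BITS;
            continuations += Long.bitCount(isContinuation);
        }

        // Scalar tail for the remaining 0 to 7 bytes.
        for (; pos < limit; pos++) {
            if ((bytes[pos] & 0xC0) == 0x80) {
                continuations++;
            }
        }
        return length - continuations;
    }
}
```

For a `BytesRef`-like value, a call would pass its backing array, offset, and length, for example `SwarCodePointCount.count(ref.bytes, ref.offset, ref.length)`. The sketch does not validate the input, so malformed UTF-8 yields an unspecified count.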

elasticsearchmachine · 2026-01-08T18:07:45Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine · 2026-01-08T18:07:46Z

Hi @parkertimmins, I've created a changelog YAML for you.

martijnvg

One minor comment, LGTM otherwise.

...g/elasticsearch/index/mapper/blockloader/docvalues/fn/Utf8CodePointsFromOrdsBlockLoader.java

romseygeek

LGTM!

parkertimmins · 2026-01-12T22:47:38Z

The failing tests are:

TimeSeriesFirstDocIdDeduplicationTests.testSimpleToString
WindowGroupingAggregatorFunctionTests.testSimpleToString
RowInTableLookupOperatorTests.testSimpleToString
BlockHashTests.testLongBytesRefHashWithMultiValuedFields
These are also failing on other PRs with the test-release label, for example #140519 (Fix EXTENDED_DOC_VALUES_PARAMS test cluster feature flag.). This is likely caused by swiss tables (#139343, Introduce SwissTable-based hashing (LongSwissHash, BytesRefSwissHash) for ES|QL STATS) still being behind a feature flag.

Since this is unrelated to this PR, I'll go ahead with the merge.

ChrisHegarty · 2026-01-13T10:42:38Z

For reference, PR to fix the unrelated test failures - #140557

nik9000 · 2026-01-13T13:08:40Z

Neat! Did you get any performance numbers on this one?

I imagine this could be plugged into a bunch of other places too.

parkertimmins · 2026-01-13T16:37:02Z

@nik9000
Good question! I ran some rally tracks at the time, but didn't think to add a microbenchmark. Here are some microbenchmark results: #140591 (comment)

Add Panama SIMD implementation of codePointCount. Keep SWAR version from #140388 as fallback if SIMD not available. This results in a very large speedup on long strings, for example those over 100 bytes. Lucene's UnicodeUtil.codePointCount remains faster for small strings, so continue to use this version if byte length is below a threshold.
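The size-based selection described above can be sketched as follows. This is a hedged illustration only: the class name, the method names, and the cutoff value `BULK_THRESHOLD_BYTES = 64` are all assumptions made for this example, and `countBulk` is a plain stand-in for a SIMD or SWAR path rather than an actual vectorized implementation. Valid UTF-8 input is assumed.

```java
// Hypothetical illustration class; names and threshold are assumptions.
final class CodePointCountDispatch {
    // Assumed cutoff for this example; the source only says "a threshold".
    static final int BULK_THRESHOLD_BYTES = 64;

    /** Scalar path: jump from lead byte to lead byte (valid UTF-8 assumed). */
    static int countByLeadBytes(byte[] bytes, int offset, int length) {
        int pos = offset;
        final int limit = offset + length;
        int count = 0;
        while (pos < limit) {
            int v = bytes[pos] & 0xFF;
            if (v < 0x80) {
                pos += 1; // 0xxxxxxx: one-byte sequence
            } else if (v < 0xE0) {
                pos += 2; // 110xxxxx: two-byte sequence
            } else if (v < 0xF0) {
                pos += 3; // 1110xxxx: three-byte sequence
            } else {
                pos += 4; // 11110xxx: four-byte sequence
            }
            count++;
        }
        return count;
    }

    /** Stand-in for a bulk (SIMD or SWAR) path: count non-continuation bytes. */
    static int countBulk(byte[] bytes, int offset, int length) {
        int count = 0;
        for (int pos = offset; pos < offset + length; pos++) {
            if ((bytes[pos] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    /** Picks the scalar path for short inputs and the bulk path otherwise. */
    static int count(byte[] bytes, int offset, int length) {
        return length < BULK_THRESHOLD_BYTES
            ? countByLeadBytes(bytes, offset, length)
            : countBulk(bytes, offset, length);
    }
}
```

Both paths return the same count for valid UTF-8, so the threshold only affects speed. A suitable cutoff would have to come from benchmarks on the target hardware.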

Faster codePointCount implementation

7ce4a92

elasticsearchmachine added the v9.4.0 label Jan 8, 2026

parkertimmins and others added 4 commits January 8, 2026 11:32

Remove unneeded code

c1e5745

Merge branch 'main' into parker/fast-code-point-count

e2ecfc7

[CI] Auto commit changes from spotless

32f46b8

Wrap codePointCount in feature flag

4bca805

parkertimmins requested review from martijnvg and romseygeek and removed request for martijnvg January 8, 2026 18:06

parkertimmins added >enhancement :StorageEngine/Codec labels Jan 8, 2026

parkertimmins marked this pull request as ready for review January 8, 2026 18:07

parkertimmins requested a review from martijnvg January 8, 2026 18:07

elasticsearchmachine added the Team:StorageEngine label Jan 8, 2026

Update docs/changelog/140388.yaml

7384d10

[CI] Auto commit changes from spotless

113f397

parkertimmins self-assigned this Jan 8, 2026

parkertimmins added the test-release Trigger CI checks against release build label Jan 8, 2026

martijnvg approved these changes Jan 9, 2026

romseygeek approved these changes Jan 9, 2026

parkertimmins added 5 commits January 9, 2026 10:27

review feedback

ac0fd35

Merge branch 'main' into parker/fast-code-point-count

c4221e5

Merge branch 'main' into parker/fast-code-point-count

204fde2

Merge branch 'main' into parker/fast-code-point-count

809aa29

Merge branch 'main' into parker/fast-code-point-count

964e810

parkertimmins merged commit 6c1e866 into elastic:main Jan 12, 2026
36 of 39 checks passed

martijnvg mentioned this pull request Jan 13, 2026

Fix EXTENDED_DOC_VALUES_PARAMS test cluster feature flag. #140519

Merged

ChrisHegarty mentioned this pull request Jan 13, 2026

Examine a Panama Vector implementation of Fast codePointCount #140567

Closed

parkertimmins mentioned this pull request Jan 13, 2026

Add benchmark to test fastCodePointCount #140591

Open

parkertimmins deleted the parker/fast-code-point-count branch January 13, 2026 16:37

parkertimmins mentioned this pull request Jan 15, 2026

Panama vector implementation of codePointCount #140693

Merged

parkertimmins merged 12 commits into elastic:main from parkertimmins:parker/fast-code-point-count
