Panama vector implementation of codePointCount #140693

Merged
parkertimmins merged 12 commits into elastic:main from parkertimmins:parker/simd-code-point-count
Jan 21, 2026

@parkertimmins
Contributor

@parkertimmins parkertimmins commented Jan 14, 2026

Add a Panama SIMD implementation of codePointCount, keeping the SWAR version from #140388 as the fallback when SIMD is not available. This yields a very large speedup on long strings, for example those over 100 bytes. Lucene's UnicodeUtil.codePointCount remains faster for short strings, so continue to use it when the byte length is below a threshold.

Fixes #140567
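
For readers unfamiliar with the technique, here is a minimal sketch of the counting idea behind the scalar and SWAR variants, assuming valid UTF-8 input: the number of code points equals the number of bytes that are not continuation bytes (`10xxxxxx`). The class and method names are hypothetical, this is not the code from the PR or from Lucene, and the Panama SIMD variant is omitted because it needs the incubating `jdk.incubator.vector` module.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; names are not taken from the PR.
final class CodePointCountSketch {
    // Reads 8 bytes of a byte[] as one long. Byte order is irrelevant for a popcount.
    private static final VarHandle LONG_VIEW =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);
    private static final long HIGH_BITS = 0x8080808080808080L;

    // Scalar: count every byte that is not a continuation byte (10xxxxxx).
    static int scalarCount(byte[] utf8, int offset, int length) {
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // SWAR: test 8 bytes per iteration, then finish the tail byte by byte.
    static int swarCount(byte[] utf8, int offset, int length) {
        int continuation = 0;
        int i = offset;
        int end = offset + length;
        for (; i + Long.BYTES <= end; i += Long.BYTES) {
            long word = (long) LONG_VIEW.get(utf8, i);
            // A continuation byte has bit 7 set and bit 6 clear. (~word << 1) moves
            // each byte's inverted bit 6 into bit 7, so the AND keeps one bit per
            // continuation byte.
            continuation += Long.bitCount(word & (~word << 1) & HIGH_BITS);
        }
        for (; i < end; i++) {
            if ((utf8[i] & 0xC0) == 0x80) {
                continuation++;
            }
        }
        return length - continuation;
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo w\u00f6rld \u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(scalarCount(utf8, 0, utf8.length));
        System.out.println(swarCount(utf8, 0, utf8.length));
    }
}
```

Both methods should agree with `String.codePointCount` on well-formed input; neither validates the encoding.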

@elasticsearchmachine
Collaborator

Hi @parkertimmins, I've created a changelog YAML for you.

@parkertimmins
Contributor Author

parkertimmins commented Jan 14, 2026

Here are the results from the attached benchmark. (Edit: This is without the short string fallback added in the most recent commit.)

Benchmark                                        (avgNumCodePoints)   (type)   Mode  Cnt    Score    Error   Units
CodePointCountBenchmark.elasticsearchPanamaSimd                   1    ascii  thrpt    5  392.061 ± 12.547  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                   1  unicode  thrpt    5  381.483 ± 11.495  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  10    ascii  thrpt    5  142.437 ±  0.482  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  10  unicode  thrpt    5  124.374 ±  3.003  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                 100    ascii  thrpt    5   74.260 ±  1.471  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                 100  unicode  thrpt    5   70.212 ±  1.589  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                1000    ascii  thrpt    5   20.442 ±  0.062  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                1000  unicode  thrpt    5   17.511 ±  0.065  ops/us
CodePointCountBenchmark.elasticsearchSwar                         1    ascii  thrpt    5  404.306 ±  0.425  ops/us
CodePointCountBenchmark.elasticsearchSwar                         1  unicode  thrpt    5  381.424 ±  2.183  ops/us
CodePointCountBenchmark.elasticsearchSwar                        10    ascii  thrpt    5  169.570 ±  0.409  ops/us
CodePointCountBenchmark.elasticsearchSwar                        10  unicode  thrpt    5  139.186 ±  6.222  ops/us
CodePointCountBenchmark.elasticsearchSwar                       100    ascii  thrpt    5   71.561 ±  2.894  ops/us
CodePointCountBenchmark.elasticsearchSwar                       100  unicode  thrpt    5   46.120 ±  1.608  ops/us
CodePointCountBenchmark.elasticsearchSwar                      1000    ascii  thrpt    5   15.409 ±  0.357  ops/us
CodePointCountBenchmark.elasticsearchSwar                      1000  unicode  thrpt    5    2.890 ±  0.164  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                         1    ascii  thrpt    5  522.604 ±  0.379  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                         1  unicode  thrpt    5  634.277 ±  4.087  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                        10    ascii  thrpt    5  164.069 ± 10.773  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                        10  unicode  thrpt    5  133.430 ± 23.857  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                       100    ascii  thrpt    5   29.975 ±  1.692  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                       100  unicode  thrpt    5    4.746 ±  0.034  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                      1000    ascii  thrpt    5    5.206 ±  0.060  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                      1000  unicode  thrpt    5    0.464 ±  0.002  ops/us

There's some speedup for longer strings, but some slowdown for shorter strings. Perhaps we should use Lucene's UnicodeUtil if length is below some threshold.

@parkertimmins
Contributor Author

Reran the benchmarks, but with fallback to Lucene's version if byte length is below 16:

Benchmark                                        (avgNumCodePoints)   (type)   Mode  Cnt    Score    Error   Units
CodePointCountBenchmark.elasticsearchPanamaSimd                   1    ascii  thrpt    5  491.943 ± 43.694  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                   1  unicode  thrpt    5  594.924 ± 23.935  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  10    ascii  thrpt    5  167.875 ±  5.113  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  10  unicode  thrpt    5  102.182 ±  2.807  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                 100    ascii  thrpt    5   82.447 ±  1.209  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                 100  unicode  thrpt    5   77.838 ±  2.199  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                1000    ascii  thrpt    5   19.279 ±  0.186  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                1000  unicode  thrpt    5   16.961 ±  0.551  ops/us
CodePointCountBenchmark.elasticsearchSwar                         1    ascii  thrpt    5  498.372 ±  8.909  ops/us
CodePointCountBenchmark.elasticsearchSwar                         1  unicode  thrpt    5  598.291 ± 22.627  ops/us
CodePointCountBenchmark.elasticsearchSwar                        10    ascii  thrpt    5  178.102 ±  1.184  ops/us
CodePointCountBenchmark.elasticsearchSwar                        10  unicode  thrpt    5  128.568 ±  0.205  ops/us
CodePointCountBenchmark.elasticsearchSwar                       100    ascii  thrpt    5   82.428 ±  0.826  ops/us
CodePointCountBenchmark.elasticsearchSwar                       100  unicode  thrpt    5   60.443 ±  0.630  ops/us
CodePointCountBenchmark.elasticsearchSwar                      1000    ascii  thrpt    5   15.341 ±  0.111  ops/us
CodePointCountBenchmark.elasticsearchSwar                      1000  unicode  thrpt    5    2.979 ±  0.045  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                         1    ascii  thrpt    5  520.845 ±  0.479  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                         1  unicode  thrpt    5  632.268 ± 23.272  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                        10    ascii  thrpt    5  165.697 ±  4.694  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                        10  unicode  thrpt    5  139.422 ± 12.097  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                       100    ascii  thrpt    5   30.146 ±  0.910  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                       100  unicode  thrpt    5    4.816 ±  0.017  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                      1000    ascii  thrpt    5    5.223 ±  0.024  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                      1000  unicode  thrpt    5    0.464 ±  0.004  ops/us

@ChrisHegarty
Contributor

The latest variant with the fallback looks much better to me. Seems like a good improvement.

@parkertimmins parkertimmins changed the title codePointCount implementation using Panama vectors API Jan 15, 2026
@elasticsearchmachine
Collaborator

Hi @parkertimmins, I've updated the changelog YAML for you.

@parkertimmins parkertimmins marked this pull request as ready for review January 15, 2026 15:20
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@parkertimmins
Contributor Author

It looks worth using UnicodeUtil's scalar logic for short strings, SWAR for medium strings, and SIMD for long strings.
Based on the following comparisons, we can use scalar if length < 12, SWAR if length < 54, and SIMD otherwise.

This is based on which method had the highest throughput averaged over the ascii and unicode workloads. It's not very scientific, since the results are just from my machine, but it should be faster than the existing alternatives. These cases do come with a penalty of some additional branches, but the final results look fine. Unlike the final results, the code used to produce the following results does not branch on the length, i.e. each benchmark always uses only scalar, SWAR, or SIMD logic.

Benchmark                                  (avgNumCodePoints)   (type)   Mode  Cnt    Score    Error   Units
CodePointCountBenchmark.elasticsearchSwar                  10    ascii  thrpt    5  168.898 ±  5.322  ops/us
CodePointCountBenchmark.elasticsearchSwar                  10  unicode  thrpt    5  139.458 ±  3.935  ops/us
CodePointCountBenchmark.elasticsearchSwar                  12    ascii  thrpt    5  155.187 ±  6.162  ops/us
CodePointCountBenchmark.elasticsearchSwar                  12  unicode  thrpt    5  133.741 ±  2.512  ops/us
CodePointCountBenchmark.elasticsearchSwar                  14    ascii  thrpt    5  150.886 ±  2.933  ops/us
CodePointCountBenchmark.elasticsearchSwar                  14  unicode  thrpt    5  125.934 ±  1.087  ops/us
CodePointCountBenchmark.elasticsearchSwar                  16    ascii  thrpt    5  142.317 ±  8.220  ops/us
CodePointCountBenchmark.elasticsearchSwar                  16  unicode  thrpt    5  121.872 ±  3.241  ops/us
CodePointCountBenchmark.elasticsearchSwar                  18    ascii  thrpt    5  142.118 ± 12.871  ops/us
CodePointCountBenchmark.elasticsearchSwar                  18  unicode  thrpt    5  112.896 ±  1.921  ops/us
CodePointCountBenchmark.elasticsearchSwar                  20    ascii  thrpt    5  138.948 ±  2.241  ops/us
CodePointCountBenchmark.elasticsearchSwar                  20  unicode  thrpt    5  110.787 ±  0.336  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  10    ascii  thrpt    5  166.654 ±  1.524  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  10  unicode  thrpt    5  140.596 ±  2.295  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  12    ascii  thrpt    5  152.851 ±  5.299  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  12  unicode  thrpt    5  116.127 ±  3.584  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  14    ascii  thrpt    5  134.139 ±  1.527  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  14  unicode  thrpt    5   49.169 ±  4.266  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  16    ascii  thrpt    5  126.852 ±  1.798  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  16  unicode  thrpt    5   36.618 ±  0.753  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  18    ascii  thrpt    5  115.858 ±  9.273  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  18  unicode  thrpt    5   29.364 ±  1.376  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  20    ascii  thrpt    5  110.248 ±  2.926  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                  20  unicode  thrpt    5   27.028 ±  1.375  ops/us

Benchmark                                        (avgNumCodePoints)   (type)   Mode  Cnt   Score   Error   Units
CodePointCountBenchmark.elasticsearchPanamaSimd                  50    ascii  thrpt    5  83.212 ± 2.959  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  50  unicode  thrpt    5  80.679 ± 1.773  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  51    ascii  thrpt    5  85.774 ± 0.736  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  51  unicode  thrpt    5  80.545 ± 1.047  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  52    ascii  thrpt    5  85.965 ± 1.511  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  52  unicode  thrpt    5  79.711 ± 0.240  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  53    ascii  thrpt    5  84.434 ± 0.552  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  53  unicode  thrpt    5  80.509 ± 0.270  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  54    ascii  thrpt    5  84.115 ± 0.181  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  54  unicode  thrpt    5  79.113 ± 0.507  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  55    ascii  thrpt    5  86.246 ± 0.310  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  55  unicode  thrpt    5  80.114 ± 1.768  ops/us
CodePointCountBenchmark.elasticsearchSwar                        50    ascii  thrpt    5  97.923 ± 1.358  ops/us
CodePointCountBenchmark.elasticsearchSwar                        50  unicode  thrpt    5  68.959 ± 0.448  ops/us
CodePointCountBenchmark.elasticsearchSwar                        51    ascii  thrpt    5  98.680 ± 0.610  ops/us
CodePointCountBenchmark.elasticsearchSwar                        51  unicode  thrpt    5  70.186 ± 0.563  ops/us
CodePointCountBenchmark.elasticsearchSwar                        52    ascii  thrpt    5  97.497 ± 1.715  ops/us
CodePointCountBenchmark.elasticsearchSwar                        52  unicode  thrpt    5  68.939 ± 1.377  ops/us
CodePointCountBenchmark.elasticsearchSwar                        53    ascii  thrpt    5  96.742 ± 1.123  ops/us
CodePointCountBenchmark.elasticsearchSwar                        53  unicode  thrpt    5  67.667 ± 0.504  ops/us
CodePointCountBenchmark.elasticsearchSwar                        54    ascii  thrpt    5  94.754 ± 0.874  ops/us
CodePointCountBenchmark.elasticsearchSwar                        54  unicode  thrpt    5  66.587 ± 1.517  ops/us
CodePointCountBenchmark.elasticsearchSwar                        55    ascii  thrpt    5  94.518 ± 1.275  ops/us
CodePointCountBenchmark.elasticsearchSwar                        55  unicode  thrpt    5  67.504 ± 0.475  ops/us
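
As a worked example of the "averaged over ascii and unicode" comparison described above, the sketch below takes the mean of the two scores for a few rows copied from these tables. The class name is hypothetical, and only the rows with avgNumCodePoints 12 and 54 are used; whether those parameter values correspond exactly to the byte-length thresholds is not stated in the conversation.

```java
// Hypothetical helper for illustration; the scores are copied from the tables above.
final class ThroughputAverages {
    // Mean throughput (ops/us) across the ascii and unicode workloads.
    static double mean(double asciiOpsPerUs, double unicodeOpsPerUs) {
        return (asciiOpsPerUs + unicodeOpsPerUs) / 2.0;
    }

    public static void main(String[] args) {
        // avgNumCodePoints = 12: SWAR vs Lucene's UnicodeUtil
        double swar12 = mean(155.187, 133.741);
        double lucene12 = mean(152.851, 116.127);
        // avgNumCodePoints = 54: Panama SIMD vs SWAR
        double simd54 = mean(84.115, 79.113);
        double swar54 = mean(94.754, 66.587);

        System.out.printf("avgNumCodePoints=12: SWAR %.3f vs Lucene %.3f%n", swar12, lucene12);
        System.out.printf("avgNumCodePoints=54: SIMD %.3f vs SWAR %.3f%n", simd54, swar54);
    }
}
```

With these rows, SWAR averages roughly 144.5 ops/us against roughly 134.5 for Lucene at 12, and SIMD averages about 81.6 against about 80.7 for SWAR at 54.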

@parkertimmins
Contributor Author

And the final results, where SIMD falls back to SWAR if below 54, and scalar if below 12:

Benchmark                                        (avgNumCodePoints)   (type)   Mode  Cnt    Score    Error   Units
CodePointCountBenchmark.elasticsearchPanamaSimd                   1    ascii  thrpt    5  498.440 ±  4.280  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                   1  unicode  thrpt    5  583.670 ± 43.012  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                   5    ascii  thrpt    5  218.977 ±  1.408  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                   5  unicode  thrpt    5  198.659 ±  1.123  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  10    ascii  thrpt    5  140.810 ±  1.197  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  10  unicode  thrpt    5  131.686 ±  0.820  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  20    ascii  thrpt    5  111.924 ±  1.448  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  20  unicode  thrpt    5   91.100 ±  0.375  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  50    ascii  thrpt    5   90.012 ±  0.594  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                  50  unicode  thrpt    5   84.470 ±  0.184  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                 100    ascii  thrpt    5   79.084 ±  0.076  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                 100  unicode  thrpt    5   77.255 ±  1.004  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                1000    ascii  thrpt    5   20.280 ±  0.108  ops/us
CodePointCountBenchmark.elasticsearchPanamaSimd                1000  unicode  thrpt    5   17.082 ±  0.246  ops/us
CodePointCountBenchmark.elasticsearchSwar                         1    ascii  thrpt    5  490.761 ± 42.442  ops/us
CodePointCountBenchmark.elasticsearchSwar                         1  unicode  thrpt    5  605.348 ±  9.541  ops/us
CodePointCountBenchmark.elasticsearchSwar                         5    ascii  thrpt    5  218.971 ±  2.099  ops/us
CodePointCountBenchmark.elasticsearchSwar                         5  unicode  thrpt    5  188.434 ± 48.058  ops/us
CodePointCountBenchmark.elasticsearchSwar                        10    ascii  thrpt    5  141.422 ±  7.623  ops/us
CodePointCountBenchmark.elasticsearchSwar                        10  unicode  thrpt    5  127.145 ±  0.230  ops/us
CodePointCountBenchmark.elasticsearchSwar                        20    ascii  thrpt    5  119.830 ±  0.452  ops/us
CodePointCountBenchmark.elasticsearchSwar                        20  unicode  thrpt    5   98.323 ±  2.157  ops/us
CodePointCountBenchmark.elasticsearchSwar                        50    ascii  thrpt    5  104.755 ±  5.404  ops/us
CodePointCountBenchmark.elasticsearchSwar                        50  unicode  thrpt    5   77.345 ±  7.276  ops/us
CodePointCountBenchmark.elasticsearchSwar                       100    ascii  thrpt    5   86.380 ±  0.202  ops/us
CodePointCountBenchmark.elasticsearchSwar                       100  unicode  thrpt    5   40.965 ±  0.164  ops/us
CodePointCountBenchmark.elasticsearchSwar                      1000    ascii  thrpt    5   15.101 ±  0.271  ops/us
CodePointCountBenchmark.elasticsearchSwar                      1000  unicode  thrpt    5    2.865 ±  0.118  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                         1    ascii  thrpt    5  521.750 ±  4.053  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                         1  unicode  thrpt    5  620.325 ± 62.394  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                         5    ascii  thrpt    5  228.389 ± 11.363  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                         5  unicode  thrpt    5  226.859 ± 10.963  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                        10    ascii  thrpt    5  166.704 ±  1.751  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                        10  unicode  thrpt    5  141.457 ±  1.474  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                        20    ascii  thrpt    5  110.322 ±  3.208  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                        20  unicode  thrpt    5   26.322 ±  3.870  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                        50    ascii  thrpt    5   64.991 ±  4.327  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                        50  unicode  thrpt    5    9.541 ±  0.197  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                       100    ascii  thrpt    5   30.472 ±  0.446  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                       100  unicode  thrpt    5    4.748 ±  0.072  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                      1000    ascii  thrpt    5    5.184 ±  0.216  ops/us
CodePointCountBenchmark.luceneUnicodeUtil                      1000  unicode  thrpt    5    0.462 ±  0.006  ops/us
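
The tiered fallback measured above (scalar below 12, SWAR below 54, SIMD otherwise) could be sketched roughly as follows. This is an illustration under assumptions, not the merged code: the interface, class, and constant names are hypothetical, and the demo plugs the same simple scalar counter into all three tiers because the Panama variant requires the incubating `jdk.incubator.vector` module.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical interface; the PR's actual class layout is not shown in this conversation.
interface Utf8CodePointCounter {
    int count(byte[] utf8, int offset, int length);
}

// Dispatches on byte length using the thresholds quoted in the conversation.
final class TieredCodePointCounter implements Utf8CodePointCounter {
    static final int SCALAR_LIMIT = 12; // scalar if length < 12
    static final int SWAR_LIMIT = 54;   // SWAR if length < 54, SIMD otherwise

    private final Utf8CodePointCounter scalar;
    private final Utf8CodePointCounter swar;
    private final Utf8CodePointCounter simd;

    TieredCodePointCounter(Utf8CodePointCounter scalar, Utf8CodePointCounter swar, Utf8CodePointCounter simd) {
        this.scalar = scalar;
        this.swar = swar;
        this.simd = simd;
    }

    @Override
    public int count(byte[] utf8, int offset, int length) {
        if (length < SCALAR_LIMIT) {
            return scalar.count(utf8, offset, length);
        }
        if (length < SWAR_LIMIT) {
            return swar.count(utf8, offset, length);
        }
        return simd.count(utf8, offset, length);
    }

    // Simple reference counter: every byte that is not 10xxxxxx starts a code point.
    static int countNonContinuationBytes(byte[] utf8, int offset, int length) {
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Utf8CodePointCounter reference = TieredCodePointCounter::countNonContinuationBytes;
        Utf8CodePointCounter tiered = new TieredCodePointCounter(reference, reference, reference);
        byte[] utf8 = "na\u00efve caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(tiered.count(utf8, 0, utf8.length)); // 10 code points in 12 bytes
    }
}
```

Keeping the thresholds as named constants makes them easy to re-tune, since the author notes the numbers come from a single machine.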

Member

@martijnvg martijnvg left a comment

LGTM from my side.

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

@parkertimmins parkertimmins merged commit 20ea68a into elastic:main Jan 21, 2026
35 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Jan 21, 2026
…-tests

* upstream/main: (104 commits)
  Partition time-series source (elastic#140475)
  Mute org.elasticsearch.xpack.esql.heap_attack.HeapAttackSubqueryIT testManyRandomKeywordFieldsInSubqueryIntermediateResultsWithSortManyFields elastic#141083
  Reindex relocation: skip nodes marked for shutdown (elastic#141044)
  Make fails on fixture caching not fail image building (elastic#140959)
  Add multi-project tests for get and list reindex (elastic#140980)
  Painless docs overhaul (reference) (elastic#137211)
  Panama vector implementation of codePointCount (elastic#140693)
  Enable PromQL in release builds (elastic#140808)
  Update rest-api-spec for Jina embedding task (elastic#140696)
  [CI] ShardSearchPhaseAPMMetricsTests testUniformCanMatchMetricAttributesWhenPlentyOfDocumentsInIndex failed (elastic#140848)
  Combine hash computation with bloom filter writes/reads (elastic#140969)
  Refactor posting iterators to provide more information (elastic#141058)
  Wait for cluster to recover to yellow before checking index health (elastic#141057) (elastic#141065)
  Fix repo analysis read count assertions (elastic#140994)
  Fixed a bug in logsdb rolling upgrade sereverless tests involving par… (elastic#141022)
  Fix readiness edge case on startup (elastic#140791)
  PromQL: fix quantile function (elastic#141033)
  ignore `mmr` command for check (in development) (elastic#140981)
  Use Double.compare to compare doubles in tdigest.Sort (elastic#141049)
  Migrate third party module tests using legacy test clusters framework (elastic#140991)
  ...