Speed up bit compared with floats or bytes script operations#117199
elasticsearchmachine merged 6 commits into elastic:main
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Hi @benwtrent, I've created a changelog YAML for you.
```java
// now combine the two vectors, summing the byte dimensions where the bit in d is `1`
for (int i = 0; i < d.length; i++) {
    byte mask = d[i];
    acc0 += fma(q[i * Byte.SIZE + 0], (mask >> 7) & 1, acc0);
```
Overlooked this one initially, but shouldn't the additive component to fma be either 0, or should we just reset the value of acc0 (i.e. assign without +=)? I think we're making the addition twice for the accumulators for the last bits in lines 80-83.
@pmpailis you are correct ;) I did a bad copy-paste here. Tests have found it.
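A minimal sketch of the fix being discussed (illustrative, not the actual PR code): `Math.fma(a, b, c)` computes `a * b + c`, so the accumulator is already the addend and must be reassigned, never added to again.

```java
public final class BitDotProduct {
    // Illustrative sketch: dot product between a float query vector q
    // (one float per bit dimension) and a packed bit vector d.
    static float dotProductFloatUnwrap(float[] q, byte[] d) {
        float acc = 0f;
        for (int i = 0; i < d.length; i++) {
            byte mask = d[i];
            for (int b = 0; b < Byte.SIZE; b++) {
                // fma(a, b, c) = a * b + c: assign with `=`;
                // `acc += fma(..., acc)` would count acc twice.
                acc = Math.fma(q[i * Byte.SIZE + b], (mask >> (7 - b)) & 1, acc);
            }
        }
        return acc;
    }
}
```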
…#117199) Instead of doing an "if" statement, which doesn't lend itself to vectorization, I switched to expanding the bits and multiplying the 1s and 0s. This led to a marginal speed improvement on ARM. I expect that Panama vectors could be used here to be even faster, but I didn't want to spend any more time on this for the time being.
```
Benchmark                                               (dims)   Mode  Cnt  Score   Error  Units
IpBitVectorScorerBenchmark.dotProductByteIfStatement       768  thrpt    5  2.952 ± 0.026  ops/us
IpBitVectorScorerBenchmark.dotProductByteUnwrap            768  thrpt    5  4.017 ± 0.068  ops/us
IpBitVectorScorerBenchmark.dotProductFloatIfStatement      768  thrpt    5  2.987 ± 0.124  ops/us
IpBitVectorScorerBenchmark.dotProductFloatUnwrap           768  thrpt    5  4.726 ± 0.136  ops/us
```
Benchmark I used: https://gist.github.com/benwtrent/b0edb3975d2f03356c1a5ea84c72abc9
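The change described in the commit message can be sketched roughly like this for the byte path (a simplified, hypothetical reconstruction without the unrolled accumulators of the real patch):

```java
public final class BitDotCompare {
    // Branchy version: one conditional per bit, which the JIT
    // struggles to auto-vectorize.
    static int dotProductByteIf(byte[] q, byte[] d) {
        int acc = 0;
        for (int i = 0; i < d.length; i++) {
            for (int b = 0; b < Byte.SIZE; b++) {
                if (((d[i] >> (7 - b)) & 1) == 1) {
                    acc += q[i * Byte.SIZE + b];
                }
            }
        }
        return acc;
    }

    // Branchless version: extract each bit as 0/1 and multiply, turning the
    // body into straight-line multiply-adds that unroll and vectorize better.
    static int dotProductByteUnwrap(byte[] q, byte[] d) {
        int acc = 0;
        for (int i = 0; i < d.length; i++) {
            byte mask = d[i];
            for (int b = 0; b < Byte.SIZE; b++) {
                acc += q[i * Byte.SIZE + b] * ((mask >> (7 - b)) & 1);
            }
        }
        return acc;
    }
}
```

Both variants compute the same result; only the branchless one gives the compiler a predictable, data-independent instruction stream.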
```java
int acc2 = 0;
int acc3 = 0;
// now combine the two vectors, summing the byte dimensions where the bit in d is `1`
for (int i = 0; i < d.length; i++) {
```
Just a drive-by question here (feel free to disregard): is this intended to allow vectorization?
@svilen-mihaylov-db it allows some vectorization via the unrolling, but it definitely isn't as fast as a custom vectorized version that we could provide with the Panama API. This solution isn't as fast as it could be, for sure.
Mainly, I discovered it's much faster than the previous if block, so it's a step in the right direction :)
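A rough sketch of the unrolled shape implied by the acc0..acc3 accumulators in the diff (illustrative, not the actual PR code): independent accumulators break the loop-carried dependency chain, giving the JIT more freedom to schedule and vectorize the multiply-adds.

```java
public final class BitDotUnrolled {
    // Four independent accumulators, one per pair of bit positions; summed
    // once at the end. Same result as a single-accumulator loop.
    static int dotProduct(byte[] q, byte[] d) {
        int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        for (int i = 0; i < d.length; i++) {
            byte mask = d[i];
            int base = i * Byte.SIZE;
            acc0 += q[base + 0] * ((mask >> 7) & 1) + q[base + 1] * ((mask >> 6) & 1);
            acc1 += q[base + 2] * ((mask >> 5) & 1) + q[base + 3] * ((mask >> 4) & 1);
            acc2 += q[base + 4] * ((mask >> 3) & 1) + q[base + 5] * ((mask >> 2) & 1);
            acc3 += q[base + 6] * ((mask >> 1) & 1) + q[base + 7] * (mask & 1);
        }
        return acc0 + acc1 + acc2 + acc3;
    }
}
```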