Speed up bit compared with floats or bytes script operations#117199
elasticsearchmachine merged 6 commits into elastic:main
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Hi @benwtrent, I've created a changelog YAML for you.
```java
// now combine the two vectors, summing the byte dimensions where the bit in d is `1`
for (int i = 0; i < d.length; i++) {
    byte mask = d[i];
    acc0 += fma(q[i * Byte.SIZE + 0], (mask >> 7) & 1, acc0);
```
Overlooked this one initially, but shouldn't the additive component to fma be either 0, or should we just reset the value of acc0 (i.e. assign without +=)? I think we're making the addition twice for the accumulators for the last bits in lines 80-83.
@pmpailis you are correct ;) I did a bad copy-paste here. Tests have found it.
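A minimal sketch of the fix being discussed (illustrative, not the actual PR code): `Math.fma(a, b, c)` computes `a * b + c`, so the accumulator is already the addend and must be reassigned, never added to again.

```java
public final class BitDotProduct {
    // Illustrative sketch: dot product between a float query vector q
    // (one float per bit dimension) and a packed bit vector d.
    static float dotProductFloatUnwrap(float[] q, byte[] d) {
        float acc = 0f;
        for (int i = 0; i < d.length; i++) {
            byte mask = d[i];
            for (int b = 0; b < Byte.SIZE; b++) {
                // fma(a, b, c) = a * b + c: assign with `=`;
                // `acc += fma(..., acc)` would count acc twice.
                acc = Math.fma(q[i * Byte.SIZE + b], (mask >> (7 - b)) & 1, acc);
            }
        }
        return acc;
    }
}
```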
…#117199) Instead of doing an "if" statement, which doesn't lend itself to vectorization, I switched to expanding the bits and multiplying the 1s and 0s. This led to a marginal speed improvement on ARM. I expect that Panama vectors could be used here to be even faster, but I didn't want to spend any more time on this for the time being.
```
Benchmark                                               (dims)   Mode  Cnt  Score   Error  Units
IpBitVectorScorerBenchmark.dotProductByteIfStatement       768  thrpt    5  2.952 ± 0.026  ops/us
IpBitVectorScorerBenchmark.dotProductByteUnwrap            768  thrpt    5  4.017 ± 0.068  ops/us
IpBitVectorScorerBenchmark.dotProductFloatIfStatement      768  thrpt    5  2.987 ± 0.124  ops/us
IpBitVectorScorerBenchmark.dotProductFloatUnwrap           768  thrpt    5  4.726 ± 0.136  ops/us
```
Benchmark I used: https://gist.github.com/benwtrent/b0edb3975d2f03356c1a5ea84c72abc9
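The change described in the commit message can be sketched roughly like this for the byte path (a simplified, hypothetical reconstruction without the unrolled accumulators of the real patch):

```java
public final class BitDotCompare {
    // Branchy version: one conditional per bit, which the JIT
    // struggles to auto-vectorize.
    static int dotProductByteIf(byte[] q, byte[] d) {
        int acc = 0;
        for (int i = 0; i < d.length; i++) {
            for (int b = 0; b < Byte.SIZE; b++) {
                if (((d[i] >> (7 - b)) & 1) == 1) {
                    acc += q[i * Byte.SIZE + b];
                }
            }
        }
        return acc;
    }

    // Branchless version: extract each bit as 0/1 and multiply, turning the
    // body into straight-line multiply-adds that unroll and vectorize better.
    static int dotProductByteUnwrap(byte[] q, byte[] d) {
        int acc = 0;
        for (int i = 0; i < d.length; i++) {
            byte mask = d[i];
            for (int b = 0; b < Byte.SIZE; b++) {
                acc += q[i * Byte.SIZE + b] * ((mask >> (7 - b)) & 1);
            }
        }
        return acc;
    }
}
```

Both variants compute the same result; only the branchless one gives the compiler a predictable, data-independent instruction stream.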
```java
int acc2 = 0;
int acc3 = 0;
// now combine the two vectors, summing the byte dimensions where the bit in d is `1`
for (int i = 0; i < d.length; i++) {
```
Just a drive-by question here (feel free to disregard): is this intended to allow vectorization?
@svilen-mihaylov-db it allows some vectorization via the unrolling, but it definitely isn't as fast as a custom vectorized version that we could provide with the Panama API. This solution isn't as fast as it could be, for sure.
Mainly, I discovered it's much faster than the previous if block, so it's a step in the right direction :)
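A rough sketch of the unrolled shape implied by the acc0..acc3 accumulators in the diff (illustrative, not the actual PR code): independent accumulators break the loop-carried dependency chain, giving the JIT more freedom to schedule and vectorize the multiply-adds.

```java
public final class BitDotUnrolled {
    // Four independent accumulators, one per pair of bit positions; summed
    // once at the end. Same result as a single-accumulator loop.
    static int dotProduct(byte[] q, byte[] d) {
        int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        for (int i = 0; i < d.length; i++) {
            byte mask = d[i];
            int base = i * Byte.SIZE;
            acc0 += q[base + 0] * ((mask >> 7) & 1) + q[base + 1] * ((mask >> 6) & 1);
            acc1 += q[base + 2] * ((mask >> 5) & 1) + q[base + 3] * ((mask >> 4) & 1);
            acc2 += q[base + 4] * ((mask >> 3) & 1) + q[base + 5] * ((mask >> 2) & 1);
            acc3 += q[base + 6] * ((mask >> 1) & 1) + q[base + 7] * (mask & 1);
        }
        return acc0 + acc1 + acc2 + acc3;
    }
}
```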