[SIMD][ARM] Optimized native bulk dot product scoring for Int7 #138552
ldematte merged 21 commits into elastic:main from
Conversation
Note: this is still WIP as I want to at least test it on x64 before producing the new native library version (and bumping versions), but it's ready for review. An obvious quick next step is to do the same optimization for x64; then, we need to evaluate if we want to expand it to all scoring functions (sqrt/cos) and to float32 as well.
Similarities.dotProduct7uBulkWithOffsets(vectors, firstVector, dims, vectorPitch, ordinals, numNodes, scores);
// Java-side adjustment
var aOffset = Float.intBitsToFloat(vectors.asSlice(firstByteOffset + vectorLength, Float.BYTES).get(ValueLayout.JAVA_INT, 0));
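As a rough illustration of the layout this adjustment assumes, here is a plain-Java sketch (the class name `Int7ScoreAdjust`, the `vectorPitch` parameter, and the exact record layout are hypothetical, not the actual Elasticsearch code): each stored vector is `dims` int7 bytes followed by a 4-byte float correction term, which the Java side reads and applies to the raw dot product computed natively.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical layout: each record is `dims` int7 bytes followed by a float
// correction term, mirroring the slice-and-intBitsToFloat read above.
final class Int7ScoreAdjust {
    // Raw integer dot product against the vector at ordinal `ord`
    // (stands in for the native kernel in this sketch).
    static int rawDotProduct(byte[] query, ByteBuffer vectors, int vectorPitch,
                             int ord, int dims) {
        int base = ord * vectorPitch;          // start of this vector's record
        int sum = 0;
        for (int i = 0; i < dims; i++) {
            sum += query[i] * vectors.get(base + i); // int7 values fit in a signed byte
        }
        return sum;
    }

    // Java-side adjustment: read the float stored right after the vector bytes
    // and fold it into the raw score.
    static float adjustedScore(ByteBuffer vectors, int vectorPitch, int ord,
                               int dims, int rawDotProduct) {
        int base = ord * vectorPitch;
        float correction = vectors.getFloat(base + dims);
        return rawDotProduct + correction;
    }
}
```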
A note: I have tried to see if adjusting scores on the native side makes a difference. It's not an intensive operation, but the C++ compiler is able to vectorize it, and accessing correction data just after the vector is read improves data locality.
But it does not make a significant difference, at least not on ARM:
Adjust on Java side:

```
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk 1024    128 thrpt 5 382527,380 ± 1418,882 ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk 1024   1500 thrpt 5  30728,328 ±  563,632 ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk 1024 130000 thrpt 5    832,423 ±   25,523 ops/s
```
Adjust on native side:

```
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk 1024    128 thrpt 5 399578,109 ± 7202,504 ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk 1024   1500 thrpt 5  31304,660 ±  411,803 ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk 1024 130000 thrpt 5    837,141 ±   13,483 ops/s
```
These are single-digit percentage improvements, so I don't think it's worth it.
I decided to keep it simpler, but I will measure again when I optimize the x64 code, and if the difference is substantial I'll re-evaluate this.
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Hi @ldematte, I've created a changelog YAML for you.
(that's 30GB of vectors on a 32GB RAM Mac).
The default max mmap chunk size in Lucene is 16GB, so I do wonder what's going on here. Also, I don't remember the specifics, so I will need to check, but I believe that the code will slice across an index input backed by several memory segments, only scoring natively when the vector falls within a single segment and otherwise falling back to an on-heap copy and a slower scorer. But maybe the benchmark avoids all this.
That's interesting! The benchmark uses Lucene to open the vector data file, so it might very well fall back here.
It was Lucene's default max mmap chunk size of 16GB indeed. We might want to refine this, but given our use cases and constraints (segment size < 5GB), I think we are OK with this for now.
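For illustration, the boundary condition discussed here can be sketched as a small helper (the class and method names are hypothetical; the real logic lives in the memory-segment-backed index input): a vector record is only directly addressable for native scoring if it does not straddle a chunk boundary.

```java
// Hypothetical helper: with mmap-chunked files (Lucene defaults to 16 GiB
// chunks on 64-bit JVMs), a record that straddles a chunk boundary cannot be
// read from a single memory segment, forcing a fallback to an on-heap copy.
final class ChunkCheck {
    static final long DEFAULT_CHUNK_SIZE = 1L << 34; // 16 GiB

    // True if the byte range [offset, offset + length) spans two chunks.
    static boolean crossesChunkBoundary(long offset, int length, long chunkSize) {
        return offset / chunkSize != (offset + length - 1) / chunkSize;
    }
}
```

With 16 GiB chunks and segments kept under 5 GB, most records land entirely inside one chunk, which matches the "OK for now" conclusion above.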
libs/native/src/main/java/org/elasticsearch/nativeaccess/VectorSimilarityFunctions.java
Some preliminary x64 (AVX2) numbers: the performance boost for the sequential access pattern is significant; for the random access pattern, less so.
…search into simd/arm-optimized-bulk
@thecoop this is what I am working on (plus the x64 optimized variant). Tagging you so you can look around, and we can discuss how to expand this work (happy to chat about it any time this week or next).
Before merging, I wanted to execute some benchmarks on Graviton CPUs too. TL;DR: I experimented with prefetching as well, but it does not seem to bring good results on Graviton(*) - it actually has a negative effect on Graviton 2 for the sequential case. This needs more experimentation and explanation, but for the moment this PR is good to go as it is. (*) prefetching does a fantastic job on Apple silicon though, doubling performance again with the random access pattern, to the point that it's almost as fast as the sequential access pattern. It is definitely worth a second look.
In #138204 @benwtrent implemented bulk scoring for int7 centroid scoring.
Recent Lucene versions also introduced the possibility to provide specialized code for bulk scoring by overriding
RandomVectorScorer#bulkScore. We generalized Ben's native bulk int7 scoring implementation to work with the Lucene case too.
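The shape of this hook can be sketched with a locally defined interface, modeled loosely on Lucene's RandomVectorScorer#bulkScore (the exact Lucene signature may differ; this is an illustrative assumption, not the real API):

```java
// Sketch: a scorer exposes per-node scoring, plus a bulk hook that a
// specialized implementation can override to hand the whole batch to a
// native SIMD kernel instead of scoring one node at a time.
interface BulkScorer {
    float score(int node);

    // Default: fall back to scoring nodes individually.
    default void bulkScore(int[] nodes, float[] scores, int numNodes) {
        for (int i = 0; i < numNodes; i++) {
            scores[i] = score(nodes[i]);
        }
    }
}
```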
Furthermore, we explored the possibility of optimizing it a little. For the int7 centroid scoring case, vectors to score are accessed sequentially in memory, but in the Lucene/HNSW case, the access is random. We used benchmarks introduced in #138384 to assess how much the random memory access hurts performance, and the answer is: quite a lot.
Random access is almost 3x slower than sequential access.
Simply introducing a bulk API speeds things up a bit, by around 10 to 15%:
However, we want to improve on that and see whether we can prefetch relevant data to make memory access faster with a random access pattern. A good way to do it is to unroll the code so that we issue multiple memory access instructions at the same time. This way we can leverage the internal instruction scheduler of the CPU: the processor can "see what happens next" and plan accordingly (thanks @ChrisHegarty for the ARM code!)
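A scalar Java analogue of the unrolling idea (the real kernel is native ARM SIMD; this sketch only illustrates how independent accumulators break the serial dependency chain so several loads can be in flight at once):

```java
// 4-way unrolled dot product: four independent accumulators remove the
// load -> add -> load serial chain, letting the CPU's out-of-order engine
// issue multiple memory accesses concurrently.
final class UnrolledDot {
    static long dot(byte[] a, byte[] b) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int upper = a.length & ~3;    // round down to a multiple of 4
        int i = 0;
        for (; i < upper; i += 4) {   // unrolled main loop
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        long sum = s0 + s1 + s2 + s3;
        for (; i < a.length; i++) {   // scalar tail for leftover elements
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```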
Using this stratagem, we see better throughput, with a ~60% improvement (or 1.6x):
The unrolled/optimized code for ARM has a minor benefit in the sequential access pattern too; bulk scoring for int7 centroid scoring from #138204 goes from
to
a 10 to 20% improvement.