[SIMD][ARM] Optimized native bulk dot product scoring for Int7 #138552

Merged
ldematte merged 21 commits into elastic:main from ldematte:simd/arm-optimized-bulk
Dec 3, 2025
Conversation

@ldematte
Contributor

@ldematte ldematte commented Nov 25, 2025

In #138204 @benwtrent implemented bulk scoring for int7 centroid scoring.
Recent Lucene versions also introduced the possibility to provide specialized code for bulk scoring by overriding RandomVectorScorer#bulkScore.

We generalized Ben's native bulk int7 scoring implementation to work with the Lucene case too.

Furthermore, we explored the possibility of optimizing it a little. For the int7 centroid scoring case, vectors to score are accessed sequentially in memory, but in the Lucene/HNSW case, the access is random. We used benchmarks introduced in #138384 to assess how much the random memory access hurts performance, and the answer is: quite a lot:

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleSequential        1024        128000  thrpt    5  1438.851 ± 16.475  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandom            1024        128000  thrpt    5   538.936 ± 11.652  ops/s

Random access is almost 3x slower than sequential access.
Simply introducing a bulk API speeds things up a bit, by around 10 to 15%:

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024        128000  thrpt    5  589.787 ± 10.727  ops/s

However, we wanted to improve on that and see if we could prefetch relevant data to make memory access faster with a random access pattern. A good way to do this is to unroll the code so that we issue multiple memory access instructions at the same time. This way we can leverage the internal instruction scheduler of the CPU: the processor can "see what happens next" and plan accordingly (thanks @ChrisHegarty for the ARM code!)
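The unrolling idea can be sketched in plain scalar C (names and layout are hypothetical; the actual implementation uses ARM NEON intrinsics): score four target vectors per iteration so that four independent base addresses are computed and their loads issued back-to-back, giving the out-of-order engine latitude to overlap the random-access memory latency.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of an int7 dot product (values fit in 7 bits, so
 * int32 accumulation cannot overflow at these dimensions). */
static int32_t dot7u(const int8_t *a, const int8_t *b, size_t dims) {
    int32_t acc = 0;
    for (size_t i = 0; i < dims; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Hypothetical bulk entry point: 'ordinals' selects which vectors to
 * score (the random-access pattern), 'pitch' is the stride in bytes
 * between consecutive vectors. Unrolled by 4 so four independent loads
 * are in flight per iteration. */
void dot7u_bulk(const int8_t *query, const int8_t *vectors, size_t dims,
                size_t pitch, const int32_t *ordinals, size_t count,
                int32_t *scores) {
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        /* Four independent base addresses, computed up front. */
        const int8_t *v0 = vectors + (size_t)ordinals[i]     * pitch;
        const int8_t *v1 = vectors + (size_t)ordinals[i + 1] * pitch;
        const int8_t *v2 = vectors + (size_t)ordinals[i + 2] * pitch;
        const int8_t *v3 = vectors + (size_t)ordinals[i + 3] * pitch;
        scores[i]     = dot7u(query, v0, dims);
        scores[i + 1] = dot7u(query, v1, dims);
        scores[i + 2] = dot7u(query, v2, dims);
        scores[i + 3] = dot7u(query, v3, dims);
    }
    for (; i < count; i++) /* tail: fewer than 4 vectors left */
        scores[i] = dot7u(query, vectors + (size_t)ordinals[i] * pitch, dims);
}
```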

Using this stratagem, we see better throughput, with a ~60% improvement (or 1.6x):

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024        130000  thrpt    5     837.141 ±   13.483  ops/s

The unrolled/optimized code for ARM has a minor benefit in the sequential access pattern too; bulk scoring for int7 centroid scoring from #138204 goes from

Benchmark                                       (dims)   Mode  Cnt   Score   Error   Units
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     384  thrpt    5  60.206 ± 0.229  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     782  thrpt    5  26.151 ± 0.161  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk    1024  thrpt    5  23.916 ± 0.220  ops/ms

to

Int7ScorerBenchmark.scoreFromMemorySegmentBulk     384  thrpt    5  67.770 ± 0.427  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     782  thrpt    5  33.421 ± 0.133  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk    1024  thrpt    5  29.235 ± 0.246  ops/ms

a 10 to 20% improvement.

@ldematte ldematte changed the title [WIP] "Bulk with offset" native vector functions Nov 28, 2025
@ldematte
Contributor Author

Note: this is still WIP as I want to at least test it on x64 before producing the new native library version (and bump versions), but it's ready for review.

An obvious quick next step is to do the same optimization for x64; then, we need to evaluate if we want to expand it to all scoring functions (sqrt/cos) and to float32 as well.

Similarities.dotProduct7uBulkWithOffsets(vectors, firstVector, dims, vectorPitch, ordinals, numNodes, scores);

// Java-side adjustment
var aOffset = Float.intBitsToFloat(vectors.asSlice(firstByteOffset + vectorLength, Float.BYTES).get(ValueLayout.JAVA_INT, 0));
Contributor Author

A note: I tried to see if adjusting scores on the native side makes a difference. It's not a compute-intensive operation, but the C++ compiler is able to vectorize it, and accessing the correction data right after the vector is read improves data locality.
But it does not make a significant difference, at least not on ARM:

Adjust on Java-side

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024           128  thrpt    5  382527.380 ± 1418.882  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024          1500  thrpt    5   30728.328 ±  563.632  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024        130000  thrpt    5     832.423 ±   25.523  ops/s

Adjust on native side

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024           128  thrpt    5  399578.109 ± 7202.504  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024          1500  thrpt    5   31304.660 ±  411.803  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024        130000  thrpt    5     837.141 ±   13.483  ops/s

It's a single-digit % improvement, so I don't think it's worth it.
I decided to keep things simpler, but I will measure again when I optimize the code for x64, and if the difference is substantial I'll re-evaluate this.
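The native-side variant that was measured above can be sketched like this in C (the layout matches the Java snippet earlier, with the correction float stored right after the vector bytes; the combination formula shown here is purely illustrative, the real adjustment depends on the quantization scheme):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch only: each int7 vector is laid out as
 *   [dims bytes of quantized values][4-byte float correction]
 * Doing the adjustment natively means the correction float is read
 * immediately after the vector bytes, while that cache line is hot. */
float score_with_correction(const int8_t *query, float query_offset,
                            const uint8_t *vector, size_t dims) {
    int32_t raw = 0;
    for (size_t i = 0; i < dims; i++)
        raw += (int32_t)query[i] * (int32_t)vector[i];
    float v_offset; /* stored just past the vector bytes */
    memcpy(&v_offset, vector + dims, sizeof(float));
    /* Illustrative combination only, NOT the actual int7 correction
     * formula used by the scorer. */
    return (float)raw + query_offset + v_offset;
}
```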

@ldematte ldematte marked this pull request as ready for review November 28, 2025 10:11
@ldematte ldematte requested a review from a team as a code owner November 28, 2025 10:11
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label (Meta label for the Search Relevance team in Elasticsearch) Nov 28, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ldematte
Contributor Author

ldematte commented Nov 28, 2025

A final interesting tidbit to investigate: on macos, if the dataset size exceeds the RAM size, bulk scoring becomes significantly slower:
Edit: there is an explanation, and we fixed this. See comments below.

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandom            1024      30000000  thrpt    5   347,222 ±  6,881  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk        1024      30000000  thrpt    5   177,188 ±  4,264  ops/s

(that's 30GB of vectors on a 32GB RAM Mac).
I want to test this on Linux; it seems that creating separate slices on the memory-mapped file leads to better paging. Almost as if "manually" defining which portion of the file needs to be accessed leads to less thrashing/better handling of pages and caches.
IIRC this was (is?) the case with Windows too, where there are actually separate APIs to create a mmap and to map a view from the mmapped region. On Linux, it should be different. On Mac, I have no idea how it works :)
I'll test and see.

@ChrisHegarty
Contributor

ChrisHegarty commented Nov 28, 2025

A final interesting tidbit to investigate: on macos, if the dataset size exceeds the RAM size, bulk scoring becomes significantly slower:

...
(that's 30GB of vectors on a 32GB RAM Mac). I want to test this on Linux; it seems that creating separate slices on the memory-mapped file leads to better paging. Almost as if "manually" defining which portion of the file needs to be accessed leads to less thrashing/better handling of pages and caches. IIRC this was (is?) the case with Windows too, where there are actually separate APIs to create a mmap and to map a view from the mmapped region. On Linux, it should be different. On Mac, I have no idea how it works :) I'll test and see.

The default max mmap chunk size in Lucene is 16GB. So I do wonder what's going on here. Also, I don't remember the specifics, so I will need to check, but I believe that the code will slice across an index input backed by several memory segments. Only scoring natively when the vector falls within a single segment, otherwise falling back to a copy on-heap and slower scorer. But maybe the benchmark avoids all this.

@ldematte
Contributor Author

The default max mmap chunk size in Lucene is 16GB. So I do wonder what's going on here. Also, I don't remember the specifics, so I will need to check, but I believe that the code will slice across an index input backed by several memory segments. Only scoring natively when the vector falls within a single segment, otherwise falling back to a copy on-heap and slower scorer. But maybe the benchmark avoids all this.

That's interesting! The benchmark uses Lucene to open the vector data file, so it might very well fall back here.
Let me check

@ldematte
Contributor Author

ldematte commented Nov 28, 2025

It was Lucene default max mmap chunk size of 16GB indeed.
The problem is that we want to pass the whole dataset to the native functions, which compute the offsets by themselves. So we try to grab a MemorySegment for the whole segment, and this will fail.
I had a fallback in that case, but I did not consider that this might happen for different reasons.
I've changed the fallback function to account for this case, and just use the simple non-bulk path, instead of going directly to the fallbackScorer (which is slower). This makes things better, and now for datasets > 16GB it's not slower (same speed as non-bulk).

We might want to refine this, but given our use cases and constraints (segment size < 5GB), I think we are OK with this for now.

Contributor

@ChrisHegarty ChrisHegarty left a comment


Awesome change! 🚀

@ldematte
Contributor Author

ldematte commented Nov 28, 2025

Some preliminary x64 (AVX2) numbers:

Benchmark                                                              (dims)  (numVectors)   Mode  Cnt    Score    Error  Units
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandom            1024        100000  thrpt    5  339.004 ± 14.982  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk        1024        100000  thrpt    5  385.208 ± 27.041  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleSequential        1024        100000  thrpt    5  544.691 ± 15.329  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleSequentialBulk    1024        100000  thrpt    5  933.197 ± 71.425  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandom            1024       1000000  thrpt    5  288.494 ±  4.957  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk        1024       1000000  thrpt    5  355.963 ± 10.555  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleSequential        1024       1000000  thrpt    5  522.417 ± 24.982  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleSequentialBulk    1024       1000000  thrpt    5  902.250 ± 20.645  ops/s

You can see the performance boost for the sequential access pattern is significant; for the random access pattern, less so.
It varies across runs, but that's somewhat expected: we do not do anything toward prefetching.
That's why we need to follow up with an optimized x64 implementation. I'll do that next.

@ldematte ldematte added the test-arm label (Pull Requests that should be tested against arm agents) Nov 28, 2025
@ldematte ldematte requested a review from thecoop December 2, 2025 08:11
@ldematte
Contributor Author

ldematte commented Dec 2, 2025

@thecoop this is what I am working on (plus the x64 optimized variant). Tagging you so you can look around, and we can discuss how to expand this work (happy to chat about it any time this week or the next).

@ldematte
Contributor Author

ldematte commented Dec 3, 2025

Before merging, I wanted to execute some benchmarks on Graviton CPUs too.

TL;DR:
with a sequential access pattern, bulk does not bring a significant performance benefit.
with a random access pattern, bulk brings a good boost of ~1.7x on Graviton 2 (c6g.2xlarge) and ~2x on Graviton 4 (c8g.2xlarge) with respect to the non-bulk version.

I experimented with prefetching too, but this does not seem to bring good results on Graviton(*) - it actually has a negative effect on Graviton 2 for the sequential case. This needs more experimentation and explanation, but for the moment this PR is good to go as it is.

(*) prefetching does a fantastic job on Apple silicon though, doubling performance again with the random access pattern, to the point that it's almost as fast as the sequential access pattern. It is definitely worth a second look.
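For reference, the prefetching experiment mentioned above can be sketched like this (names and the prefetch distance are hypothetical and would need per-CPU tuning): while scoring ordinal i, hint the CPU to start pulling in the vector for a later ordinal. `__builtin_prefetch` is a GCC/Clang builtin that lowers to PRFM on AArch64; it is only a hint and may be a no-op.

```c
#include <stdint.h>
#include <stddef.h>

#define PREFETCH_AHEAD 4 /* hypothetical distance; needs tuning per CPU */

/* Random-access bulk scoring with software prefetch: before scoring
 * ordinal i, request the vector for ordinal i + PREFETCH_AHEAD so its
 * cache lines are (hopefully) resident by the time we reach it. */
void dot7u_bulk_prefetch(const int8_t *query, const int8_t *vectors,
                         size_t dims, size_t pitch, const int32_t *ordinals,
                         size_t count, int32_t *scores) {
    for (size_t i = 0; i < count; i++) {
        if (i + PREFETCH_AHEAD < count)
            __builtin_prefetch(
                vectors + (size_t)ordinals[i + PREFETCH_AHEAD] * pitch,
                0 /* read */, 1 /* low temporal locality */);
        const int8_t *v = vectors + (size_t)ordinals[i] * pitch;
        int32_t acc = 0;
        for (size_t j = 0; j < dims; j++)
            acc += (int32_t)query[j] * (int32_t)v[j];
        scores[i] = acc;
    }
}
```

Whether this wins depends heavily on the hardware prefetcher and memory subsystem, which would be consistent with the mixed Graviton vs. Apple silicon results above.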

@ldematte ldematte enabled auto-merge (squash) December 3, 2025 17:23
@ldematte ldematte merged commit 0dffc46 into elastic:main Dec 3, 2025
36 of 40 checks passed
@ldematte ldematte deleted the simd/arm-optimized-bulk branch December 3, 2025 21:01

Labels

>enhancement
:Search Relevance/Vectors (Vector search)
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
test-arm (Pull Requests that should be tested against arm agents)
v9.3.0

3 participants