[SIMD][ARM] Optimized native bulk dot product scoring for Int7 #138552

Merged
ldematte merged 21 commits into elastic:main from ldematte:simd/arm-optimized-bulk
Dec 3, 2025
Conversation

@ldematte
Contributor

@ldematte ldematte commented Nov 25, 2025

In #138204 @benwtrent implemented bulk scoring for int7 centroid scoring.
Recent Lucene versions also introduced the possibility to provide specialized code for bulk scoring by overriding RandomVectorScorer#bulkScore.

We generalized Ben's native bulk int7 scoring implementation to work with the Lucene case too.

Furthermore, we explored the possibility of optimizing it a little. For the int7 centroid scoring case, vectors to score are accessed sequentially in memory, but in the Lucene/HNSW case, the access is random. We used benchmarks introduced in #138384 to assess how much the random memory access hurts performance, and the answer is: quite a lot:

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleSequential        1024        128000  thrpt    5  1438.851 ± 16.475  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandom            1024        128000  thrpt    5   538.936 ± 11.652  ops/s

Random access is almost 3x slower than sequential access.
Simply introducing a bulk API speeds things up a bit, by around 10 to 15%:

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024        128000  thrpt    5  589.787 ± 10.727  ops/s

However, we wanted to improve on that and see if we could prefetch relevant data to make memory access faster with a random access pattern. A good way to do this is to unroll the code so that we issue multiple memory access instructions at the same time. This way we can leverage the internal instruction scheduler of the CPU: the processor can "see what happens next" and plan accordingly (thanks @ChrisHegarty for the ARM code!)
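The unrolling idea can be sketched in plain scalar C (names and layout are hypothetical; the actual implementation uses ARM NEON intrinsics): score four target vectors per iteration so that four independent base addresses are computed and their loads issued back-to-back, giving the out-of-order engine latitude to overlap the random-access memory latency.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of an int7 dot product (values fit in 7 bits, so
 * int32 accumulation cannot overflow at these dimensions). */
static int32_t dot7u(const int8_t *a, const int8_t *b, size_t dims) {
    int32_t acc = 0;
    for (size_t i = 0; i < dims; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Hypothetical bulk entry point: 'ordinals' selects which vectors to
 * score (the random-access pattern), 'pitch' is the stride in bytes
 * between consecutive vectors. Unrolled by 4 so four independent loads
 * are in flight per iteration. */
void dot7u_bulk(const int8_t *query, const int8_t *vectors, size_t dims,
                size_t pitch, const int32_t *ordinals, size_t count,
                int32_t *scores) {
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        /* Four independent base addresses, computed up front. */
        const int8_t *v0 = vectors + (size_t)ordinals[i]     * pitch;
        const int8_t *v1 = vectors + (size_t)ordinals[i + 1] * pitch;
        const int8_t *v2 = vectors + (size_t)ordinals[i + 2] * pitch;
        const int8_t *v3 = vectors + (size_t)ordinals[i + 3] * pitch;
        scores[i]     = dot7u(query, v0, dims);
        scores[i + 1] = dot7u(query, v1, dims);
        scores[i + 2] = dot7u(query, v2, dims);
        scores[i + 3] = dot7u(query, v3, dims);
    }
    for (; i < count; i++) /* tail: fewer than 4 vectors left */
        scores[i] = dot7u(query, vectors + (size_t)ordinals[i] * pitch, dims);
}
```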

Using this stratagem, we see better throughput, with a ~60% improvement (or 1.6x):

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024        130000  thrpt    5     837.141 ±   13.483  ops/s

The unrolled/optimized code for ARM has a minor benefit in the sequential access pattern too; bulk scoring for int7 centroid scoring from #138204 goes from

Benchmark                                       (dims)   Mode  Cnt   Score   Error   Units
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     384  thrpt    5  60.206 ± 0.229  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     782  thrpt    5  26.151 ± 0.161  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk    1024  thrpt    5  23.916 ± 0.220  ops/ms

to

Int7ScorerBenchmark.scoreFromMemorySegmentBulk     384  thrpt    5  67.770 ± 0.427  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     782  thrpt    5  33.421 ± 0.133  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk    1024  thrpt    5  29.235 ± 0.246  ops/ms

a 10 to 20% improvement.

@ldematte ldematte changed the title [WIP] "Bulk with offset" native vector functions Nov 28, 2025
@ldematte
Contributor Author

Note: this is still WIP as I want to at least test it on x64 before producing the new native library version (and bump versions), but it's ready for review.

An obvious quick next step is to do the same optimization for x64; then, we need to evaluate if we want to expand it to all scoring functions (sqrt/cos) and to float32 as well.

Similarities.dotProduct7uBulkWithOffsets(vectors, firstVector, dims, vectorPitch, ordinals, numNodes, scores);

// Java-side adjustment
var aOffset = Float.intBitsToFloat(vectors.asSlice(firstByteOffset + vectorLength, Float.BYTES).get(ValueLayout.JAVA_INT, 0));
Contributor Author

A note: I tried to see if adjusting scores on the native side makes a difference. It's not a compute-intensive operation, but the C++ compiler is able to vectorize it, and accessing the correction data right after the vector is read improves data locality.
But it does not make a significant difference, at least not on ARM:

Adjust on Java-side

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024           128  thrpt    5  382527.380 ± 1418.882  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024          1500  thrpt    5   30728.328 ±  563.632  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024        130000  thrpt    5     832.423 ±   25.523  ops/s

Adjust on native side

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024           128  thrpt    5  399578.109 ± 7202.504  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024          1500  thrpt    5   31304.660 ±  411.803  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk    1024        130000  thrpt    5     837.141 ±   13.483  ops/s

It's a single-digit % improvement, so I don't think it's worth it.
I decided to keep things simpler, but I will measure again when I optimize the code for x64, and if the difference is substantial I'll re-evaluate this.
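The native-side variant that was measured above can be sketched like this in C (the layout matches the Java snippet earlier, with the correction float stored right after the vector bytes; the combination formula shown here is purely illustrative, the real adjustment depends on the quantization scheme):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch only: each int7 vector is laid out as
 *   [dims bytes of quantized values][4-byte float correction]
 * Doing the adjustment natively means the correction float is read
 * immediately after the vector bytes, while that cache line is hot. */
float score_with_correction(const int8_t *query, float query_offset,
                            const uint8_t *vector, size_t dims) {
    int32_t raw = 0;
    for (size_t i = 0; i < dims; i++)
        raw += (int32_t)query[i] * (int32_t)vector[i];
    float v_offset; /* stored just past the vector bytes */
    memcpy(&v_offset, vector + dims, sizeof(float));
    /* Illustrative combination only, NOT the actual int7 correction
     * formula used by the scorer. */
    return (float)raw + query_offset + v_offset;
}
```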

@ldematte ldematte marked this pull request as ready for review November 28, 2025 10:11
@ldematte ldematte requested a review from a team as a code owner November 28, 2025 10:11
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label (Meta label for the Search Relevance team in Elasticsearch) Nov 28, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ldematte
Contributor Author

ldematte commented Nov 28, 2025

A final interesting tidbit to investigate: on macos, if the dataset size exceeds the RAM size, bulk scoring becomes significantly slower:
Edit: there is an explanation, and we fixed this. See comments below.

VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandom            1024      30000000  thrpt    5   347,222 ±  6,881  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk        1024      30000000  thrpt    5   177,188 ±  4,264  ops/s

(that's 30GB of vectors on a 32GB RAM Mac).
I want to test this on Linux; it seems that creating separate slices on the memory-mapped file leads to better paging. Almost as if "manually" defining which portion of the file needs to be accessed leads to less thrashing/better handling of pages and caches.
IIRC this was (is?) the case with Windows too, where there are actually separate APIs to create a mmap and to map a view from the mmapped region. On Linux, it should be different. On Mac, I have no idea how it works :)
I'll test and see.

@ChrisHegarty
Contributor

ChrisHegarty commented Nov 28, 2025

A final interesting tidbit to investigate: on macos, if the dataset size exceeds the RAM size, bulk scoring becomes significantly slower:

...
(that's 30GB of vectors on a 32GB RAM Mac). I want to test this on Linux; it seems that creating separate slices on the memory-mapped file leads to better paging. Almost as if "manually" defining which portion of the file needs to be accessed leads to less thrashing/better handling of pages and caches. IIRC this was (is?) the case with Windows too, where there are actually separate APIs to create a mmap and to map a view from the mmapped region. On Linux, it should be different. On Mac, I have no idea how it works :) I'll test and see.

The default max mmap chunk size in Lucene is 16GB. So I do wonder what's going on here. Also, I don't remember the specifics, so I will need to check, but I believe that the code will slice across an index input backed by several memory segments. Only scoring natively when the vector falls within a single segment, otherwise falling back to a copy on-heap and slower scorer. But maybe the benchmark avoids all this.

@ldematte
Contributor Author

The default max mmap chunk size in Lucene is 16GB. So I do wonder what's going on here. Also, I don't remember the specifics, so I will need to check, but I believe that the code will slice across an index input backed by several memory segments. Only scoring natively when the vector falls within a single segment, otherwise falling back to a copy on-heap and slower scorer. But maybe the benchmark avoids all this.

That's interesting! The benchmark uses Lucene to open the vector data file, so it might very well fall back here.
Let me check

@ldematte
Contributor Author

ldematte commented Nov 28, 2025

It was Lucene default max mmap chunk size of 16GB indeed.
The problem is that we want to pass the whole dataset to the native functions, which compute the offsets by themselves. So we try to grab a MemorySegment for the whole segment, and this will fail.
I had a fallback in that case, but I did not consider that this might happen for different reasons.
I've changed the fallback function to account for this case, and just use the simple non-bulk path, instead of going directly to the fallbackScorer (which is slower). This makes things better, and now for datasets > 16GB it's not slower (same speed as non-bulk).

We might want to refine this, but given our use cases and constraints (segment size < 5GB), I think we are OK with this for now.

Contributor

@ChrisHegarty ChrisHegarty left a comment


Awesome change! 🚀

@ldematte
Contributor Author

ldematte commented Nov 28, 2025

Some preliminary x64 (AVX2) numbers:

Benchmark                                                              (dims)  (numVectors)   Mode  Cnt    Score    Error  Units
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandom            1024        100000  thrpt    5  339.004 ± 14.982  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk        1024        100000  thrpt    5  385.208 ± 27.041  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleSequential        1024        100000  thrpt    5  544.691 ± 15.329  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleSequentialBulk    1024        100000  thrpt    5  933.197 ± 71.425  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandom            1024       1000000  thrpt    5  288.494 ±  4.957  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleRandomBulk        1024       1000000  thrpt    5  355.963 ± 10.555  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleSequential        1024       1000000  thrpt    5  522.417 ± 24.982  ops/s
VectorScorerInt7uBulkBenchmark.dotProductNativeMultipleSequentialBulk    1024       1000000  thrpt    5  902.250 ± 20.645  ops/s

You can see the performance boost for the sequential access pattern is significant; for the random access pattern, less so.
It varies across runs, but that's somewhat expected: we do not do anything toward prefetching.
That's why we need to follow up with an optimized x64 implementation. I'll do that next.

@ldematte ldematte added the test-arm label (Pull Requests that should be tested against arm agents) Nov 28, 2025
@ldematte ldematte requested a review from thecoop December 2, 2025 08:11
@ldematte
Contributor Author

ldematte commented Dec 2, 2025

@thecoop this is what I am working on (plus the x64 optimized variant). Tagging you so you can look around, and we can discuss how to expand this work (happy to chat about it any time this week or the next).

@ldematte
Contributor Author

ldematte commented Dec 3, 2025

Before merging, I wanted to execute some benchmarks on Graviton CPUs too.

TL;DR:
with a sequential access pattern, bulk does not bring a significant performance benefit.
with a random access pattern, bulk brings a good boost of ~1.7x on Graviton 2 (c6g.2xlarge) and ~2x on Graviton 4 (c8g.2xlarge) with respect to the non-bulk version.

I experimented with prefetching too, but this does not seem to bring good results on Graviton(*) - it actually has a negative effect on Graviton 2 for the sequential case. This needs more experimentation and explanation, but for the moment this PR is good to go as it is.

(*) prefetching does a fantastic job on Apple silicon though, doubling performance again with the random access pattern, to the point that it's almost as fast as the sequential access pattern. It is definitely worth a second look.
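For reference, the prefetching experiment mentioned above can be sketched like this (names and the prefetch distance are hypothetical and would need per-CPU tuning): while scoring ordinal i, hint the CPU to start pulling in the vector for a later ordinal. `__builtin_prefetch` is a GCC/Clang builtin that lowers to PRFM on AArch64; it is only a hint and may be a no-op.

```c
#include <stdint.h>
#include <stddef.h>

#define PREFETCH_AHEAD 4 /* hypothetical distance; needs tuning per CPU */

/* Random-access bulk scoring with software prefetch: before scoring
 * ordinal i, request the vector for ordinal i + PREFETCH_AHEAD so its
 * cache lines are (hopefully) resident by the time we reach it. */
void dot7u_bulk_prefetch(const int8_t *query, const int8_t *vectors,
                         size_t dims, size_t pitch, const int32_t *ordinals,
                         size_t count, int32_t *scores) {
    for (size_t i = 0; i < count; i++) {
        if (i + PREFETCH_AHEAD < count)
            __builtin_prefetch(
                vectors + (size_t)ordinals[i + PREFETCH_AHEAD] * pitch,
                0 /* read */, 1 /* low temporal locality */);
        const int8_t *v = vectors + (size_t)ordinals[i] * pitch;
        int32_t acc = 0;
        for (size_t j = 0; j < dims; j++)
            acc += (int32_t)query[j] * (int32_t)v[j];
        scores[i] = acc;
    }
}
```

Whether this wins depends heavily on the hardware prefetcher and memory subsystem, which would be consistent with the mixed Graviton vs. Apple silicon results above.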

@ldematte ldematte enabled auto-merge (squash) December 3, 2025 17:23
@ldematte ldematte merged commit 0dffc46 into elastic:main Dec 3, 2025
36 of 40 checks passed
@ldematte ldematte deleted the simd/arm-optimized-bulk branch December 3, 2025 21:01

Labels

>enhancement
:Search Relevance/Vectors (Vector search)
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
test-arm (Pull Requests that should be tested against arm agents)
v9.3.0

3 participants