[SIMD][x64] Optimized native bulk dot product scoring for Int7 #139069
ldematte merged 7 commits into elastic:main from
Conversation
Hi @ldematte, I've created a changelog YAML for you.
Some benchmarks.
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
ChrisHegarty left a comment
This is a super change 🚀 I left some minor comments, but nothing too significant. LGTM.
#include <assert.h>

template <uintptr_t align>
static inline uintptr_t align_downwards(const void* ptr) {
classic alignment floor operation. Nice.
const int8_t* a0 = safe_mapper_offset<0, mapper>(a, pitch, offsets, count);
const int8_t* a1 = safe_mapper_offset<1, mapper>(a, pitch, offsets, count);

// Process 4 vectors at a time
This comment is technically correct, but it caused me to pause for a moment. We compute the dot for the next 2 vectors, and prefetch the subsequent two. All of which is good; it could just be worth a small clarification to the comment.
Oh no, that was left over -- as you can imagine from the benchmarks, I tried 4, 2, 1 :)
The comment is from the first iteration and did not get updated. Will fix it.
const int8_t* a2 = safe_mapper_offset<2, mapper>(a, pitch, offsets, count);
const int8_t* a3 = safe_mapper_offset<3, mapper>(a, pitch, offsets, count);

// Process 4 vectors at a time
Similar(ish) comment as the previous one.
Yeah, and here it is a "different" 4 (process 4, prefetch the other 4). Will expand the comment to reflect what is happening and why.
cc @benwtrent
…search into simd/x64-optimized-bulk
@ldematte Looks good to me! Thank you for benchmarking over different architectures!
This PR introduces an optimized native implementation (both for HNSW and for the "sequential" DiskBBQ usage) of bulk dot product scoring.
The optimization centers on data access, using a combination of loop unrolling and prefetching.
The whole effort was benchmark-driven, on multiple processors and generations (AMD EPYC 3rd gen and Intel Xeon 4th gen with AVX2, AMD EPYC 5th gen and Intel Xeon 6th gen for AVX-512).
Benchmarks revealed that the sweet spot is to prefetch 2 vectors "ahead" for AVX2 processors, and 4 vectors "ahead" for newer processors (those supporting AVX-512).
Compared to the non-optimized version, this change shows a further improvement between 20% and 50%, or up to 3x over the non-bulk case on some processors.