[SIMD][x64] Optimized native bulk dot product scoring for Int7#139069

Merged
ldematte merged 7 commits into elastic:main from ldematte:simd/x64-optimized-bulk
Dec 10, 2025
Conversation

@ldematte
Contributor

@ldematte ldematte commented Dec 4, 2025

This PR introduces an optimized native implementation of bulk dot product scoring, covering both the HNSW usage and the "sequential" DiskBBQ usage.

The optimization focuses on data access, combining loop unrolling with prefetching.
The whole effort was benchmark-driven, on multiple processors and generations (AMD EPYC 3rd gen and Intel Xeon 4th gen with AVX2, AMD EPYC 5th gen and Intel Xeon 6th gen for AVX-512).

Benchmarks revealed that the sweet spot is to prefetch 2 vectors "ahead" for AVX2 processors, and 4 vectors "ahead" for newer processors (those supporting AVX-512).
Compared to the non-optimized version, this change shows a further improvement between 20% and 50%, or up to 3x over the non-bulk case on some processors.
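
To illustrate the idea of prefetching a fixed number of vectors "ahead", here is a minimal scalar sketch. It is not the code in this PR: the names `dot_i8`, `prefetch_vector` and `bulk_dot_prefetch`, the parameter layout, and the 64-byte cache line size are all assumptions made for illustration, and the real kernels use AVX2/AVX-512 intrinsics with unrolling. `__builtin_prefetch` is the GCC/Clang builtin.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar stand-in for the SIMD kernel (assumption: the real code is vectorized).
static inline int32_t dot_i8(const int8_t* a, const int8_t* b, size_t dims) {
    int32_t acc = 0;
    for (size_t i = 0; i < dims; ++i) {
        acc += int32_t(a[i]) * int32_t(b[i]);
    }
    return acc;
}

// Hypothetical helper: request every cache line of one vector.
// Assumes 64-byte cache lines; a prefetch is only a hint to the CPU.
static inline void prefetch_vector(const int8_t* p, size_t dims) {
    for (size_t off = 0; off < dims; off += 64) {
        __builtin_prefetch(p + off, /*rw=*/0, /*locality=*/3);
    }
}

// Scores `count` vectors selected by `offsets` against `query`.
// `Ahead` is the prefetch distance: 2 and 4 are the values the
// description reports as sweet spots for AVX2 and AVX-512 processors.
template <size_t Ahead>
void bulk_dot_prefetch(const int8_t* base, size_t pitch,
                       const int32_t* offsets, size_t count,
                       const int8_t* query, size_t dims, float* out) {
    assert(pitch >= dims);
    for (size_t i = 0; i < count; ++i) {
        // Start loading a later vector while the current one is scored.
        if (i + Ahead < count) {
            prefetch_vector(base + size_t(offsets[i + Ahead]) * pitch, dims);
        }
        out[i] = float(dot_i8(base + size_t(offsets[i]) * pitch, query, dims));
    }
}
```

Calling `bulk_dot_prefetch<2>(...)` or `bulk_dot_prefetch<4>(...)` selects the distance at compile time, which is one way to keep a separate tuning per instruction set without a runtime branch in the loop.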

@ldematte ldematte added >enhancement :Search Relevance/Vectors Vector search test-arm Pull Requests that should be tested against arm agents labels Dec 4, 2025
@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ldematte
Contributor Author

ldematte commented Dec 4, 2025

Some benchmarks.
Base is the starting point from #138552:

Benchmark                                   (dims)  (numVectors)   Mode  Cnt     Score    Error  Units
Intel Xeon 3 - bulk base (random)            1024        100000  thrpt    5   385.208 ±  27.041  ops/s
Intel Xeon 3 - bulk base (random)            1024       1000000  thrpt    5   355.963 ±  10.555  ops/s
Intel Xeon 3 - bulk base (sequential)        1024       1000000  thrpt    5   902.250 ±  20.645  ops/s
Intel Xeon 3 - optimized (random)            1024        100000  thrpt    5   474.316 ±  23.093  ops/s
Intel Xeon 3 - optimized (random)            1024       1000000  thrpt    5   428.613 ±  24.635  ops/s
Intel Xeon 3 - optimized (sequential)        1024       1000000  thrpt    5  1056.787 ±   4.269  ops/s

AMD EPYC 3 (c6a) - bulk base (random)        1024       1000000  thrpt    5   485.967 ±   5.357  ops/s
AMD EPYC 3 (c6a) - optimized (random)        1024       1000000  thrpt    5   815.387 ±   9.895  ops/s
AMD EPYC 3 (c6a) - non-bulk  (sequential)    1024       1000000  thrpt    5  1488.855 ±   3.026  ops/s
AMD EPYC 3 (c6a) - optimized (sequential)    1024       1000000  thrpt    5  2158.516 ±   1.802  ops/s

AMD EPYC 5 (c8a) - non-bulk  (random)        1024       1000000  thrpt    5   521.555 ±  22.522  ops/s
AMD EPYC 5 (c8a) - bulk base (random)        1024       1000000  thrpt    5   983.750 ±   2.203  ops/s
AMD EPYC 5 (c8a) - optimized (random)        1024       1000000  thrpt    5  1510.839 ±  53.862  ops/s
AMD EPYC 5 (c8a) - optimized (sequential)    1024       1000000  thrpt    5  3600.388 ± 134.888  ops/s
@ldematte ldematte marked this pull request as ready for review December 4, 2025 21:19
@ldematte ldematte requested a review from a team as a code owner December 4, 2025 21:19
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Dec 4, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Contributor

@ChrisHegarty ChrisHegarty left a comment

This is a super change 🚀 I left some minor comments, but nothing too significant. LGTM.

#include <assert.h>

template <uintptr_t align>
static inline uintptr_t align_downwards(const void* ptr) {
Contributor

classic alignment floor operation. Nice.
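
For readers unfamiliar with the idiom, an alignment floor is typically written as a mask of the low bits. The snippet above shows only the signature, so the body below is an assumption about how such a function is commonly implemented, not the PR's actual code. It requires `align` to be a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Rounds the address of `ptr` down to the nearest multiple of `align`.
// Body is an illustrative assumption; only the signature comes from the diff.
template <uintptr_t align>
static inline uintptr_t align_downwards(const void* ptr) {
    static_assert(align != 0 && (align & (align - 1)) == 0,
                  "align must be a power of two");
    // Clearing the low log2(align) bits yields the floor.
    return reinterpret_cast<uintptr_t>(ptr) & ~(align - 1);
}
```

With `align = 64`, every address inside a cache line maps to the start of that line, which is handy for deciding which lines to prefetch.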

const int8_t* a0 = safe_mapper_offset<0, mapper>(a, pitch, offsets, count);
const int8_t* a1 = safe_mapper_offset<1, mapper>(a, pitch, offsets, count);

// Process 4 vectors at a time
Contributor

This comment is technically correct, but it caused me to pause for a moment. We compute the dot product for the next 2 vectors, and prefetch the subsequent two. All of which is good, but it could be worth a small clarification to the comment.

Contributor Author

Oh no, that was left over -- as you can imagine from the benchmarks, I tried 4, 2, 1 :)
The comment is from the first iteration and did not get updated. Will fix it.

const int8_t* a2 = safe_mapper_offset<2, mapper>(a, pitch, offsets, count);
const int8_t* a3 = safe_mapper_offset<3, mapper>(a, pitch, offsets, count);

// Process 4 vectors at a time
Contributor

Similar(ish) comment to the previous one.

Contributor Author

Yeah, and here it is a "different" 4 (process 4, prefetch the other 4). Will expand the comment to reflect what is happening and why.

@ldematte
Contributor Author

ldematte commented Dec 5, 2025

cc @benwtrent
Just FYI (don't feel obliged to review the C++ code, unless you want to!): thanks to better prefetching, this also gives x64 a ~10% improvement for the sequential-access bulk scoring (the DiskBBQ one).

@benwtrent
Member

@ldematte Looks good to me! Thank you for benchmarking over different architectures!

:shipit:

@ldematte ldematte merged commit c6f47cc into elastic:main Dec 10, 2025
40 checks passed
@ldematte ldematte deleted the simd/x64-optimized-bulk branch December 10, 2025 07:06

Labels

>enhancement :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch test-arm Pull Requests that should be tested against arm agents v9.3.0

4 participants