[SIMD][x64] Optimized native bulk dot product scoring for Int7 #139069
ldematte merged 7 commits into elastic:main from
Conversation
Hi @ldematte, I've created a changelog YAML for you.
Some benchmarks.
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
ChrisHegarty left a comment
This is a super change 🚀 I left some minor comments, but nothing too significant. LGTM.
#include <assert.h>

template <uintptr_t align>
static inline uintptr_t align_downwards(const void* ptr) {
classic alignment floor operation. Nice.
const int8_t* a0 = safe_mapper_offset<0, mapper>(a, pitch, offsets, count);
const int8_t* a1 = safe_mapper_offset<1, mapper>(a, pitch, offsets, count);

// Process 4 vectors at a time
This comment is technically correct, but it caused me to pause for a moment. We compute the dot for the next 2 vectors, and prefetch the subsequent two. All of which is good; it could just be worth a small clarification to the comment.
Oh no, that was left over -- as you can imagine from the benchmarks, I tried 4, 2, 1 :)
The comment is from the first iteration and did not get updated. Will fix it.
const int8_t* a2 = safe_mapper_offset<2, mapper>(a, pitch, offsets, count);
const int8_t* a3 = safe_mapper_offset<3, mapper>(a, pitch, offsets, count);

// Process 4 vectors at a time
Similar(ish) comment as the previous one.
Yeah, and here it is a "different" 4 (process 4, prefetch the other 4). Will expand the comment to reflect what is happening and why.
cc @benwtrent
…search into simd/x64-optimized-bulk
@ldematte Looks good to me! Thank you for benchmarking over different architectures!
This PR introduces an optimized native implementation (both for HNSW and for the "sequential" DiskBBQ usage) of bulk dot product scoring.
The optimization centers on data access, using a combination of loop unrolling and prefetching.
The whole effort was benchmark-driven, on multiple processors and generations (AMD EPYC 3rd gen and Intel Xeon 4th gen with AVX2, AMD EPYC 5th gen and Intel Xeon 6th gen for AVX-512).
Benchmarks revealed that the sweet spot is to prefetch 2 vectors "ahead" for AVX2 processors, and 4 vectors "ahead" for newer processors (those supporting AVX-512).
Compared to the non-optimized version, this change shows a further improvement between 20% and 50%, or up to 3x over the non-bulk case on some processors.