Add int7(8) bulk vector search micro benchmarks to include dataset larger than typical cache sizes#138384

Merged
ldematte merged 3 commits into elastic:main from
ldematte:bulk-int7-scorer-benchmarks
Nov 21, 2025
Conversation

@ldematte
Contributor

Relates to #138358

Benchmarks scoring operations against multiple vectors, accessing data sequentially or randomly, in an iterative way (for loop) vs. using explicit bulk operations(*).
The purpose is to highlight memory-level parallelism (or lack thereof), contention, and caching issues, and to provide a baseline for measuring potential optimizations. In particular, the random-access variants of the benchmarks should show differences with respect to linear access and dataset size (fits in L1 cache, fits in L2 cache, or needs to go to L3/memory).

(*) NOTE: at this time, bulk operations are implemented as plain for loops, so we won't see any difference yet; optimized implementations of the bulk operations will be addressed in a follow-up PR.
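To make the note above concrete, here is a minimal sketch of what such a bulk path might look like; the names are purely illustrative (this is not the actual Elasticsearch scorer API). The default "bulk" method just loops over ordinals, matching the current behavior the PR benchmarks, and an optimized implementation could later override it to batch memory accesses.

```java
public class BulkScorerSketch {
    // Hypothetical interface, for illustration only.
    interface BulkVectorScorer {
        float score(int ord); // score the query against one stored vector

        // Default "bulk" path: a plain for loop, no batching yet —
        // this is why bulk and non-bulk numbers currently match.
        default void bulkScore(int[] ords, float[] scores) {
            for (int i = 0; i < ords.length; i++) {
                scores[i] = score(ords[i]);
            }
        }
    }

    public static void main(String[] args) {
        // Toy scorer: the "score" is just the ordinal doubled.
        BulkVectorScorer scorer = ord -> ord * 2.0f;
        float[] scores = new float[3];
        scorer.bulkScore(new int[] {1, 2, 3}, scores);
        System.out.println(scores[0] + " " + scores[1] + " " + scores[2]); // 2.0 4.0 6.0
    }
}
```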

…nchmarks

# Conflicts:
#	benchmarks/src/main/java/org/elasticsearch/benchmark/vector/scorer/VectorScorerInt7uBenchmark.java
#	benchmarks/src/test/java/org/elasticsearch/benchmark/vector/scorer/VectorScorerInt7uBenchmarkTests.java
@ldematte ldematte added >test Issues or PRs that are addressing/adding tests :Search Relevance/Vectors Vector search labels Nov 20, 2025
@elasticsearchmachine elasticsearchmachine added v9.3.0 Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Nov 20, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Member

@benwtrent benwtrent left a comment

Good foundational work. I think this is nice!

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

@ldematte
Contributor Author

For the record, these are the current numbers on my Apple M2 silicon:

Benchmark                                                        (dims)  (numVectors)   Mode  Cnt     Score     Error  Units
Int7uBulkScorerBenchmark.dotProductLuceneMultipleRandom            1024        128000  thrpt    5   184,970 ±  42,695  ops/s
Int7uBulkScorerBenchmark.dotProductLuceneMultipleRandom            1024       1500000  thrpt    5   190,188 ±  25,491  ops/s
Int7uBulkScorerBenchmark.dotProductLuceneMultipleRandom            1024      30000000  thrpt    5   155,936 ±  13,074  ops/s
Int7uBulkScorerBenchmark.dotProductLuceneMultipleSequential        1024        128000  thrpt    5   308,332 ±   2,292  ops/s
Int7uBulkScorerBenchmark.dotProductLuceneMultipleSequential        1024       1500000  thrpt    5   308,641 ±   3,160  ops/s
Int7uBulkScorerBenchmark.dotProductLuceneMultipleSequential        1024      30000000  thrpt    5   283,195 ±   2,825  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandom            1024        128000  thrpt    5   455,813 ± 141,887  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandom            1024       1500000  thrpt    5   442,524 ±  65,760  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandom            1024      30000000  thrpt    5   279,743 ±  79,089  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandomBulk        1024        128000  thrpt    5   490,786 ±  35,508  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandomBulk        1024       1500000  thrpt    5   421,876 ±  52,435  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandomBulk        1024      30000000  thrpt    5   288,747 ±  60,848  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequential        1024        128000  thrpt    5  1388,678 ±  97,494  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequential        1024       1500000  thrpt    5  1371,374 ±  45,775  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequential        1024      30000000  thrpt    5  1293,271 ±  44,071  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequentialBulk    1024        128000  thrpt    5  1402,330 ±  82,153  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequentialBulk    1024       1500000  thrpt    5  1390,791 ± 113,286  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequentialBulk    1024      30000000  thrpt    5  1286,183 ±  72,599  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleRandom            1024        128000  thrpt    5   103,146 ±  21,808  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleRandom            1024       1500000  thrpt    5    94,810 ±  36,921  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleRandom            1024      30000000  thrpt    5    91,631 ±   7,844  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleSequential        1024        128000  thrpt    5   141,871 ±  18,292  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleSequential        1024       1500000  thrpt    5   146,074 ±   5,344  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleSequential        1024      30000000  thrpt    5   131,756 ±   8,769  ops/s

Bulk operations are currently no different from their non-bulk counterparts (they just iterate).

Unsurprisingly, sequential access ranges from about 1.5x faster (for scalar, the slowest method) to about 4x faster (for native, the fastest one). The cost of accessing memory out of order shows up more on the faster methods, where it can become the bottleneck. Can't wait to see whether we can optimize the bulk operations!
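As a standalone illustration of the two access patterns being compared (this is a toy sketch, not the benchmark code itself), the snippet below scores the same int7 dataset once with sequential ordinals and once with shuffled ordinals. Sequential scans are friendly to hardware prefetching, while the shuffled pass does identical arithmetic but defeats it, which is where the random-vs-sequential gap in the table comes from.

```java
import java.util.Random;

public class AccessPatternSketch {
    // Scalar int7 dot product: both operands hold values in [0, 127],
    // so the products fit comfortably in an int accumulator.
    static int dotProduct(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static byte[] randomInt7(Random rnd, int dims) {
        byte[] v = new byte[dims];
        for (int i = 0; i < dims; i++) {
            v[i] = (byte) rnd.nextInt(128); // unsigned 7-bit values
        }
        return v;
    }

    public static void main(String[] args) {
        int dims = 1024, numVectors = 1000;
        Random rnd = new Random(42);
        byte[] query = randomInt7(rnd, dims);
        byte[][] dataset = new byte[numVectors][];
        for (int i = 0; i < numVectors; i++) {
            dataset[i] = randomInt7(rnd, dims);
        }

        // Sequential access: ordinals 0..n-1 in order (prefetch friendly).
        long seqTotal = 0;
        for (int ord = 0; ord < numVectors; ord++) {
            seqTotal += dotProduct(query, dataset[ord]);
        }

        // Random access: same work over shuffled ordinals (defeats prefetching).
        int[] ords = new int[numVectors];
        for (int i = 0; i < numVectors; i++) ords[i] = i;
        for (int i = numVectors - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int t = ords[i]; ords[i] = ords[j]; ords[j] = t;
        }
        long rndTotal = 0;
        for (int ord : ords) {
            rndTotal += dotProduct(query, dataset[ord]);
        }

        // The same vectors are scored either way, so the totals must match;
        // only the wall-clock cost differs (measure with JMH, not here).
        System.out.println(seqTotal == rndTotal); // true
    }
}
```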

@ldematte ldematte merged commit 3666aad into elastic:main Nov 21, 2025
37 checks passed
@ldematte ldematte deleted the bulk-int7-scorer-benchmarks branch November 21, 2025 11:52
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Nov 26, 2025
