Add int7(8) bulk vector search micro benchmarks to include dataset larger than typical cache sizes#138384

Merged
ldematte merged 3 commits into elastic:main from
ldematte:bulk-int7-scorer-benchmarks
Nov 21, 2025
Conversation

@ldematte
Contributor

Relates to #138358

Benchmarks scoring operations against multiple vectors, accessing data sequentially or randomly, in an iterative way (for loop) vs. using explicit bulk operations(*).
The purpose is to highlight memory-level parallelism (or lack thereof), contention, and caching issues, and to provide a baseline for measuring potential optimizations. In particular, the random-access variants of the benchmarks should show differences with respect to linear access and dataset size (fits in L1 cache, fits in L2 cache, or needs to go to L3/memory).

(*) NOTE: at this time, bulk operations are implemented as plain for loops, so we won't see any difference yet; optimized implementations of the bulk operations will be addressed in a follow-up PR.
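To make the note above concrete, here is a minimal sketch of what such a bulk path might look like; the names are purely illustrative (this is not the actual Elasticsearch scorer API). The default "bulk" method just loops over ordinals, matching the current behavior the PR benchmarks, and an optimized implementation could later override it to batch memory accesses.

```java
public class BulkScorerSketch {
    // Hypothetical interface, for illustration only.
    interface BulkVectorScorer {
        float score(int ord); // score the query against one stored vector

        // Default "bulk" path: a plain for loop, no batching yet —
        // this is why bulk and non-bulk numbers currently match.
        default void bulkScore(int[] ords, float[] scores) {
            for (int i = 0; i < ords.length; i++) {
                scores[i] = score(ords[i]);
            }
        }
    }

    public static void main(String[] args) {
        // Toy scorer: the "score" is just the ordinal doubled.
        BulkVectorScorer scorer = ord -> ord * 2.0f;
        float[] scores = new float[3];
        scorer.bulkScore(new int[] {1, 2, 3}, scores);
        System.out.println(scores[0] + " " + scores[1] + " " + scores[2]); // 2.0 4.0 6.0
    }
}
```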

…nchmarks

# Conflicts:
#	benchmarks/src/main/java/org/elasticsearch/benchmark/vector/scorer/VectorScorerInt7uBenchmark.java
#	benchmarks/src/test/java/org/elasticsearch/benchmark/vector/scorer/VectorScorerInt7uBenchmarkTests.java
@ldematte ldematte added >test Issues or PRs that are addressing/adding tests :Search Relevance/Vectors Vector search labels Nov 20, 2025
@elasticsearchmachine elasticsearchmachine added v9.3.0 Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Nov 20, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Member

@benwtrent benwtrent left a comment

Good foundational work. I think this is nice!

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

@ldematte
Contributor Author

For the record, these are the current numbers on my Apple M2 silicon:

Benchmark                                                        (dims)  (numVectors)   Mode  Cnt     Score     Error  Units
Int7uBulkScorerBenchmark.dotProductLuceneMultipleRandom            1024        128000  thrpt    5   184,970 ±  42,695  ops/s
Int7uBulkScorerBenchmark.dotProductLuceneMultipleRandom            1024       1500000  thrpt    5   190,188 ±  25,491  ops/s
Int7uBulkScorerBenchmark.dotProductLuceneMultipleRandom            1024      30000000  thrpt    5   155,936 ±  13,074  ops/s
Int7uBulkScorerBenchmark.dotProductLuceneMultipleSequential        1024        128000  thrpt    5   308,332 ±   2,292  ops/s
Int7uBulkScorerBenchmark.dotProductLuceneMultipleSequential        1024       1500000  thrpt    5   308,641 ±   3,160  ops/s
Int7uBulkScorerBenchmark.dotProductLuceneMultipleSequential        1024      30000000  thrpt    5   283,195 ±   2,825  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandom            1024        128000  thrpt    5   455,813 ± 141,887  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandom            1024       1500000  thrpt    5   442,524 ±  65,760  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandom            1024      30000000  thrpt    5   279,743 ±  79,089  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandomBulk        1024        128000  thrpt    5   490,786 ±  35,508  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandomBulk        1024       1500000  thrpt    5   421,876 ±  52,435  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleRandomBulk        1024      30000000  thrpt    5   288,747 ±  60,848  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequential        1024        128000  thrpt    5  1388,678 ±  97,494  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequential        1024       1500000  thrpt    5  1371,374 ±  45,775  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequential        1024      30000000  thrpt    5  1293,271 ±  44,071  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequentialBulk    1024        128000  thrpt    5  1402,330 ±  82,153  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequentialBulk    1024       1500000  thrpt    5  1390,791 ± 113,286  ops/s
Int7uBulkScorerBenchmark.dotProductNativeMultipleSequentialBulk    1024      30000000  thrpt    5  1286,183 ±  72,599  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleRandom            1024        128000  thrpt    5   103,146 ±  21,808  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleRandom            1024       1500000  thrpt    5    94,810 ±  36,921  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleRandom            1024      30000000  thrpt    5    91,631 ±   7,844  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleSequential        1024        128000  thrpt    5   141,871 ±  18,292  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleSequential        1024       1500000  thrpt    5   146,074 ±   5,344  ops/s
Int7uBulkScorerBenchmark.dotProductScalarMultipleSequential        1024      30000000  thrpt    5   131,756 ±   8,769  ops/s

Bulk operations are currently no different from their non-bulk counterparts (they just iterate).

Unsurprisingly, sequential access ranges from about 1.5x faster (for scalar, the slowest method) to about 4x faster (for native, the fastest one). The cost of accessing memory out of order shows up more on the faster methods, where it can become the bottleneck. Can't wait to see whether we can optimize the bulk operations!
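As a standalone illustration of the two access patterns being compared (this is a toy sketch, not the benchmark code itself), the snippet below scores the same int7 dataset once with sequential ordinals and once with shuffled ordinals. Sequential scans are friendly to hardware prefetching, while the shuffled pass does identical arithmetic but defeats it, which is where the random-vs-sequential gap in the table comes from.

```java
import java.util.Random;

public class AccessPatternSketch {
    // Scalar int7 dot product: both operands hold values in [0, 127],
    // so the products fit comfortably in an int accumulator.
    static int dotProduct(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static byte[] randomInt7(Random rnd, int dims) {
        byte[] v = new byte[dims];
        for (int i = 0; i < dims; i++) {
            v[i] = (byte) rnd.nextInt(128); // unsigned 7-bit values
        }
        return v;
    }

    public static void main(String[] args) {
        int dims = 1024, numVectors = 1000;
        Random rnd = new Random(42);
        byte[] query = randomInt7(rnd, dims);
        byte[][] dataset = new byte[numVectors][];
        for (int i = 0; i < numVectors; i++) {
            dataset[i] = randomInt7(rnd, dims);
        }

        // Sequential access: ordinals 0..n-1 in order (prefetch friendly).
        long seqTotal = 0;
        for (int ord = 0; ord < numVectors; ord++) {
            seqTotal += dotProduct(query, dataset[ord]);
        }

        // Random access: same work over shuffled ordinals (defeats prefetching).
        int[] ords = new int[numVectors];
        for (int i = 0; i < numVectors; i++) ords[i] = i;
        for (int i = numVectors - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int t = ords[i]; ords[i] = ords[j]; ords[j] = t;
        }
        long rndTotal = 0;
        for (int ord : ords) {
            rndTotal += dotProduct(query, dataset[ord]);
        }

        // The same vectors are scored either way, so the totals must match;
        // only the wall-clock cost differs (measure with JMH, not here).
        System.out.println(seqTotal == rndTotal); // true
    }
}
```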

@ldematte ldematte merged commit 3666aad into elastic:main Nov 21, 2025
37 checks passed
@ldematte ldematte deleted the bulk-int7-scorer-benchmarks branch November 21, 2025 11:52
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Nov 26, 2025
