
[DiskBBQ] Add concurrency to KMeansLocal #139239

Merged
iverase merged 17 commits into elastic:main from iverase:kmeansconcurrency
Dec 11, 2025

Conversation

@iverase
Contributor

@iverase iverase commented Dec 9, 2025

The most expensive methods when running the k-means algorithm are the ones that assign the closest centroid and the SOAR assignment to each of the vectors. Still, those methods can easily be sliced and executed using more than one thread.

Therefore, this PR refactors the KMeansLocal class into an abstract class with two implementations: KMeansLocalSerial, which runs entirely on the current thread, and KMeansLocalConcurrent, which runs the assignments concurrently.

It provides a great speed-up for large merges while using almost the same resources: we only need to allocate an extra FixedBitSet per thread, with size equal to the number of centroids.
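To illustrate the slicing idea, here is a minimal sketch and not the code in this PR. The class and method names (`SlicedAssignmentSketch`, `assignSlice`, `assignSerial`, `assignConcurrent`) are hypothetical, `java.util.BitSet` stands in for Lucene's `FixedBitSet` so the example stays self-contained, and it assumes the per-thread bitset records which centroids had their membership change. The SOAR assignment is left out.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical sketch of sliced centroid assignment; not the Elasticsearch implementation. */
final class SlicedAssignmentSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static float squareDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Assigns each vector in [from, to) to its closest centroid.
     * Returns a bitset, private to this call, marking centroids whose membership changed.
     */
    static BitSet assignSlice(float[][] vectors, float[][] centroids, int[] assignments, int from, int to) {
        BitSet changed = new BitSet(centroids.length); // one scratch bitset per slice
        for (int i = from; i < to; i++) {
            int best = 0;
            float bestDist = Float.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                float d = squareDistance(vectors[i], centroids[c]);
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            if (assignments[i] != best) {
                if (assignments[i] >= 0) {
                    changed.set(assignments[i]); // the old centroid lost a vector
                }
                changed.set(best); // the new centroid gained a vector
                assignments[i] = best;
            }
        }
        return changed;
    }

    /** Serial variant: a single slice covering every vector, run on the calling thread. */
    static BitSet assignSerial(float[][] vectors, float[][] centroids, int[] assignments) {
        return assignSlice(vectors, centroids, assignments, 0, vectors.length);
    }

    /** Concurrent variant: disjoint slices run on the executor, per-slice bitsets are OR-ed together. */
    static BitSet assignConcurrent(float[][] vectors, float[][] centroids, int[] assignments,
                                   ExecutorService executor, int numSlices) {
        int n = vectors.length;
        int sliceSize = Math.max(1, (n + numSlices - 1) / numSlices);
        List<Future<BitSet>> futures = new ArrayList<>();
        for (int from = 0; from < n; from += sliceSize) {
            final int start = from;
            final int end = Math.min(n, from + sliceSize);
            // Slices never overlap, so each task writes to its own range of assignments.
            Callable<BitSet> task = () -> assignSlice(vectors, centroids, assignments, start, end);
            futures.add(executor.submit(task));
        }
        BitSet changed = new BitSet(centroids.length);
        try {
            for (Future<BitSet> future : futures) {
                changed.or(future.get()); // get() also makes the task's writes visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
        return changed;
    }
}
```

Because every slice owns a disjoint range of the assignments array, the only shared state to merge is the set of changed centroids. In this sketch, that is why the extra memory amounts to one bitset per slice, sized to the number of centroids.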

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label Dec 9, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Hi @iverase, I've created a changelog YAML for you.

@iverase iverase changed the title [DiskBBQ] Add concurrency on KMeansLocal Dec 9, 2025
Member

@benwtrent benwtrent left a comment

I also think we can build neighborhoods concurrently. Maybe that isn't that big of a cost right now and we want to only optimize one thing at a time.

@iverase
Contributor Author

iverase commented Dec 9, 2025

> I also think we can build neighborhoods concurrently. Maybe that isn't that big of a cost right now and we want to only optimize one thing at a time.

I plan to do it in a follow-up, as it needs a bit of analysis to make sure we make the right decision between brute force and building a graph.

Member

@davidkyle davidkyle left a comment

Nice change, much less complicated than I was expecting

Member

@davidkyle davidkyle left a comment

LGTM

Contributor

@john-wagster john-wagster left a comment

lgtm

@iverase iverase merged commit 2455b00 into elastic:main Dec 11, 2025
34 checks passed
@iverase iverase deleted the kmeansconcurrency branch December 11, 2025 07:13

Labels

>non-issue, :Search Relevance/Vectors, Team:Search Relevance, v9.3.0

5 participants