Skip to content

Track shardStarted events for simulation in DesiredBalanceComputer#133630

Merged
elasticsearchmachine merged 24 commits intoelastic:mainfrom
ywangd:ES-12723-test
Sep 17, 2025
Merged

Track shardStarted events for simulation in DesiredBalanceComputer#133630
elasticsearchmachine merged 24 commits intoelastic:mainfrom
ywangd:ES-12723-test

Conversation

@ywangd
Copy link
Member

@ywangd ywangd commented Aug 27, 2025

If a shard starts on the target node before the next ClusterInfo polling, today we don't include it for the simulation. With this PR, we track shards that can potentially start within one ClusterInfo polling cycle so that they are always included in simulation. The tracking is reset when a new ClusterInfo arrives.

Resolves: ES-12723

@ywangd ywangd added >enhancement :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v9.2.0 labels Aug 27, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @ywangd, I've created a changelog YAML for you.

@ywangd
Copy link
Member Author

ywangd commented Aug 27, 2025

I had some back-and-forth with the way to track shardStarted events. At the end, I decided to do it with mostly DesiredBalanceComputer since (1) it is the only place where it is needed and (2) less wiring changes compared to tracking inside InternalClusterInfoService. I am raising it as a draft to seek agreement on the approach. I will work on more tests if we are OK to proceed or I can take a different approach if folks are not happy with the current one. Thanks!

Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind the approach but I think it adds some complexity and state to an already quite complex/stateful bit of code

@ywangd
Copy link
Member Author

ywangd commented Aug 28, 2025

it adds some complexity and state to an already quite complex/stateful bit of code

I think it will have to add some complexity. But if we track the real shard started events, the complexity might be a bit less in DesiredBalanceComputer. I am thinking switching to that also because of this comment

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few initial comments, did not get into the weeds of the started simulations yet

return currentClusterInfo;
}

private void updateAndGetCurrentClusterInfo() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method name here hints that it should return the cluster info? That would seem nice to do, but I'd also be fine to just call it updateClusterInfo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it was an oversight and was meant to return the value. Fixed in 001af0d

@ywangd
Copy link
Member Author

ywangd commented Aug 28, 2025

I got some new idea after talking to Henning. I'll rework this PR. Please hold on your reviews. Thanks! 😅

@ywangd
Copy link
Member Author

ywangd commented Sep 8, 2025

As discussed previously, I pushed 85d0089 to implement simulation for started shards by diffing between the current RoutingNodes and last polled ClusterInfo. The logic is mostly in the new method DesiredBalanceComputer#simulateAlreadyStartedShards. Please see the inline comments for reasoning and discussions. I am keeping this PR as draft to get overall agreement on the new approach before adding more tests. Also resolved most previous comments since they no longer apply. Thanks a lot!

@ywangd
Copy link
Member Author

ywangd commented Sep 15, 2025

This PR is now ready for review. Thanks! 🙏

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Only skimmed through the tests, looks like adequate coverage but may be good to have a second review as well.

return currentClusterInfo;
}

private ClusterInfo updateAndGetCurrentClusterInfo() {
Copy link
Contributor

@nicktindall nicktindall Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in chat, the point of separating get and update is to remove the risk of someone calling get when a refresh is half-finished and receiving a ClusterInfo that was a mix of data from the current and previous refresh.

I wonder in light of this change whether we should make

leastAvailableSpaceUsages / mostAvailableSpaceUsages / nodeThreadPoolUsageStatsPerNode etc. all fields on the AsyncRefresh, so that state is kept private until the refresh was completed? Feels a bit safer than leaving it as fields on the enclosing class where it can be accessed whenever.

If not that we should at least put some doc here to indicate that using any of those fields directly isn't safe and the reason for the update/get split.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would mean rather than updateAndGet... living at the top level and accessing fields, it might need to change to being updateClusterInfo and accepting one that the AsyncRefresh had created from its private state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have a separate PR for the refactoring. For now I added comments in 95ba02d

// We use dataPath to find out whether a shard is allocated on a node.
// TODO: DataPath is sent with disk usages but thread pool usage is sent separately so that local shard allocation
// may change between the two calls.
if (clusterInfo.getDataPath(shardRouting) == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIt: I wonder if we could put this logic in the ClusterInfo and have the computer be more agnostic about how this is implemented.

i.e. if we implemented boolean ClusterInfo#shardHasMoved(ShardRouting) and then put this logic and an explanation of it in there. It might reduce the cognitive load for the reader of this method, and allow us to adapt the implementation of #shardHasMoved as the contents of ClusterInfo evolves.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Pushed 7bad062 Thanks!

final var firstNodeUpdatedStats = updatedClusterInfo.getNodeUsageStatsForThreadPools()
.get(firstNode.getId())
.threadPoolUsageStatsMap()
.get("write");
Copy link
Contributor

@nicktindall nicktindall Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can use ThreadPool.Names.WRITE here? (and above/below?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep see f43767d

Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just some proposals around some restructuring, feel free to ignore.

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 16, 2025
@elasticsearchmachine elasticsearchmachine merged commit f248596 into elastic:main Sep 17, 2025
34 checks passed
@ywangd ywangd deleted the ES-12723-test branch September 17, 2025 05:33
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Sep 17, 2025
…lastic#133630)

If a shard starts on the target node before the next ClusterInfo
polling, today we don't include it for the simulation. With this PR, we
track shards that can potentially start within one ClusterInfo polling
cycle so that they are always included in simulation. The tracking is
reset when a new ClusterInfo arrives.

Resolves: ES-12723
gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 18, 2025
…lastic#133630)

If a shard starts on the target node before the next ClusterInfo
polling, today we don't include it for the simulation. With this PR, we
track shards that can potentially start within one ClusterInfo polling
cycle so that they are always included in simulation. The tracking is
reset when a new ClusterInfo arrives.

Resolves: ES-12723
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 19, 2025
Only the overall ClusterInfo is needed at the top level. This PR moves
the individual intermediate stats fields onto AsyncRefresh to avoid
potential misuses.

Relates:
elastic#133630 (comment)
elasticsearchmachine pushed a commit that referenced this pull request Sep 19, 2025
Only the overall ClusterInfo is needed at the top level. This PR moves
the individual intermediate stats fields onto AsyncRefresh to avoid
potential misuses.

Relates:
#133630 (comment)
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this pull request Sep 24, 2025
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this pull request Sep 24, 2025
This was referenced Sep 24, 2025
DiannaHohensee pushed a commit that referenced this pull request Sep 24, 2025
* Revert "Move individual stats fields to AsyncRefresh (#135052)"

This reverts commit 2b0153b.

* Revert "Track shardStarted events for simulation in DesiredBalanceComputer (#133630)"

This reverts commit f248596.
DiannaHohensee pushed a commit that referenced this pull request Sep 24, 2025
* Revert "Move individual stats fields to AsyncRefresh (#135052)"

This reverts commit 2b0153b.

* Revert "Track shardStarted events for simulation in DesiredBalanceComputer (#133630)"

This reverts commit f248596.

* [CI] Update transport version definitions

* Revert "[CI] Update transport version definitions"

This reverts commit 90f38b0.

* Don't reset upper bounds (#135226)

Transport version upper bounds are currently set to their values from
upstream main whenever no new definition is detected. However, this
is like a partial merge of upstream, and produces broken state files.
This commit temporarily comments out resetting until a more robust
solution is built.

* Revert "Don't reset upper bounds (#135226)"

This reverts commit ddbac68.

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Ryan Ernst <ryan@iernst.net>
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 29, 2025
elasticsearchmachine pushed a commit that referenced this pull request Oct 2, 2025
…mputer" (#135597)

This PR reapplies both #133630 and #135052 with a performance bug fix.
The original PR #133630 had a severe impact on throughput for index
creation. It was reverted with #135369. The flamegraph suggests the
system spent a lot time to compute shard assignments on ClusterInfo
instantiation time. But that is unnecessary since they do not change
within a single polling interval. This PR fixes it by reuse the last
value and avoid recomputation.

Copying the original commit message here

If a shard starts on the target node before the next ClusterInfo
polling, today we don't include it for the simulation. With this PR, we
track shards that can potentially start within one ClusterInfo polling
cycle so that they are always included in simulation. The tracking is
reset when a new ClusterInfo arrives.

Resolves: ES-12723
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.2.0

4 participants