Track shardStarted events for simulation in DesiredBalanceComputer by ywangd · Pull Request #133630 · elastic/elasticsearch

ywangd · 2025-08-27T08:36:21Z

If a shard starts on the target node before the next ClusterInfo polling, today we don't include it for the simulation. With this PR, we track shards that can potentially start within one ClusterInfo polling cycle so that they are always included in simulation. The tracking is reset when a new ClusterInfo arrives.

Resolves: ES-12723

Relates: ES-12723

elasticsearchmachine · 2025-08-27T08:36:46Z

Hi @ywangd, I've created a changelog YAML for you.

ywangd · 2025-08-27T08:45:35Z

I went back and forth on how to track shardStarted events. In the end, I decided to do it mostly within DesiredBalanceComputer since (1) it is the only place where it is needed and (2) it requires fewer wiring changes than tracking inside InternalClusterInfoService. I am raising it as a draft to seek agreement on the approach. I will work on more tests if we are OK to proceed, or I can take a different approach if folks are not happy with the current one. Thanks!

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

nicktindall

I don't mind the approach but I think it adds some complexity and state to an already quite complex/stateful bit of code

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

ywangd · 2025-08-28T04:23:54Z

it adds some complexity and state to an already quite complex/stateful bit of code

I think it will have to add some complexity. But if we track the real shard started events, the complexity in DesiredBalanceComputer might be a bit lower. I am thinking of switching to that, also because of this comment.

henningandersen

Left a few initial comments, did not get into the weeds of the started simulations yet

henningandersen · 2025-08-28T08:04:05Z

server/src/main/java/org/elasticsearch/cluster/InternalClusterInfoService.java

+        return currentClusterInfo;
+    }
+
+    private void updateAndGetCurrentClusterInfo() {


The method name here hints that it should return the cluster info? That would seem nice to do, but I'd also be fine to just call it updateClusterInfo

Yeah it was an oversight and was meant to return the value. Fixed in 001af0d

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

ywangd · 2025-08-28T09:04:34Z

I got a new idea after talking to Henning. I'll rework this PR. Please hold off on your reviews. Thanks! 😅

ywangd · 2025-09-08T09:17:42Z

As discussed previously, I pushed 85d0089 to implement simulation for started shards by diffing the current RoutingNodes against the last polled ClusterInfo. The logic is mostly in the new method DesiredBalanceComputer#simulateAlreadyStartedShards. Please see the inline comments for reasoning and discussions. I am keeping this PR as a draft to get overall agreement on the new approach before adding more tests. I also resolved most of the previous comments since they no longer apply. Thanks a lot!
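The diffing idea described above can be illustrated with a minimal sketch. It is not the code from 85d0089: `ShardCopy`, `StartedShardDiff` and `startedSinceLastPoll` are hypothetical stand-ins for the routing table and the last polled ClusterInfo, and the assumption is that the polled snapshot can say which shard ids it saw on each node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a shard copy in the current routing table (not the Elasticsearch ShardRouting).
record ShardCopy(String shardId, String nodeId, boolean started) {}

final class StartedShardDiff {

    /**
     * Returns the started shard copies that the last polled snapshot does not yet report on their current node.
     * These are the shards a simulation would need to account for explicitly.
     *
     * @param currentRouting    all shard copies in the current routing table
     * @param knownShardsByNode node id -> shard ids the last polled snapshot saw on that node (assumed shape)
     */
    static List<ShardCopy> startedSinceLastPoll(List<ShardCopy> currentRouting, Map<String, Set<String>> knownShardsByNode) {
        List<ShardCopy> result = new ArrayList<>();
        for (ShardCopy copy : currentRouting) {
            if (copy.started() == false) {
                continue; // still initializing or relocating, not part of this diff
            }
            Set<String> known = knownShardsByNode.getOrDefault(copy.nodeId(), Set.of());
            if (known.contains(copy.shardId()) == false) {
                result.add(copy); // started on this node after the snapshot was taken
            }
        }
        return result;
    }
}
```

Because the result is derived purely from the two inputs, nothing needs to be reset when a new snapshot arrives: the next diff is simply computed against the newer snapshot.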

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

ywangd · 2025-09-15T04:17:45Z

This PR is now ready for review. Thanks! 🙏

henningandersen

LGTM.

Only skimmed through the tests, looks like adequate coverage but may be good to have a second review as well.

nicktindall · 2025-09-16T03:36:45Z

server/src/main/java/org/elasticsearch/cluster/InternalClusterInfoService.java

+        return currentClusterInfo;
+    }
+
+    private ClusterInfo updateAndGetCurrentClusterInfo() {


As discussed in chat, the point of separating get and update is to remove the risk of someone calling get when a refresh is half-finished and receiving a ClusterInfo that was a mix of data from the current and previous refresh.

I wonder in light of this change whether we should make leastAvailableSpaceUsages / mostAvailableSpaceUsages / nodeThreadPoolUsageStatsPerNode etc. all fields on the AsyncRefresh, so that state is kept private until the refresh is completed? Feels a bit safer than leaving them as fields on the enclosing class where they can be accessed whenever.

If not that, we should at least put some doc here to indicate that using any of those fields directly isn't safe, and explain the reason for the update/get split.

It would mean rather than updateAndGet... living at the top level and accessing fields, it might need to change to being updateClusterInfo and accepting one that the AsyncRefresh had created from its private state.

I'd have a separate PR for the refactoring. For now I added comments in 95ba02d
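A rough sketch of the refactoring proposed in this thread, under stated assumptions: `InfoSnapshot`, `InfoService` and the callback names are hypothetical and only mirror the shape of the discussion (intermediate stats private to the refresh, a complete snapshot handed to `updateClusterInfo`). It is not the code from the follow-up PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable snapshot (not the real ClusterInfo).
record InfoSnapshot(Map<String, Long> freeBytesByNode, Map<String, Double> writeUtilizationByNode) {
    static final InfoSnapshot EMPTY = new InfoSnapshot(Map.of(), Map.of());
}

final class InfoService {
    // Only the complete snapshot lives at the top level; readers never see partial data.
    private volatile InfoSnapshot current = InfoSnapshot.EMPTY;

    InfoSnapshot getClusterInfo() {
        return current;
    }

    // Accepts a snapshot that a refresh has fully assembled and publishes it in one step.
    InfoSnapshot updateClusterInfo(InfoSnapshot complete) {
        current = complete;
        return complete;
    }

    // Intermediate results stay private to the refresh until it finishes.
    final class AsyncRefresh {
        private final Map<String, Long> freeBytes = new HashMap<>();
        private final Map<String, Double> writeUtilization = new HashMap<>();

        void onDiskUsage(String nodeId, long freeBytesOnNode) {
            freeBytes.put(nodeId, freeBytesOnNode);
        }

        void onThreadPoolUsage(String nodeId, double utilization) {
            writeUtilization.put(nodeId, utilization);
        }

        InfoSnapshot finish() {
            // Copy so later mutation of the refresh state cannot leak into the published snapshot.
            return updateClusterInfo(new InfoSnapshot(Map.copyOf(freeBytes), Map.copyOf(writeUtilization)));
        }
    }
}
```

With this shape, a caller of `getClusterInfo()` during a half-finished refresh still receives the previous complete snapshot rather than a mix of old and new data.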

nicktindall · 2025-09-16T03:41:57Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

+                // We use dataPath to find out whether a shard is allocated on a node.
+                // TODO: DataPath is sent with disk usages but thread pool usage is sent separately so that local shard allocation
+                // may change between the two calls.
+                if (clusterInfo.getDataPath(shardRouting) == null) {


Nit: I wonder if we could put this logic in the ClusterInfo and have the computer be more agnostic about how this is implemented.

i.e. if we implemented boolean ClusterInfo#shardHasMoved(ShardRouting) and then put this logic and an explanation of it in there. It might reduce the cognitive load for the reader of this method, and allow us to adapt the implementation of #shardHasMoved as the contents of ClusterInfo evolve.

Good suggestion. Pushed 7bad062. Thanks!
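The suggested encapsulation might look roughly like the sketch below. `PolledInfo` and `ShardOnNode` are hypothetical types, and the method name `shardHasMoved` is taken from the review comment rather than from 7bad062; the only assumption carried over from the diff is that a missing data path means the snapshot never saw the shard on that node.

```java
import java.util.Map;

// Hypothetical key identifying a shard copy on a specific node (not the Elasticsearch ShardRouting).
record ShardOnNode(String shardId, String nodeId) {}

// Hypothetical stand-in for the polled snapshot.
final class PolledInfo {
    private final Map<ShardOnNode, String> dataPaths;

    PolledInfo(Map<ShardOnNode, String> dataPaths) {
        this.dataPaths = Map.copyOf(dataPaths);
    }

    /**
     * True when this snapshot has no data path for the shard on its current node, which is taken to mean
     * the shard arrived there after the snapshot was polled. Callers do not need to know that data paths
     * are the underlying signal, so the implementation can change as the snapshot contents evolve.
     */
    boolean shardHasMoved(ShardOnNode shard) {
        return dataPaths.containsKey(shard) == false;
    }
}
```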

nicktindall · 2025-09-16T03:55:43Z

...g/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceShardsAllocatorTests.java

+            final var firstNodeUpdatedStats = updatedClusterInfo.getNodeUsageStatsForThreadPools()
+                .get(firstNode.getId())
+                .threadPoolUsageStatsMap()
+                .get("write");


Nit: can use ThreadPool.Names.WRITE here? (and above/below?)

Yep see f43767d

nicktindall

LGTM, just some proposals around some restructuring, feel free to ignore.

…lastic#133630) If a shard starts on the target node before the next ClusterInfo polling, today we don't include it for the simulation. With this PR, we track shards that can potentially start within one ClusterInfo polling cycle so that they are always included in simulation. The tracking is reset when a new ClusterInfo arrives. Resolves: ES-12723

Only the overall ClusterInfo is needed at the top level. This PR moves the individual intermediate stats fields onto AsyncRefresh to avoid potential misuses. Relates: elastic#133630 (comment)

Only the overall ClusterInfo is needed at the top level. This PR moves the individual intermediate stats fields onto AsyncRefresh to avoid potential misuses. Relates: #133630 (comment)

…puter (elastic#133630)" This reverts commit f248596.

* Revert "Move individual stats fields to AsyncRefresh (#135052)"
  This reverts commit 2b0153b.
* Revert "Track shardStarted events for simulation in DesiredBalanceComputer (#133630)"
  This reverts commit f248596.

* Revert "Move individual stats fields to AsyncRefresh (#135052)"
  This reverts commit 2b0153b.
* Revert "Track shardStarted events for simulation in DesiredBalanceComputer (#133630)"
  This reverts commit f248596.
* [CI] Update transport version definitions
* Revert "[CI] Update transport version definitions"
  This reverts commit 90f38b0.
* Don't reset upper bounds (#135226)
  Transport version upper bounds are currently set to their values from upstream main whenever no new definition is detected. However, this is like a partial merge of upstream, and produces broken state files. This commit temporarily comments out resetting until a more robust solution is built.
* Revert "Don't reset upper bounds (#135226)"
  This reverts commit ddbac68.
---------
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Ryan Ernst <ryan@iernst.net>

This reverts commit e76d333.

…mputer" (#135597) This PR reapplies both #133630 and #135052 with a performance bug fix. The original PR #133630 had a severe impact on index creation throughput and was reverted with #135369. The flamegraph suggests the system spent a lot of time computing shard assignments each time a ClusterInfo was instantiated. That is unnecessary since shard assignments do not change within a single polling interval. This PR fixes it by reusing the last value and avoiding recomputation. Copying the original commit message here: If a shard starts on the target node before the next ClusterInfo polling, today we don't include it for the simulation. With this PR, we track shards that can potentially start within one ClusterInfo polling cycle so that they are always included in simulation. The tracking is reset when a new ClusterInfo arrives. Resolves: ES-12723
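The fix described in this commit message amounts to carrying a derived value forward instead of rebuilding it on every instantiation. The sketch below illustrates that pattern only; `CachedInfo`, its fields and the `DERIVATIONS` counter are hypothetical and are not the classes touched by #135597.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical snapshot whose per-node shard assignments are expensive to derive.
final class CachedInfo {
    // Counts how often the derivation runs; present only to make the saving observable in this sketch.
    static final AtomicInteger DERIVATIONS = new AtomicInteger();

    private final Map<String, String> nodeByShard;        // raw data: shard id -> node id
    private final Map<String, Set<String>> shardsByNode;  // derived: node id -> shard ids
    private final Map<String, Long> reservedBytesByNode;  // data that may change between instantiations

    // Used when a fresh poll arrives: the derived view is computed once here.
    CachedInfo(Map<String, String> nodeByShard, Map<String, Long> reservedBytesByNode) {
        this(nodeByShard, derive(nodeByShard), reservedBytesByNode);
    }

    private CachedInfo(
        Map<String, String> nodeByShard,
        Map<String, Set<String>> shardsByNode,
        Map<String, Long> reservedBytesByNode
    ) {
        this.nodeByShard = nodeByShard;
        this.shardsByNode = shardsByNode;
        this.reservedBytesByNode = reservedBytesByNode;
    }

    // Used within one polling interval: assignments are unchanged, so the derived view is reused as-is.
    CachedInfo withReservedBytes(Map<String, Long> updated) {
        return new CachedInfo(nodeByShard, shardsByNode, updated);
    }

    Set<String> shardsOn(String nodeId) {
        return shardsByNode.getOrDefault(nodeId, Set.of());
    }

    private static Map<String, Set<String>> derive(Map<String, String> nodeByShard) {
        DERIVATIONS.incrementAndGet();
        Map<String, Set<String>> result = new HashMap<>();
        nodeByShard.forEach((shard, node) -> result.computeIfAbsent(node, k -> new HashSet<>()).add(shard));
        return result;
    }
}
```

The private constructor is what makes reuse safe here: only code inside the class can pass in a precomputed view, so it always corresponds to the same raw assignments.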

ywangd added 5 commits August 26, 2025 19:57

[Test] Test to verify ClusterInfoSimulator update for each compute cycle

15a3ee4

Relates: ES-12723

enhance the test to include actual shard started event

bf0629e

Merge remote-tracking branch 'origin/main' into ES-12723-test

5a14682

Track startedShards in compute

d8a84d3

improve handling for resetting desired balance

c2d1b36

ywangd requested review from DiannaHohensee, henningandersen, mhl-b and nicktindall August 27, 2025 08:36

ywangd added the >enhancement, :Distributed/Allocation and v9.2.0 labels Aug 27, 2025

Update docs/changelog/133630.yaml

4293f15

ywangd commented Aug 27, 2025

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java (outdated)

more comment

c3eacda

ywangd commented Aug 27, 2025

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java (outdated)

nicktindall reviewed Aug 28, 2025

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java (outdated)

henningandersen reviewed Aug 28, 2025

ywangd added 4 commits September 8, 2025 14:11

revert changes to desired balance computer

0716c70

Merge remote-tracking branch 'origin/main' into ES-12723-test

5d5dabd

updateAndGet

001af0d

Simulating started shards from diff between RoutingNodes and ClusterInfo

85d0089

ywangd requested review from henningandersen and nicktindall September 8, 2025 09:17

nicktindall reviewed Sep 9, 2025

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

remove debug log

23777c2

henningandersen approved these changes Sep 15, 2025

nicktindall reviewed Sep 16, 2025

nicktindall approved these changes Sep 16, 2025

ywangd added 4 commits September 16, 2025 14:18

Merge remote-tracking branch 'origin/main' into ES-12723-test

5131e37

add comments

95ba02d

extract method

7bad062

replace string

f43767d

ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 16, 2025

ywangd added 2 commits September 16, 2025 19:16

Merge branch 'main' into ES-12723-test

1c0b9f8

Merge branch 'main' into ES-12723-test

da3f889

elasticsearchmachine merged commit f248596 into elastic:main Sep 17, 2025
34 checks passed

ywangd deleted the ES-12723-test branch September 17, 2025 05:33

ywangd mentioned this pull request Sep 19, 2025

Move individual stats fields to AsyncRefresh #135052

Merged

pxsalehi mentioned this pull request Sep 24, 2025

Revert #135052 #133630 #135341

Merged

pxsalehi added a commit to pxsalehi/elasticsearch that referenced this pull request Sep 24, 2025

Revert "Track shardStarted events for simulation in DesiredBalanceCom…

ad43dd0

…puter (elastic#133630)" This reverts commit f248596.

pxsalehi added a commit to pxsalehi/elasticsearch that referenced this pull request Sep 24, 2025

Revert "Track shardStarted events for simulation in DesiredBalanceCom…

3215ffe

…puter (elastic#133630)" This reverts commit f248596.

This was referenced Sep 24, 2025

Revert #135052 and #133630 #135368

Closed

Revert #135052 #133630 #135369

Merged

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 29, 2025

Revert "Revert elastic#135052 elastic#133630 (elastic#135369)"

0695344

This reverts commit e76d333.

ywangd mentioned this pull request Sep 29, 2025

Reapply "Track shardStarted events for simulation in DesiredBalanceComputer" #135597

Merged

shainaraskas mentioned this pull request Oct 10, 2025

Update release notes for Elastic Cloud Serverless elastic/docs-content#3431

Merged

elasticsearchmachine merged 24 commits into elastic:main from ywangd:ES-12723-test


4 participants
