Reapply "Track shardStarted events for simulation in DesiredBalanceComputer" by ywangd · Pull Request #135597 · elastic/elasticsearch

ywangd · 2025-09-29T08:04:25Z

This PR reapplies both #133630 and #135052 with a performance bug fix. The original PR #133630 had a severe impact on throughput for index creation. It was reverted with #135369. The flamegraph suggests the system spent a lot time to compute shard assignments on ClusterInfo instantiation time. But that is unnecessary since they do not change within a single polling interval. This PR fixes it by reuse the last value and avoid recomputation.

Copying the original commit message here

If a shard starts on the target node before the next ClusterInfo polling, today we don't include it for the simulation. With this PR, we track shards that can potentially start within one ClusterInfo polling cycle so that they are always included in simulation. The tracking is reset when a new ClusterInfo arrives.

Resolves: ES-12723

This reverts commit e76d333.

ywangd · 2025-09-29T08:08:00Z

server/src/main/java/org/elasticsearch/cluster/ClusterInfoSimulator.java

-        return new ClusterInfo(
-            leastAvailableSpaceUsage,
-            mostAvailableSpaceUsage,
-            shardSizes.toImmutableMap(),
-            shardDataSetSizes,
-            dataPath,
-            Map.of(),
-            estimatedHeapUsages,
-            shardMovementWriteLoadSimulator.simulatedNodeUsageStatsForThreadPools(),
-            allocation.clusterInfo().getShardWriteLoads(),
-            maxHeapSizePerNode
-        );
+        return allocation.clusterInfo()
+            .updateWith(
+                leastAvailableSpaceUsage,
+                mostAvailableSpaceUsage,
+                shardSizes.toImmutableMap(),
+                Map.of(),
+                estimatedHeapUsages,
+                shardMovementWriteLoadSimulator.simulatedNodeUsageStatsForThreadPools()
+            );


This is the centre piece of the performance fix. See also commit b78f608 for full details.

Note that I will not merge this PR until I get proof from benchmarks.

gbanasiak · 2025-09-29T08:30:01Z

Buildkite benchmark this with many-shards-quantitative please

ywangd · 2025-09-30T03:29:36Z

The last results show that the performance is better than the original PR but still worse than the baseline.

The P50 throughput for initial-indices is 998, down from the baseline 1257. The original PR was at 368, the original baseline was 985.
It might be arguable that the new value (998) is within noise level since the historical low of past 90 days is below 900.

Based on the flamegraphs, I think we may still be able to improve it and I pushed 0b6cde1. Will run a new benchmark for this change.

ywangd · 2025-09-30T03:30:13Z

Buildkite benchmark this with many-shards-quantitative please

ywangd · 2025-09-30T03:58:51Z

server/src/main/java/org/elasticsearch/common/util/CopyOnFirstWriteMap.java

-        return Map.copyOf(getForRead());
+        return wasCopied ? Collections.unmodifiableMap(getForRead()) : getForRead();


This and a similar change in ClusterInfo's constructor is the new fix after reviewing the results of last benchmark.

It is not necessarily a new issue. This class assumes mutation operation is rare (see its class level comment) so that it copies map if it is mutated (this is already a performance fix since the previous behaviour was always copying). But the assumption does not seem true when there is a high rate of index creations as in the many-shards benchmark. The consequence is that the shardSizes map gets copied many times which is visible in the baseline flamegraph (6.56%). The PR adds simulateAlreadyStartedShard that simulates shard started within a single ClusterInfo polling. The most expensive part of this new logic is again map copying (12.78%).

The additional work for simulateAlreadyStartedShard is omitted by the baseline which effectively makes it do less work (with less accurate result). The PR fixes this issue so that the additional work is justified. But the issue is that additional work reuses the same non-performant map copying code and it makes the problem more visible.

In 0b6cde1, I replaced the map copying with Collections.unmodifiableMap wrapping which should remove most of the allocation cost. ClusterInfo for simulation is used in a single-threaded environment, there should be no real functional difference between map copying and immutable wrapping.

Baseline

Contender

This seems like a pretty serious behaviour change for something that's in org.elasticsearch.common.util, but it also only appears to be used by the ClusterInfoSimulator.

I wonder if we should at least change the name of the method from toImmutable to unmodifiableReference or something. It makes me feel a little uneasy.

Or perhaps it makes more sense to avoid using it in the ClusterInfoSimulator, seeing as our mutations are not in fact "rare"?

Or perhaps it makes more sense to avoid using it in the ClusterInfoSimulator

Yeah, this is along the line of my suggestion here.

Another option is to not call getClusterInfo in ClusterInfoSimulator#simulateShardStarted. The downstream code only needs info from shardSizes map and we can pass down a lookup function which internally directly use the shardSizes field.

The getClusterInfo will still be called by DesiredBalanceComputer. But its contribution to performance is very minor based on the baseline flamegraph as shown below. The purple color indicates where the map copying happened and the one pointed by the arrow is from DesiredBalanceComputer and the rest is from ClusterInfoSimulator#simulateShardStarted.

Overall, I still slightly prefer the prior proposal.

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

nicktindall

Nice work with the optimisation, I'm just a little uneasy about the change to CopyOnFirstWriteMap. Be good to get consensus on that first.

ywangd · 2025-09-30T08:21:25Z

I'm just a little uneasy about the change to CopyOnFirstWriteMap. Be good to get consensus on that first.

Yeah, understandable. My proposal is to move it into ClusterInfoSimulator since that is the only place where it is used and this should avoid any future misuse. It only works in single threaded environment which was a bit surprising to me given its similar name to Java's concurrent CopyOnWrite collections. So I think internalize it makes sense.

ywangd · 2025-09-30T12:04:30Z

The last benchmark results show the "unmodifable map wrapping" change in this commit 0b6cde1 has been very effective.

The contender's P50 throughput is 1269, now higher than 1199 from the baseline. Comparing the flamegraphs, we can also see that the overtime spent for DesiredBalanceComputer#compute dropped from 28.39% to 21.09%.

The remaining question is whether we want to go with the unmodifiable wrapping change as described here or alternatively go with the other proposal here.

PS: The many shards benchmark configures cluster.info.update.interval to be 60 seconds instead of the default 30s. This also adds up to the work for the new started shard tracking since slower ClusterInfo update means tracking longer for more shards.

This reverts commit 0b6cde1.

…d-shards-take-2

ywangd · 2025-10-01T03:51:21Z

Buildkite benchmark this with many-shards-quantitative please

elasticmachine · 2025-10-01T03:53:12Z

💚 Build Succeeded

Buildkite Build
Commit: 69a0880
Baseline: bd0d1ac (env ID bc3b2367-ef12-4b1d-ad12-135f1728f25f)
Contender: 69a0880 (env ID b9784e36-fc2b-441f-a92d-c0dc7e77918b)
Benchmark results

This build ran two many-shards-quantitative benchmarks to evaluate performance impact of this PR.

History

💚 Build #24 succeeded 0b6cde1
💚 Build #23 succeeded b78f608

ywangd · 2025-10-01T04:45:09Z

After discussion with @nicktindall, we agreed to take the alternative approach of avoid creating ClusterInfo for estimating shard sizes as described here. I have update the PR and triggered another round of benchmark.

…d-shards-take-2

ywangd · 2025-10-01T23:55:21Z

As expected, the latest change also has an comparable performance: P50 throughput is 1304 vs baseline 1045. The most impactful fix is splitted and merged separately as #135751. The fix in this PR is now just b78f608 which is less useful but still nice to have, i.e. avoid re-computing shard assignments and map copying when reconstructing ClusterInfo with ClusterInfoSimulator.

PS: In long term, I'd personally like to see us getting rid of the dances between ClusterInfo and ClusterInfoSimulator. It is clear that we use ClusterInfo in both immutable and mutable ways, similar to RoutingAllocation. Hence one option is making ClusterInfo and ClusterInfoSimulator share an interface and serve as the immutable and mutable implementations, respectively. So that a mutable RoutingAllocation can work directly with ClusterInfoSimulator and avoid the conversions and remaining copies. Along the same line, we could also eventually remove the need to build ModelNode (it shows up in flamegraph now that most map copying is fixed) and potentially replace it with ClusterInfoSimulator and/or RoutingNodes.

elasticsearchmachine · 2025-10-01T23:56:18Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

server/src/main/java/org/elasticsearch/cluster/InternalClusterInfoService.java

nicktindall

LGTM

ywangd added 2 commits September 29, 2025 16:38

Revert "Revert elastic#135052 elastic#133630 (elastic#135369)"

0695344

This reverts commit e76d333.

Avoid unnecessary computation for shard assignments

b78f608

elasticsearchmachine added the v9.2.0 label Sep 29, 2025

ywangd added the :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Sep 29, 2025

ywangd requested review from DiannaHohensee, henningandersen, mhl-b and nicktindall September 29, 2025 08:06

ywangd commented Sep 29, 2025

View reviewed changes

opitmize for shardSizes copying

0b6cde1

ywangd commented Sep 30, 2025

View reviewed changes

nicktindall reviewed Sep 30, 2025

View reviewed changes

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java Outdated Show resolved Hide resolved

nicktindall reviewed Sep 30, 2025

View reviewed changes

ywangd added 3 commits October 1, 2025 13:46

Revert "opitmize for shardSizes copying"

cebdab7

This reverts commit 0b6cde1.

Merge remote-tracking branch 'origin/main' into ES-12723-track-starte…

bb8d188

…d-shards-take-2

shardSizeProvider interface

69a0880

ywangd added 2 commits October 2, 2025 09:42

fix changelog

3a9cf74

Merge remote-tracking branch 'origin/main' into ES-12723-track-starte…

a193584

…d-shards-take-2

ywangd marked this pull request as ready for review October 1, 2025 23:55

ywangd added the >enhancement label Oct 1, 2025

ywangd requested a review from nicktindall October 1, 2025 23:56

elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Oct 1, 2025

nicktindall reviewed Oct 2, 2025

View reviewed changes

server/src/main/java/org/elasticsearch/cluster/InternalClusterInfoService.java Outdated Show resolved Hide resolved

no need for volatile read

cbd5adb

nicktindall reviewed Oct 2, 2025

View reviewed changes

server/src/main/java/org/elasticsearch/cluster/InternalClusterInfoService.java Show resolved Hide resolved

nicktindall approved these changes Oct 2, 2025

View reviewed changes

ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Oct 2, 2025

elasticsearchmachine merged commit 0755992 into elastic:main Oct 2, 2025
34 checks passed

ywangd deleted the ES-12723-track-started-shards-take-2 branch October 2, 2025 01:36

ywangd mentioned this pull request Oct 17, 2025

Refactor Metadata.Builder to hold ProjectMetadata #134599

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reapply "Track shardStarted events for simulation in DesiredBalanceComputer"#135597

Reapply "Track shardStarted events for simulation in DesiredBalanceComputer"#135597
elasticsearchmachine merged 9 commits intoelastic:mainfrom
ywangd:ES-12723-track-started-shards-take-2

ywangd commented Sep 29, 2025 •

edited

Loading

ywangd Sep 29, 2025 •

edited

Loading

gbanasiak commented Sep 29, 2025

ywangd commented Sep 30, 2025

ywangd commented Sep 30, 2025

ywangd Sep 30, 2025 •

edited

Loading

nicktindall Sep 30, 2025 •

edited

Loading

ywangd Sep 30, 2025

ywangd Sep 30, 2025

Uh oh!

nicktindall left a comment

ywangd commented Sep 30, 2025

ywangd commented Sep 30, 2025

ywangd commented Oct 1, 2025

elasticmachine commented Oct 1, 2025 •

edited

Loading

ywangd commented Oct 1, 2025

ywangd commented Oct 1, 2025

elasticsearchmachine commented Oct 1, 2025

Uh oh!

Uh oh!

nicktindall left a comment

Uh oh!

Labels

5 participants

		return Map.copyOf(getForRead());
		return wasCopied ? Collections.unmodifiableMap(getForRead()) : getForRead();

Conversation

ywangd commented Sep 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ywangd Sep 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

gbanasiak commented Sep 29, 2025

ywangd commented Sep 30, 2025

ywangd commented Sep 30, 2025

ywangd Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

nicktindall Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

ywangd Sep 30, 2025

Choose a reason for hiding this comment

ywangd Sep 30, 2025

Choose a reason for hiding this comment

Uh oh!

nicktindall left a comment

Choose a reason for hiding this comment

ywangd commented Sep 30, 2025

ywangd commented Sep 30, 2025

ywangd commented Oct 1, 2025

elasticmachine commented Oct 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💚 Build Succeeded

History

ywangd commented Oct 1, 2025

ywangd commented Oct 1, 2025

elasticsearchmachine commented Oct 1, 2025

Uh oh!

Uh oh!

nicktindall left a comment

Choose a reason for hiding this comment

Uh oh!

Labels

5 participants

ywangd commented Sep 29, 2025 •

edited

Loading

ywangd Sep 29, 2025 •

edited

Loading

ywangd Sep 30, 2025 •

edited

Loading

nicktindall Sep 30, 2025 •

edited

Loading

elasticmachine commented Oct 1, 2025 •

edited

Loading