Skip to content

Reapply "Track shardStarted events for simulation in DesiredBalanceComputer"#135597

Merged
elasticsearchmachine merged 9 commits intoelastic:mainfrom
ywangd:ES-12723-track-started-shards-take-2
Oct 2, 2025
Merged

Reapply "Track shardStarted events for simulation in DesiredBalanceComputer"#135597
elasticsearchmachine merged 9 commits intoelastic:mainfrom
ywangd:ES-12723-track-started-shards-take-2

Conversation

@ywangd
Copy link
Member

@ywangd ywangd commented Sep 29, 2025

This PR reapplies both #133630 and #135052 with a performance bug fix. The original PR #133630 had a severe impact on throughput for index creation. It was reverted with #135369. The flamegraph suggests the system spent a lot time to compute shard assignments on ClusterInfo instantiation time. But that is unnecessary since they do not change within a single polling interval. This PR fixes it by reuse the last value and avoid recomputation.

Copying the original commit message here

If a shard starts on the target node before the next ClusterInfo polling, today we don't include it for the simulation. With this PR, we track shards that can potentially start within one ClusterInfo polling cycle so that they are always included in simulation. The tracking is reset when a new ClusterInfo arrives.

Resolves: ES-12723

@ywangd ywangd added the :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Sep 29, 2025
Comment on lines -159 to +196
return new ClusterInfo(
leastAvailableSpaceUsage,
mostAvailableSpaceUsage,
shardSizes.toImmutableMap(),
shardDataSetSizes,
dataPath,
Map.of(),
estimatedHeapUsages,
shardMovementWriteLoadSimulator.simulatedNodeUsageStatsForThreadPools(),
allocation.clusterInfo().getShardWriteLoads(),
maxHeapSizePerNode
);
return allocation.clusterInfo()
.updateWith(
leastAvailableSpaceUsage,
mostAvailableSpaceUsage,
shardSizes.toImmutableMap(),
Map.of(),
estimatedHeapUsages,
shardMovementWriteLoadSimulator.simulatedNodeUsageStatsForThreadPools()
);
Copy link
Member Author

@ywangd ywangd Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the centre piece of the performance fix. See also commit b78f608 for full details.

Note that I will not merge this PR until I get proof from benchmarks.

@gbanasiak
Copy link
Contributor

Buildkite benchmark this with many-shards-quantitative please

@ywangd
Copy link
Member Author

ywangd commented Sep 30, 2025

The last results show that the performance is better than the original PR but still worse than the baseline.

  • The P50 throughput for initial-indices is 998, down from the baseline 1257. The original PR was at 368, the original baseline was 985.
  • It might be arguable that the new value (998) is within noise level since the historical low of past 90 days is below 900.

Based on the flamegraphs, I think we may still be able to improve it and I pushed 0b6cde1. Will run a new benchmark for this change.

@ywangd
Copy link
Member Author

ywangd commented Sep 30, 2025

Buildkite benchmark this with many-shards-quantitative please

Comment on lines 45 to 47
return Map.copyOf(getForRead());
return wasCopied ? Collections.unmodifiableMap(getForRead()) : getForRead();
Copy link
Member Author

@ywangd ywangd Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and a similar change in ClusterInfo's constructor is the new fix after reviewing the results of last benchmark.

It is not necessarily a new issue. This class assumes mutation operation is rare (see its class level comment) so that it copies map if it is mutated (this is already a performance fix since the previous behaviour was always copying). But the assumption does not seem true when there is a high rate of index creations as in the many-shards benchmark. The consequence is that the shardSizes map gets copied many times which is visible in the baseline flamegraph (6.56%). The PR adds simulateAlreadyStartedShard that simulates shard started within a single ClusterInfo polling. The most expensive part of this new logic is again map copying (12.78%).

The additional work for simulateAlreadyStartedShard is omitted by the baseline which effectively makes it do less work (with less accurate result). The PR fixes this issue so that the additional work is justified. But the issue is that additional work reuses the same non-performant map copying code and it makes the problem more visible.

In 0b6cde1, I replaced the map copying with Collections.unmodifiableMap wrapping which should remove most of the allocation cost. ClusterInfo for simulation is used in a single-threaded environment, there should be no real functional difference between map copying and immutable wrapping.

Baseline
Screenshot 2025-09-30 at 1 49 18 pm

Contender
Screenshot 2025-09-30 at 1 50 00 pm

Copy link
Contributor

@nicktindall nicktindall Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a pretty serious behaviour change for something that's in org.elasticsearch.common.util, but it also only appears to be used by the ClusterInfoSimulator.

I wonder if we should at least change the name of the method from toImmutable to unmodifiableReference or something. It makes me feel a little uneasy.

Or perhaps it makes more sense to avoid using it in the ClusterInfoSimulator, seeing as our mutations are not in fact "rare"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps it makes more sense to avoid using it in the ClusterInfoSimulator

Yeah, this is along the line of my suggestion here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to not call getClusterInfo in ClusterInfoSimulator#simulateShardStarted. The downstream code only needs info from shardSizes map and we can pass down a lookup function which internally directly use the shardSizes field.

The getClusterInfo will still be called by DesiredBalanceComputer. But its contribution to performance is very minor based on the baseline flamegraph as shown below. The purple color indicates where the map copying happened and the one pointed by the arrow is from DesiredBalanceComputer and the rest is from ClusterInfoSimulator#simulateShardStarted.

Screenshot 2025-09-30 at 7 04 42 pm

Overall, I still slightly prefer the prior proposal.

Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work with the optimisation, I'm just a little uneasy about the change to CopyOnFirstWriteMap. Be good to get consensus on that first.

@ywangd
Copy link
Member Author

ywangd commented Sep 30, 2025

I'm just a little uneasy about the change to CopyOnFirstWriteMap. Be good to get consensus on that first.

Yeah, understandable. My proposal is to move it into ClusterInfoSimulator since that is the only place where it is used and this should avoid any future misuse. It only works in single threaded environment which was a bit surprising to me given its similar name to Java's concurrent CopyOnWrite collections. So I think internalize it makes sense.

@ywangd
Copy link
Member Author

ywangd commented Sep 30, 2025

The last benchmark results show the "unmodifable map wrapping" change in this commit 0b6cde1 has been very effective.

The contender's P50 throughput is 1269, now higher than 1199 from the baseline. Comparing the flamegraphs, we can also see that the overtime spent for DesiredBalanceComputer#compute dropped from 28.39% to 21.09%.

The remaining question is whether we want to go with the unmodifiable wrapping change as described here or alternatively go with the other proposal here.

PS: The many shards benchmark configures cluster.info.update.interval to be 60 seconds instead of the default 30s. This also adds up to the work for the new started shard tracking since slower ClusterInfo update means tracking longer for more shards.

@ywangd
Copy link
Member Author

ywangd commented Oct 1, 2025

Buildkite benchmark this with many-shards-quantitative please

@elasticmachine
Copy link
Collaborator

elasticmachine commented Oct 1, 2025

💚 Build Succeeded

This build ran two many-shards-quantitative benchmarks to evaluate performance impact of this PR.

History

@ywangd
Copy link
Member Author

ywangd commented Oct 1, 2025

After discussion with @nicktindall, we agreed to take the alternative approach of avoid creating ClusterInfo for estimating shard sizes as described here. I have update the PR and triggered another round of benchmark.

@ywangd
Copy link
Member Author

ywangd commented Oct 1, 2025

As expected, the latest change also has an comparable performance: P50 throughput is 1304 vs baseline 1045. The most impactful fix is splitted and merged separately as #135751. The fix in this PR is now just b78f608 which is less useful but still nice to have, i.e. avoid re-computing shard assignments and map copying when reconstructing ClusterInfo with ClusterInfoSimulator.

PS: In long term, I'd personally like to see us getting rid of the dances between ClusterInfo and ClusterInfoSimulator. It is clear that we use ClusterInfo in both immutable and mutable ways, similar to RoutingAllocation. Hence one option is making ClusterInfo and ClusterInfoSimulator share an interface and serve as the immutable and mutable implementations, respectively. So that a mutable RoutingAllocation can work directly with ClusterInfoSimulator and avoid the conversions and remaining copies. Along the same line, we could also eventually remove the need to build ModelNode (it shows up in flamegraph now that most map copying is fixed) and potentially replace it with ClusterInfoSimulator and/or RoutingNodes.

@ywangd ywangd marked this pull request as ready for review October 1, 2025 23:55
@ywangd ywangd requested a review from nicktindall October 1, 2025 23:56
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Oct 1, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Oct 2, 2025
@elasticsearchmachine elasticsearchmachine merged commit 0755992 into elastic:main Oct 2, 2025
34 checks passed
@ywangd ywangd deleted the ES-12723-track-started-shards-take-2 branch October 2, 2025 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.2.0

5 participants