Reapply "Track shardStarted events for simulation in DesiredBalanceComputer" #135597
Conversation
```diff
-        return new ClusterInfo(
-            leastAvailableSpaceUsage,
-            mostAvailableSpaceUsage,
-            shardSizes.toImmutableMap(),
-            shardDataSetSizes,
-            dataPath,
-            Map.of(),
-            estimatedHeapUsages,
-            shardMovementWriteLoadSimulator.simulatedNodeUsageStatsForThreadPools(),
-            allocation.clusterInfo().getShardWriteLoads(),
-            maxHeapSizePerNode
-        );
+        return allocation.clusterInfo()
+            .updateWith(
+                leastAvailableSpaceUsage,
+                mostAvailableSpaceUsage,
+                shardSizes.toImmutableMap(),
+                Map.of(),
+                estimatedHeapUsages,
+                shardMovementWriteLoadSimulator.simulatedNodeUsageStatsForThreadPools()
+            );
```
This is the centre piece of the performance fix. See also commit b78f608 for full details.
Note that I will not merge this PR until I get proof from benchmarks.
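A minimal sketch of why `updateWith` is cheaper than the full constructor, using a hypothetical two-field stand-in for `ClusterInfo` (the real class has many more fields): only the fields the simulator changed are replaced, and the rest is carried over from the polled instance instead of being rebuilt.

```java
import java.util.Map;

// Hypothetical stand-in for ClusterInfo, illustrating the shape of the fix:
// updateWith replaces only what the simulator changed and reuses the rest of
// the polled instance rather than recomputing it.
record MiniClusterInfo(Map<String, Long> shardSizes, Map<String, Double> shardWriteLoads) {

    MiniClusterInfo updateWith(Map<String, Long> newShardSizes) {
        // shardWriteLoads (and, in the real class, other expensive-to-derive
        // state) is reused as-is rather than recomputed.
        return new MiniClusterInfo(newShardSizes, this.shardWriteLoads);
    }
}
```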
Buildkite benchmark this with many-shards-quantitative please
The last results show that the performance is better than the original PR but still worse than the baseline.
Based on the flamegraphs, I think we may still be able to improve it, and I pushed 0b6cde1. I will run a new benchmark for this change.
Buildkite benchmark this with many-shards-quantitative please
```diff
-        return Map.copyOf(getForRead());
+        return wasCopied ? Collections.unmodifiableMap(getForRead()) : getForRead();
```
This and a similar change in ClusterInfo's constructor are the new fix after reviewing the results of the last benchmark.
It is not necessarily a new issue. This class assumes mutation operations are rare (see its class-level comment), so it copies the map only if it is mutated (this was already a performance fix, since the previous behaviour was to always copy). But that assumption does not hold when there is a high rate of index creation, as in the many-shards benchmark. The consequence is that the shardSizes map gets copied many times, which is visible in the baseline flamegraph (6.56%). The PR adds simulateAlreadyStartedShard, which simulates shards started within a single ClusterInfo polling cycle. The most expensive part of this new logic is again map copying (12.78%).
The additional work for simulateAlreadyStartedShard is omitted by the baseline, which effectively makes it do less work (with a less accurate result). The PR fixes this issue so that the additional work is justified. But that additional work reuses the same non-performant map-copying code, which makes the problem more visible.
In 0b6cde1, I replaced the map copying with Collections.unmodifiableMap wrapping, which should remove most of the allocation cost. Since the ClusterInfo for simulation is used in a single-threaded environment, there should be no real functional difference between map copying and unmodifiable wrapping.
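To make the trade-off concrete, here is a simplified sketch (not the actual `org.elasticsearch.common.util` class; field and method names are illustrative) of a copy-on-first-write map: reads go to the original map until the first mutation triggers a single defensive copy, and the read view is returned as an unmodifiable wrapper instead of a fresh `Map.copyOf` snapshot.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Simplified copy-on-first-write map sketch: mutations are assumed rare, so
// the backing map is copied at most once, on the first write.
class CopyOnFirstWriteMapSketch<K, V> {
    private Map<K, V> source;       // the (assumed immutable) map we started from
    private boolean wasCopied = false;

    CopyOnFirstWriteMapSketch(Map<K, V> source) {
        this.source = source;
    }

    private Map<K, V> getForRead() {
        return source;
    }

    private Map<K, V> getForUpdate() {
        if (wasCopied == false) {
            source = new HashMap<>(source); // copy exactly once, on first write
            wasCopied = true;
        }
        return source;
    }

    void put(K key, V value) {
        getForUpdate().put(key, value);
    }

    // Old behaviour: Map.copyOf(getForRead()) allocated a fresh copy on every
    // call. The 0b6cde1 change returns an unmodifiable *view* instead, which
    // is safe as long as callers are single-threaded and treat it as read-only.
    Map<K, V> toImmutableMap() {
        return wasCopied ? Collections.unmodifiableMap(getForRead()) : getForRead();
    }
}
```

The view shares state with the backing map, which is exactly the behaviour-change concern raised below: a later mutation of the wrapper's source would be visible through a previously returned "immutable" map.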
This seems like a pretty serious behaviour change for something that's in org.elasticsearch.common.util, but it also only appears to be used by the ClusterInfoSimulator.
I wonder if we should at least change the name of the method from toImmutable to unmodifiableReference or something. It makes me feel a little uneasy.
Or perhaps it makes more sense to avoid using it in the ClusterInfoSimulator, seeing as our mutations are not in fact "rare"?
> Or perhaps it makes more sense to avoid using it in the ClusterInfoSimulator
Yeah, this is along the line of my suggestion here.
Another option is to not call getClusterInfo in ClusterInfoSimulator#simulateShardStarted. The downstream code only needs info from the shardSizes map, and we can pass down a lookup function that internally uses the shardSizes field directly.
getClusterInfo will still be called by DesiredBalanceComputer, but its contribution to performance is very minor based on the baseline flamegraph shown below. The purple colour indicates where the map copying happened; the one pointed to by the arrow is from DesiredBalanceComputer, and the rest is from ClusterInfoSimulator#simulateShardStarted.
Overall, I still slightly prefer the prior proposal.
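A sketch of that option (class and method names here are hypothetical simplifications of the simulator): instead of materializing a ClusterInfo, and thereby copying shardSizes, inside the simulation path, the downstream code takes a size-lookup function that reads the field directly.

```java
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical simulator sketch: the simulation path never materializes a
// full ClusterInfo, it only receives a shard-size lookup.
class SimulatorSketch {
    private final Map<String, Long> shardSizes; // shard id -> size in bytes

    SimulatorSketch(Map<String, Long> shardSizes) {
        this.shardSizes = shardSizes;
    }

    // Downstream code needs sizes only, so it takes a lookup, not a ClusterInfo.
    static long sizeForSimulation(String shardId, ToLongFunction<String> shardSizeLookup) {
        return shardSizeLookup.applyAsLong(shardId);
    }

    long simulateShardStarted(String shardId) {
        // No getClusterInfo() call here, hence no map copy per simulated shard.
        return sizeForSimulation(shardId, id -> shardSizes.getOrDefault(id, 0L));
    }
}
```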
...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java
nicktindall left a comment:
Nice work with the optimisation; I'm just a little uneasy about the change to CopyOnFirstWriteMap. It would be good to get consensus on that first.
Yeah, understandable. My proposal is to move it into …
The last benchmark results show the "unmodifiable map wrapping" change in commit 0b6cde1 has been very effective. The contender's P50 throughput is …
The remaining question is whether we want to go with the unmodifiable wrapping change as described here, or alternatively go with the other proposal here.
PS: The many-shards benchmark configures …
This reverts commit 0b6cde1.
Buildkite benchmark this with many-shards-quantitative please
💚 Build Succeeded
This build ran two many-shards-quantitative benchmarks to evaluate the performance impact of this PR.
After discussion with @nicktindall, we agreed to take the alternative approach of avoiding creating …
As expected, the latest change also has comparable performance: P50 throughput is 1304 vs baseline 1045. The most impactful fix was split out and merged separately as #135751. The fix in this PR is now just b78f608, which is less useful but still nice to have, i.e. avoiding re-computing shard assignments and map copying when reconstructing …
PS: In the long term, I'd personally like to see us get rid of the dances between …
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)


This PR reapplies both #133630 and #135052 with a performance bug fix. The original PR #133630 had a severe impact on throughput for index creation and was reverted in #135369. The flamegraph suggests the system spent a lot of time computing shard assignments at ClusterInfo instantiation time. But that is unnecessary, since they do not change within a single polling interval. This PR fixes it by reusing the last value and avoiding recomputation.
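The reuse-instead-of-recompute idea can be sketched as a simple per-polling-cycle cache (class and method names are hypothetical; the real code keys this off ClusterInfo arrival rather than an explicit cycle number):

```java
import java.util.List;
import java.util.Map;

// Sketch of the fix: shard assignments do not change within a single
// ClusterInfo polling interval, so compute them once per cycle and reuse the
// cached value on subsequent ClusterInfo constructions.
class CachedAssignmentsSketch {
    private Map<String, List<String>> cachedAssignments; // shard -> assigned node ids
    private long cachedForPollingCycle = -1;
    static int computations = 0; // instrumentation for this sketch only

    Map<String, List<String>> assignments(long pollingCycle) {
        if (pollingCycle != cachedForPollingCycle) {
            cachedAssignments = computeAssignments();
            cachedForPollingCycle = pollingCycle;
        }
        return cachedAssignments; // same cycle -> reuse, no recomputation
    }

    private Map<String, List<String>> computeAssignments() {
        computations++; // counts how often the expensive path runs
        return Map.of("[idx][0]", List.of("node-1"));
    }
}
```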
Copying the original commit message here:
If a shard starts on the target node before the next ClusterInfo polling, today we do not include it in the simulation. With this PR, we track shards that can potentially start within one ClusterInfo polling cycle so that they are always included in the simulation. The tracking is reset when a new ClusterInfo arrives.
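The tracking described above amounts to a set that accumulates started shards between polls and is cleared when a fresh ClusterInfo arrives. A minimal sketch (names are hypothetical, not the actual PR code):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the shard-started tracker: remember shards that started since the
// last ClusterInfo poll so the simulation can include them, and reset when a
// new ClusterInfo arrives (which already reflects those shards).
class StartedShardTrackerSketch {
    private final Set<String> startedSinceLastPoll = new HashSet<>();

    void onShardStarted(String shardId) {
        startedSinceLastPoll.add(shardId);
    }

    boolean includeInSimulation(String shardId) {
        return startedSinceLastPoll.contains(shardId);
    }

    void onNewClusterInfo() {
        startedSinceLastPoll.clear(); // fresh poll, start tracking from scratch
    }
}
```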
Resolves: ES-12723