Defer reroute when starting shards#44433
Conversation
Today we reroute the cluster as part of the process of starting a shard, which runs at `URGENT` priority. In large clusters, rerouting may take some time to complete, and this means that a mere trickle of shard-started events can cause starvation for other, lower-priority, tasks that are pending on the master. However, it isn't really necessary to perform a reroute when starting a shard, as long as one occurs eventually. This commit removes the inline reroute from the process of starting a shard and replaces it with a deferred one that runs at `NORMAL` priority, avoiding starvation of higher-priority tasks. This may improve some of the situations related to elastic#42738 and elastic#42105.
|
Pinging @elastic/es-distributed |
|
Note to reviewers: don't let the |
|
@DaveCTurner seems there's a problem with the setting change on CI and it's probably just missing a warning exclusion in the rest tests: |
We cannot set the priority in all InternalTestClusters because the deprecation warning makes some tests unhappy. This commit adds a specific test instead.
|
Bit more complicated than that - I thought I could get away with something simple but got bitten by the deprecation logger, so had to add a specific test in 5fc2e7e. |
|
Jenkins run elasticsearch-ci/2 (just the xpack license test failing) |
There was a problem hiding this comment.
Code looks good, just one question.
Also, to me the change to NORMAL priority/not-inline-execution makes sense here but maybe Yannick should confirm as well :)
I wonder though if we really should hide this change as much as it's hidden now. Couldn't we add a note on this change stating that this might be a problem if your cluster is already badly configured (master overloaded + lots of shard fluctuation?) but is an optimization for a healthy setup? That way users affected negatively by this might be more likely to either understand/fix the problem or report back in a more meaningful way?
server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java
Outdated
Show resolved
Hide resolved
|
Sure. There's no way to express changes to the migration docs in a PR against diff --git a/docs/reference/migration/migrate_7_4.asciidoc b/docs/reference/migration/migrate_7_4.asciidoc
index ebfca7d25c1..63315f9fb65 100644
--- a/docs/reference/migration/migrate_7_4.asciidoc
+++ b/docs/reference/migration/migrate_7_4.asciidoc
@@ -67,4 +67,20 @@ unsupported on buckets created after September 30th 2020.
Starting in version 7.4, a `+` in a URL will be encoded as `%2B` by all REST API functionality. Prior versions handled a `+` as a single space.
If your application requires handling `+` as a single space you can return to the old behaviour by setting the system property
`es.rest.url_plus_as_space` to `true`. Note that this behaviour is deprecated and setting this system property to `true` will cease
-to be supported in version 8.
\ No newline at end of file
+to be supported in version 8.
+
+[float]
+[[breaking_74_cluster_changes]]
+=== Cluster changes
+
+[float]
+==== Rerouting after starting a shard runs at lower priority
+
+After starting each shard the elected master node must perform a reroute to
+search for other shards that could be allocated. In particular, when creating
+an index it is this task that allocates the replicas once the primaries have
+started. In versions prior to 7.4 this task runs at priority `URGENT`, but
+starting in version 7.4 its priority is reduced to `NORMAL`. In a
+well-configured cluster this reduces the amount of work the master must do, but
+means that a cluster with a master that is overloaded with other tasks at
+`HIGH` or `URGENT` priority may take longer to allocate all replicas. |
original-brownbear
left a comment
There was a problem hiding this comment.
LGTM :) thanks @DaveCTurner
ywelsch
left a comment
There was a problem hiding this comment.
I've left a few small comments, looking good o.w.
server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java
Outdated
Show resolved
Hide resolved
* Defer reroute when starting shards Today we reroute the cluster as part of the process of starting a shard, which runs at `URGENT` priority. In large clusters, rerouting may take some time to complete, and this means that a mere trickle of shard-started events can cause starvation for other, lower-priority, tasks that are pending on the master. However, it isn't really necessary to perform a reroute when starting a shard, as long as one occurs eventually. This commit removes the inline reroute from the process of starting a shard and replaces it with a deferred one that runs at `NORMAL` priority, avoiding starvation of higher-priority tasks. This may improve some of the situations related to elastic#42738 and elastic#42105. * Specific test case for followup priority setting We cannot set the priority in all InternalTestClusters because the deprecation warning makes some tests unhappy. This commit adds a specific test instead. * Checkstyle * Cluster state always changed here * Assert consistency of routing nodes * Restrict setting only to reasonable priorities
The change in elastic#44433 introduces a state in which the cluster has no relocating shards but still has a pending reroute task which might start a shard relocation. `TransportSearchFailuresIT` failed on a PR build seemingly because it did not wait for this pending task to complete too, reporting more active shards than expected: 2> java.lang.AssertionError: Expected: <9> but: was <10> at __randomizedtesting.SeedInfo.seed([4057CA4301FE95FA:207EC88573747235]:0) at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at org.junit.Assert.assertThat(Assert.java:956) at org.junit.Assert.assertThat(Assert.java:923) at org.elasticsearch.search.basic.TransportSearchFailuresIT.testFailedSearchWithWrongQuery(TransportSearchFailuresIT.java:97) This commit addresses this failure by waiting until there are neither pending tasks nor shard relocations in progress.
The change in elastic#44433 introduces a state in which the cluster has no relocating shards but still has a pending reroute task which might start a shard relocation. `TransportSearchFailuresIT` failed on a PR build seemingly because it did not wait for this pending task to complete too, reporting more active shards than expected: 2> java.lang.AssertionError: Expected: <9> but: was <10> at __randomizedtesting.SeedInfo.seed([4057CA4301FE95FA:207EC88573747235]:0) at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at org.junit.Assert.assertThat(Assert.java:956) at org.junit.Assert.assertThat(Assert.java:923) at org.elasticsearch.search.basic.TransportSearchFailuresIT.testFailedSearchWithWrongQuery(TransportSearchFailuresIT.java:97) This commit addresses this failure by waiting until there are neither pending tasks nor shard relocations in progress.
The change in #44433 introduces a state in which the cluster has no relocating shards but still has a pending reroute task which might start a shard relocation. `TransportSearchFailuresIT` failed on a PR build seemingly because it did not wait for this pending task to complete too, reporting more active shards than expected: 2> java.lang.AssertionError: Expected: <9> but: was <10> at __randomizedtesting.SeedInfo.seed([4057CA4301FE95FA:207EC88573747235]:0) at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at org.junit.Assert.assertThat(Assert.java:956) at org.junit.Assert.assertThat(Assert.java:923) at org.elasticsearch.search.basic.TransportSearchFailuresIT.testFailedSearchWithWrongQuery(TransportSearchFailuresIT.java:97) This commit addresses this failure by waiting until there are neither pending tasks nor shard relocations in progress.
Today we reroute the cluster as part of the process of starting a shard, which runs at `URGENT` priority. In large clusters, rerouting may take some time to complete, and this means that a mere trickle of shard-started events can cause starvation for other, lower-priority, tasks that are pending on the master. However, it isn't really necessary to perform a reroute when starting a shard, as long as one occurs eventually. This commit removes the inline reroute from the process of starting a shard and replaces it with a deferred one that runs at `NORMAL` priority, avoiding starvation of higher-priority tasks. Backport of #44433 and #44543.
In elastic#44433 we introduced a temporary (immediately deprecated) escape-hatch setting to control the priority of the reroute scheduled after starting a batch of shards. This commit removes this setting in `master`, fixing the followup reroute's priority at `NORMAL`.
In #44433 we introduced a temporary (immediately deprecated) escape-hatch setting to control the priority of the reroute scheduled after starting a batch of shards. This commit removes this setting in `master`, fixing the followup reroute's priority at `NORMAL`.
Adds a `waitForEvents(Priority.LANGUID)` to the cluster health request in `ESIntegTestCase#waitForRelocation()` to deal with the case that this health request returns successfully despite the fact that there is a pending reroute task which will relocate another shard. Relates elastic#44433 Fixes elastic#45003
This adds the `IndexBalanceTests` ES tests based on elastic/elasticsearch@5dda2b0 because the more recent version requires a not yet backport patch (elastic/elasticsearch#44433).
This adds the `IndexBalanceTests` ES tests based on elastic/elasticsearch@5dda2b0 because the more recent version requires a not yet backport patch (elastic/elasticsearch#44433).
Today we reroute the cluster as part of the process of starting a shard, which
runs at
URGENTpriority. In large clusters, rerouting may take some time tocomplete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.
However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at
NORMALpriority, avoiding starvation of higher-priority tasks.This may improve some of the situations related to #42738 and #42105.