Do not recommend increasing max_shards_per_node #120458

DaveCTurner merged 3 commits into elastic:main
Today if the `shards_capacity` health indicator detects a problem then it recommends increasing the limit, which goes against the advice in the manual about not increasing these limits and also makes it rather pointless having a limit in the first place. This commit improves the recommendation to suggest either adding nodes or else reducing the shard count.
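To illustrate the check this indicator is based on, here is a minimal sketch of the cluster-wide shard-capacity rule: the per-node setting multiplied by the number of applicable nodes gives the cluster's shard budget. This is not the actual `ShardLimitValidator` code; the class and method names below are illustrative only.

```java
// Hypothetical sketch, not Elasticsearch source: the cluster-wide limit is
// cluster.max_shards_per_node (default 1000) times the number of matching nodes.
public class ShardCapacityCheck {

    static boolean exceedsCapacity(int totalShards, int nodeCount, int maxShardsPerNode) {
        // Use long arithmetic so large clusters don't overflow int.
        return totalShards > (long) nodeCount * maxShardsPerNode;
    }

    public static void main(String[] args) {
        // With 3 data nodes and the default limit, the budget is 3000 shards.
        System.out.println(exceedsCapacity(2900, 3, 1000)); // false: within capacity
        System.out.println(exceedsCapacity(3100, 3, 1000)); // true: add nodes or reduce shards
    }
}
```

When the check trips, the remediation this PR recommends is to grow `nodeCount` or shrink `totalShards`, rather than raising `maxShardsPerNode`.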
Pinging @elastic/es-data-management (Team:Data Management)
Hi @DaveCTurner, I've created a changelog YAML for you.
```diff
 static final Diagnosis SHARDS_MAX_CAPACITY_REACHED_DATA_NODES = SHARD_MAX_CAPACITY_REACHED_FN.apply(
-    "increase_max_shards_per_node",
+    "decrease_shards_per_non_frozen_node",
     ShardLimitValidator.SETTING_CLUSTER_MAX_SHARDS_PER_NODE,
-    "data"
+    "non-frozen"
 );
 static final Diagnosis SHARDS_MAX_CAPACITY_REACHED_FROZEN_NODES = SHARD_MAX_CAPACITY_REACHED_FN.apply(
-    "increase_max_shards_per_node_frozen",
+    "decrease_shards_per_frozen_node",
     ShardLimitValidator.SETTING_CLUSTER_MAX_SHARDS_PER_NODE_FROZEN,
```
The bad "increase the limit" advice was baked into the actual diagnosis IDs - fixed here. See also https://github.com/elastic/telemetry/pull/4362 for the corresponding change to the telemetry cluster.
Hey @DaveCTurner, you are bringing up a very good point here. I do have a concern though. If I am not mistaken, the current limit is quite low, so it is probable that it would make sense to first increase the limit before expanding the cluster or reducing the shards. So, I am thinking of 2 options to make this more useful to users:
Does this make sense?
The default of 1000 shards per node is still rather relaxed IMO, at least for high-segment-count or high-field-count indices, and we do want users to stick to it for now. We do get support cases involving egregiously high shard-per-node counts sometimes, and we need to be able to point at the guidance in the manual when telling users to scale up their clusters. It rather weakens that argument when the health API told them specifically to keep on relaxing the limit each time they got close. A better limit would be nice ofc, maybe one based on #111123, but that won't be a quick process and I don't think we can in good conscience block this change on that work.
gmarouli left a comment
LGTM! Thanks for raising and addressing this, @DaveCTurner
Thanks @gmarouli
💔 Backport failed
You can use sqren/backport to manually backport by running
Backported to 8.x in de5be24