Retry ES|QL node requests on shard level failures (#120774) #121879
Merged
dnhatn merged 4 commits into elastic:8.x from Feb 15, 2025
Conversation
Today, ES|QL fails fast on any failure. This PR introduces support for retrying within a cluster when data-node requests fail.

There are two types of failures that occur with data-node requests: entire request failures and individual shard failures. For individual shard failures, we can retry the next copies of the failing shards. For entire request failures, we can retry every shard in the node request if no pages have been received.

On the handling side, ES|QL executes against a batch of shards concurrently. Here, we need to track whether any pages have been produced. If pages have been produced, the entire request must fail. Otherwise, we can track the failed shards and send them back to the sender for retries.

There are two decisions around how quickly we should retry:

1. Should we notify the sender of failing shards immediately (via a different channel) to enable quick retries, or should we accumulate failures and return them in the final response?
2. What is the maximum number of inflight requests we should allow on the sending side?

This PR considers failures often occurring when the cluster is under load or during a rolling upgrade. To prevent retries from adding more load and to allow the cluster to stabilize, this PR chooses to send shard failures in the final response and limits the number of inflight requests to one per data node.
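The sender-side bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the actual ES|QL implementation: the class and method names (`ShardRetryTracker`, `nextCopy`) are hypothetical, and it models only the "retry the next copy of a failing shard" idea.

```java
import java.util.*;

// Hypothetical sketch of sender-side retry bookkeeping for shard-level
// failures. Each shard starts with an ordered list of copies (nodes that
// hold it); on failure we try the next copy, and a shard whose copies are
// exhausted is surfaced as an unresolved failure.
class ShardRetryTracker {
    // shardId -> copies (node ids) not yet tried for that shard
    private final Map<String, Deque<String>> remainingCopies = new HashMap<>();
    private final List<String> unresolved = new ArrayList<>();

    ShardRetryTracker(Map<String, List<String>> shardCopies) {
        shardCopies.forEach((shard, nodes) ->
            remainingCopies.put(shard, new ArrayDeque<>(nodes)));
    }

    // On a shard-level failure, pick the next untried copy, if any remain.
    Optional<String> nextCopy(String shardId) {
        Deque<String> copies = remainingCopies.get(shardId);
        if (copies == null || copies.isEmpty()) {
            unresolved.add(shardId); // no copies left: surface the failure
            return Optional.empty();
        }
        return Optional.of(copies.pollFirst());
    }

    List<String> unresolvedShards() {
        return List.copyOf(unresolved);
    }
}

public class RetrySketch {
    public static void main(String[] args) {
        ShardRetryTracker tracker = new ShardRetryTracker(
            Map.of("shard-0", List.of("node-a", "node-b")));
        System.out.println(tracker.nextCopy("shard-0")); // first copy: node-a
        System.out.println(tracker.nextCopy("shard-0")); // retry on node-b
        System.out.println(tracker.nextCopy("shard-0")); // exhausted: empty
        System.out.println(tracker.unresolvedShards()); // [shard-0]
    }
}
```

In the PR's design the retries themselves are driven from the final response of each node request, and the sender additionally caps inflight requests at one per data node; that throttling is omitted here for brevity.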
There are two issues in the current implementation:

1. We should use the list of shardIds from the request, rather than all targets, when removing failures for shards that have been successfully executed.
2. We should remove shardIds from the pending list once a failure is reported and abort execution at that point, as the results will be discarded.

Closes elastic#121966
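The second fix can be sketched as below. This is an illustrative model, not the real ES|QL code: the class `NodeRequestState` and its methods are hypothetical. It captures the invariant that once any shard failure is reported, the remaining pending shards are dropped (their results would be discarded anyway), and that shard failures are only retryable if no pages have reached the sender.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical handler-side state for one data-node request: tracks pending
// shards, recorded shard failures, and whether any pages were produced.
class NodeRequestState {
    private final Set<String> pendingShards = ConcurrentHashMap.newKeySet();
    private final Map<String, Exception> failures = new ConcurrentHashMap<>();
    private volatile boolean pagesProduced = false;

    NodeRequestState(Collection<String> shardIds) {
        pendingShards.addAll(shardIds);
    }

    void onPageProduced() {
        pagesProduced = true;
    }

    // The fix: on the first reported failure, clear the pending list so the
    // remaining shards are not executed, since their results will be discarded.
    void onShardFailure(String shardId, Exception e) {
        failures.put(shardId, e);
        pendingShards.clear();
    }

    // Shard failures may be retried on other copies only if no pages were sent.
    boolean retryable() {
        return !pagesProduced;
    }

    boolean hasPending() {
        return !pendingShards.isEmpty();
    }

    Map<String, Exception> shardFailures() {
        return Map.copyOf(failures);
    }
}

public class PendingSketch {
    public static void main(String[] args) {
        NodeRequestState state = new NodeRequestState(List.of("s0", "s1", "s2"));
        state.onShardFailure("s0", new RuntimeException("boom"));
        System.out.println(state.hasPending()); // false: execution aborted
        System.out.println(state.retryable()); // true: no pages produced yet
    }
}
```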