Retry ES|QL node requests on shard level failures (#120774) #121879
Merged
dnhatn merged 4 commits into elastic:8.x from Feb 15, 2025
Conversation
Today, ES|QL fails fast on any failure. This PR introduces support for retrying within a cluster when data-node requests fail.

There are two types of failures that occur with data-node requests: entire request failures and individual shard failures. For individual shard failures, we can retry the next copies of the failing shards. For entire request failures, we can retry every shard in the node request if no pages have been received.

On the handling side, ES|QL executes against a batch of shards concurrently. Here, we need to track whether any pages have been produced. If pages have been produced, the entire request must fail. Otherwise, we can track the failed shards and send them back to the sender for retries.

There are two decisions around how quickly we should retry:

1. Should we notify the sender of failing shards immediately (via a different channel) to enable quick retries, or should we accumulate failures and return them in the final response?
2. What is the maximum number of inflight requests we should allow on the sending side?

This PR considers failures often occurring when the cluster is under load or during a rolling upgrade. To prevent retries from adding more load and to allow the cluster to stabilize, this PR chooses to send shard failures in the final response and limits the number of inflight requests to one per data node.
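The sender-side bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the actual ES|QL implementation: the class and method names (`ShardRetryTracker`, `nextCopy`) are hypothetical, and it models only the "retry the next copy of a failing shard" idea.

```java
import java.util.*;

// Hypothetical sketch of sender-side retry bookkeeping for shard-level
// failures. Each shard starts with an ordered list of copies (nodes that
// hold it); on failure we try the next copy, and a shard whose copies are
// exhausted is surfaced as an unresolved failure.
class ShardRetryTracker {
    // shardId -> copies (node ids) not yet tried for that shard
    private final Map<String, Deque<String>> remainingCopies = new HashMap<>();
    private final List<String> unresolved = new ArrayList<>();

    ShardRetryTracker(Map<String, List<String>> shardCopies) {
        shardCopies.forEach((shard, nodes) ->
            remainingCopies.put(shard, new ArrayDeque<>(nodes)));
    }

    // On a shard-level failure, pick the next untried copy, if any remain.
    Optional<String> nextCopy(String shardId) {
        Deque<String> copies = remainingCopies.get(shardId);
        if (copies == null || copies.isEmpty()) {
            unresolved.add(shardId); // no copies left: surface the failure
            return Optional.empty();
        }
        return Optional.of(copies.pollFirst());
    }

    List<String> unresolvedShards() {
        return List.copyOf(unresolved);
    }
}

public class RetrySketch {
    public static void main(String[] args) {
        ShardRetryTracker tracker = new ShardRetryTracker(
            Map.of("shard-0", List.of("node-a", "node-b")));
        System.out.println(tracker.nextCopy("shard-0")); // first copy: node-a
        System.out.println(tracker.nextCopy("shard-0")); // retry on node-b
        System.out.println(tracker.nextCopy("shard-0")); // exhausted: empty
        System.out.println(tracker.unresolvedShards()); // [shard-0]
    }
}
```

In the PR's design the retries themselves are driven from the final response of each node request, and the sender additionally caps inflight requests at one per data node; that throttling is omitted here for brevity.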
There are two issues in the current implementation:

1. We should use the list of shardIds from the request, rather than all targets, when removing failures for shards that have been successfully executed.
2. We should remove shardIds from the pending list once a failure is reported and abort execution at that point, as the results will be discarded.

Closes elastic#121966
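The second fix can be sketched as below. This is an illustrative model, not the real ES|QL code: the class `NodeRequestState` and its methods are hypothetical. It captures the invariant that once any shard failure is reported, the remaining pending shards are dropped (their results would be discarded anyway), and that shard failures are only retryable if no pages have reached the sender.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical handler-side state for one data-node request: tracks pending
// shards, recorded shard failures, and whether any pages were produced.
class NodeRequestState {
    private final Set<String> pendingShards = ConcurrentHashMap.newKeySet();
    private final Map<String, Exception> failures = new ConcurrentHashMap<>();
    private volatile boolean pagesProduced = false;

    NodeRequestState(Collection<String> shardIds) {
        pendingShards.addAll(shardIds);
    }

    void onPageProduced() {
        pagesProduced = true;
    }

    // The fix: on the first reported failure, clear the pending list so the
    // remaining shards are not executed, since their results will be discarded.
    void onShardFailure(String shardId, Exception e) {
        failures.put(shardId, e);
        pendingShards.clear();
    }

    // Shard failures may be retried on other copies only if no pages were sent.
    boolean retryable() {
        return !pagesProduced;
    }

    boolean hasPending() {
        return !pendingShards.isEmpty();
    }

    Map<String, Exception> shardFailures() {
        return Map.copyOf(failures);
    }
}

public class PendingSketch {
    public static void main(String[] args) {
        NodeRequestState state = new NodeRequestState(List.of("s0", "s1", "s2"));
        state.onShardFailure("s0", new RuntimeException("boom"));
        System.out.println(state.hasPending()); // false: execution aborted
        System.out.println(state.retryable()); // true: no pages produced yet
    }
}
```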