Retry internally when CAS upload is throttled [GCS] #120250
nicktindall merged 9 commits into elastic:main
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
    if (retries.hasNext()) {
        try {
            // noinspection BusyWait
            Thread.sleep(retries.next().millis());
If we're good with retrying the whole thing from the start in the event of a throttle, we could do this one level up (where it's async) so we don't have to sleep.
Sleeping seems ok here to me; if we're being throttled on a CAS then we probably shouldn't be freeing up the thread to do some other blob-store operation.
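To make the pattern under discussion concrete, here is a minimal, self-contained sketch of a blocking retry loop of this shape; the class and method names are placeholders, and this is not the actual PR code:

```java
import java.util.Iterator;
import java.util.function.BooleanSupplier;

// Placeholder sketch, not the PR's implementation: retry the CAS attempt while it
// reports a throttled failure, sleeping between attempts rather than freeing the
// thread for other blob-store work.
final class ThrottledCasRetrySketch {
    static boolean runWithRetries(BooleanSupplier casAttempt, Iterator<Long> backoffMillis) throws InterruptedException {
        while (true) {
            if (casAttempt.getAsBoolean()) {
                return true;
            }
            if (backoffMillis.hasNext() == false) {
                return false; // retries exhausted, let the caller see the failure
            }
            // noinspection BusyWait -- deliberate: we are being throttled, so there is
            // little point handing this thread back to do other blob-store operations
            Thread.sleep(backoffMillis.next());
        }
    }
}
```

Doing it "one level up" asynchronously, as suggested above, would instead schedule the next attempt on a timer rather than calling Thread.sleep.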
DaveCTurner left a comment
LGTM
Just checking my understanding though, the Azure implementation already does what we want, right?
This problem was unique due to the way GCP relied on the outer scope to do the retry (on throttling, it simulated a failure to CAS which would trigger a re-attempt). Azure doesn't do that; instead, if it gets throttled it'll propagate that error out. We don't have any retries beyond those built into the client, but as far as I know we haven't seen the analysis test fail.
    static final Setting<TimeValue> RETRY_THROTTLED_CAS_DELAY_INCREMENT = Setting.timeSetting(
        "throttled_cas_retry.delay_increment",
        TimeValue.timeValueMillis(100),
        TimeValue.ZERO
    );
    static final Setting<Integer> RETRY_THROTTLED_CAS_MAX_NUMBER_OF_RETRIES = Setting.intSetting(
        "throttled_cas_retry.maximum_number_of_retries",
        2,
        0
    );
    static final Setting<TimeValue> RETRY_THROTTLED_CAS_MAXIMUM_DELAY = Setting.timeSetting(
        "throttled_cas_retry.maximum_delay",
        TimeValue.timeValueSeconds(5),
        TimeValue.ZERO
    );
Are these settings registered anywhere?
No, also I should document them 👍
Scratch that. I forgot repository metadata are not true settings.
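For anyone following along, a hedged sketch of what "not true settings" means in practice: per-repository values like these are read straight out of the repository metadata's Settings when the repository is constructed, rather than being registered with the node's setting infrastructure. The helper method below is hypothetical; RepositoryMetadata and Setting#get(Settings) are existing Elasticsearch APIs.

```java
import org.elasticsearch.cluster.metadata.RepositoryMetadata;
import org.elasticsearch.core.TimeValue;

// Hypothetical helper: resolve the retry delay increment for a given repository.
// Because the value comes from the repository's own settings blob, it never needs
// to be registered the way node/cluster-level settings do.
static TimeValue retryDelayIncrement(RepositoryMetadata repositoryMetadata) {
    return RETRY_THROTTLED_CAS_DELAY_INCREMENT.get(repositoryMetadata.settings());
}
```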
Should this be labelled as
I guess it should, given that it changes actual behaviour. Will update.
Hi @nicktindall, I've created a changelog YAML for you.
IIUC, with this PR, the GCP implementation should also do the same after the retries are exhausted, right? I think one main difference from the previous behaviour is that we will get a clear exception, which helps troubleshooting, instead of a return value of MISSING.
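As a rough, self-contained illustration of that difference (the types and method names below are placeholders, not the actual register API): the old behaviour made a throttled upload indistinguishable from an empty/failed result, while the new behaviour surfaces an exception the operator can act on.

```java
import java.io.IOException;
import java.util.OptionalLong;

// Placeholder sketch of the two ways a throttled CAS can surface to the caller.
final class ThrottleSurfacingSketch {

    // Roughly the old behaviour: throttling reported the same way as any other
    // failed/absent result, so there was nothing useful to troubleshoot with.
    static OptionalLong casReturningEmptyWhenThrottled(boolean throttled) {
        return throttled ? OptionalLong.empty() : OptionalLong.of(42L);
    }

    // Roughly the new behaviour: once the internal retries are exhausted, the
    // throttling error itself propagates with a clear message.
    static OptionalLong casThrowingWhenThrottled(boolean throttled) throws IOException {
        if (throttled) {
            throw new IOException("CAS upload throttled and internal retries exhausted");
        }
        return OptionalLong.of(42L);
    }
}
```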
Fixes #116546
I've only changed the case where we are throttled trying to upload the new register contents, because currently that was the only place we returned MISSING when we were throttled. Do we think it'd make more sense to start the whole CAS again in the event that ANY of the requests are throttled?

It looks like by default, GCS is configured with:
initial retry delay = 1s
retry delay multiplier = 2
max retry delay = 32s
max attempts = 6
So by adding another layer of retries, this CAS could end up taking some time. By default I allowed two retries, which takes the maximum total time out to 96s.
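For reference, a sketch of what those client-side defaults correspond to in the google-cloud gax RetrySettings builder (this is not how Elasticsearch actually constructs its GCS client, and the org.threeten.bp Duration overloads are an assumption about the gax version; it is shown only to make the listed numbers and the 96s figure concrete):

```java
import com.google.api.gax.retrying.RetrySettings;
import org.threeten.bp.Duration;

public class GcsRetryDefaultsSketch {
    public static void main(String[] args) {
        // Mirrors the defaults listed above; illustrative only.
        RetrySettings defaults = RetrySettings.newBuilder()
            .setInitialRetryDelay(Duration.ofSeconds(1)) // initial retry delay = 1s
            .setRetryDelayMultiplier(2.0)                // retry delay multiplier = 2
            .setMaxRetryDelay(Duration.ofSeconds(32))    // max retry delay = 32s
            .setMaxAttempts(6)                           // max attempts = 6
            .build();
        System.out.println(defaults);
        // Worst-case back-off within one attempt cycle is 1 + 2 + 4 + 8 + 16 = 31s,
        // so three cycles (the original attempt plus two outer retries) lands near
        // the ~96s worst case mentioned above once the outer delays are included.
    }
}
```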