Retry internally when CAS upload is throttled [GCS] #120250
nicktindall merged 9 commits into elastic:main
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
    if (retries.hasNext()) {
        try {
            // noinspection BusyWait
            Thread.sleep(retries.next().millis());
If we're good with retrying the whole thing from the start in the event of a throttle, we could do this one level up (where it's async) so we don't have to sleep.
Sleeping seems ok here to me; if we're being throttled on a CAS then we probably shouldn't be freeing up the thread to do some other blob-store operation.
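To make the pattern under discussion concrete, here is a minimal, self-contained sketch of a blocking retry loop of this shape; the class and method names are placeholders, and this is not the actual PR code:

```java
import java.util.Iterator;
import java.util.function.BooleanSupplier;

// Placeholder sketch, not the PR's implementation: retry the CAS attempt while it
// reports a throttled failure, sleeping between attempts rather than freeing the
// thread for other blob-store work.
final class ThrottledCasRetrySketch {
    static boolean runWithRetries(BooleanSupplier casAttempt, Iterator<Long> backoffMillis) throws InterruptedException {
        while (true) {
            if (casAttempt.getAsBoolean()) {
                return true;
            }
            if (backoffMillis.hasNext() == false) {
                return false; // retries exhausted, let the caller see the failure
            }
            // noinspection BusyWait -- deliberate: we are being throttled, so there is
            // little point handing this thread back to do other blob-store operations
            Thread.sleep(backoffMillis.next());
        }
    }
}
```

Doing it "one level up" asynchronously, as suggested above, would instead schedule the next attempt on a timer rather than calling Thread.sleep.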
DaveCTurner left a comment
LGTM
Just checking my understanding though, the Azure implementation already does what we want, right?
This problem was unique due to the way GCP relied on the outer scope to do the retry (on throttling, it simulated a failure to CAS which would trigger a re-attempt). Azure doesn't do that; instead, if it gets throttled it'll propagate that error out. We don't have any retries beyond those built into the client, but as far as I know we haven't seen the analysis test fail.
    static final Setting<TimeValue> RETRY_THROTTLED_CAS_DELAY_INCREMENT = Setting.timeSetting(
        "throttled_cas_retry.delay_increment",
        TimeValue.timeValueMillis(100),
        TimeValue.ZERO
    );
    static final Setting<Integer> RETRY_THROTTLED_CAS_MAX_NUMBER_OF_RETRIES = Setting.intSetting(
        "throttled_cas_retry.maximum_number_of_retries",
        2,
        0
    );
    static final Setting<TimeValue> RETRY_THROTTLED_CAS_MAXIMUM_DELAY = Setting.timeSetting(
        "throttled_cas_retry.maximum_delay",
        TimeValue.timeValueSeconds(5),
        TimeValue.ZERO
    );
Are these settings registered anywhere?
No, also I should document them 👍
Scratch that. I forgot repository metadata are not true settings.
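For anyone following along, a hedged sketch of what "not true settings" means in practice: per-repository values like these are read straight out of the repository metadata's Settings when the repository is constructed, rather than being registered with the node's setting infrastructure. The helper method below is hypothetical; RepositoryMetadata and Setting#get(Settings) are existing Elasticsearch APIs.

```java
import org.elasticsearch.cluster.metadata.RepositoryMetadata;
import org.elasticsearch.core.TimeValue;

// Hypothetical helper: resolve the retry delay increment for a given repository.
// Because the value comes from the repository's own settings blob, it never needs
// to be registered the way node/cluster-level settings do.
static TimeValue retryDelayIncrement(RepositoryMetadata repositoryMetadata) {
    return RETRY_THROTTLED_CAS_DELAY_INCREMENT.get(repositoryMetadata.settings());
}
```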
Should this be labelled as
I guess it should, given that it changes actual behaviour. Will update.
Hi @nicktindall, I've created a changelog YAML for you.
IIUC, with this PR, the GCP implementation should also do the same after the retries are exhausted, right? I think one main difference from the previous behaviour is that we will get a clear exception, which helps troubleshooting, instead of a return value of MISSING.
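As a rough, self-contained illustration of that difference (the types and method names below are placeholders, not the actual register API): the old behaviour made a throttled upload indistinguishable from an empty/failed result, while the new behaviour surfaces an exception the operator can act on.

```java
import java.io.IOException;
import java.util.OptionalLong;

// Placeholder sketch of the two ways a throttled CAS can surface to the caller.
final class ThrottleSurfacingSketch {

    // Roughly the old behaviour: throttling reported the same way as any other
    // failed/absent result, so there was nothing useful to troubleshoot with.
    static OptionalLong casReturningEmptyWhenThrottled(boolean throttled) {
        return throttled ? OptionalLong.empty() : OptionalLong.of(42L);
    }

    // Roughly the new behaviour: once the internal retries are exhausted, the
    // throttling error itself propagates with a clear message.
    static OptionalLong casThrowingWhenThrottled(boolean throttled) throws IOException {
        if (throttled) {
            throw new IOException("CAS upload throttled and internal retries exhausted");
        }
        return OptionalLong.of(42L);
    }
}
```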
Fixes #116546
I've only changed the case where we are throttled trying to upload the new register contents, because currently that was the only place we returned MISSING when we were throttled. Do we think it'd make more sense to start the whole CAS again in the event that ANY of the requests are throttled?

It looks like by default, GCS is configured with:
initial retry delay = 1s
retry delay multiplier = 2
max retry delay = 32s
max attempts = 6
So by adding another layer of retries, this CAS could end up taking some time. By default I allowed two retries, which takes the maximum total time out to 96s.
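For reference, a sketch of what those client-side defaults correspond to in the google-cloud gax RetrySettings builder (this is not how Elasticsearch actually constructs its GCS client, and the org.threeten.bp Duration overloads are an assumption about the gax version; it is shown only to make the listed numbers and the 96s figure concrete):

```java
import com.google.api.gax.retrying.RetrySettings;
import org.threeten.bp.Duration;

public class GcsRetryDefaultsSketch {
    public static void main(String[] args) {
        // Mirrors the defaults listed above; illustrative only.
        RetrySettings defaults = RetrySettings.newBuilder()
            .setInitialRetryDelay(Duration.ofSeconds(1)) // initial retry delay = 1s
            .setRetryDelayMultiplier(2.0)                // retry delay multiplier = 2
            .setMaxRetryDelay(Duration.ofSeconds(32))    // max retry delay = 32s
            .setMaxAttempts(6)                           // max attempts = 6
            .build();
        System.out.println(defaults);
        // Worst-case back-off within one attempt cycle is 1 + 2 + 4 + 8 + 16 = 31s,
        // so three cycles (the original attempt plus two outer retries) lands near
        // the ~96s worst case mentioned above once the outer delays are included.
    }
}
```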