Retry throttled snapshot deletions #113237
Conversation
    aex.set(ExceptionsHelper.useOrSuppress(aex.get(), e));
    return;
} catch (AmazonClientException e) {
    if (shouldRetryDelete(purpose) && RetryUtils.isThrottlingException(e)) {
I used the AWS client's `RetryUtils.isThrottlingException`. It looks like throttling exceptions come in many forms, and I assume this logic will be kept up to date.
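For illustration, a minimal sketch of the kind of retry loop this enables. The method, its parameters (`deleteBatch`, `maxRetries`, `delayIncrementMillis`) and the linear back-off shape are illustrative assumptions, not the actual code in `S3BlobContainer`:

```java
import com.amazonaws.AmazonClientException;
import com.amazonaws.retry.RetryUtils;

class ThrottledDeleteRetrySketch {
    // Sketch only: retry a batch delete, but only when S3 reports throttling.
    static void deleteWithThrottleRetry(Runnable deleteBatch, int maxRetries, long delayIncrementMillis)
        throws InterruptedException {
        int attempt = 0;
        while (true) {
            try {
                deleteBatch.run();
                return;
            } catch (AmazonClientException e) {
                // Give up if the failure is not throttling, or we are out of retries
                if (RetryUtils.isThrottlingException(e) == false || attempt >= maxRetries) {
                    throw e;
                }
                attempt++;
                // Linear back-off between attempts (the PR makes the increment and cap configurable)
                Thread.sleep(delayIncrementMillis * attempt);
            }
        }
    }
}
```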
int expectedNumberOfBatches = (blobsToDelete.size() / 1_000) + (blobsToDelete.size() % 1_000 == 0 ? 0 : 1);
assertThat(numberOfDeleteAttempts.get(), equalTo(throttleTimesBeforeSuccess + expectedNumberOfBatches));
assertThat(numberOfSuccessfulDeletes.get(), equalTo(expectedNumberOfBatches));
}
This test class seemed very geared towards retrying the input stream, but it seemed like the most appropriate home for this test.
Hi @nicktindall, I've created a changelog YAML for you.
DaveCTurner left a comment
> I assume we don't want to change behaviour in the event of an interrupt. I've written it initially to abort the retries for any batch in progress and any subsequent batches (via the preserved interrupt flag), but to continue attempting to delete the remaining batches.

I think preserving the interrupt flag should abort subsequent operations on this thread, since these should be interruptible IO operations. That seems like the right behaviour to me.

> Do we want to restrict this behaviour to `SNAPSHOT_DATA` and `SNAPSHOT_METADATA` (or even just `SNAPSHOT_DATA`), or do we think it's desirable across the board? If we're applying it more broadly, perhaps we should limit how long we retry for?

I suspect we should limit retries everywhere, so that if the repository happens to be completely wedged then this snapshot thread will eventually go and do something else on a different repository. I think we should also drop and re-acquire the `clientReference` on each attempt, so that if the repository is closed then we'll fail sooner too.

> Do we want the interval and back-off to be configurable via settings?

Yes.

> I've put a histogram of how many attempts it takes to delete the batches, do we want something similar for how long they take?

I don't have a strong opinion on this.
assertThrows(
    Exception.class, /* ? */
    () -> blobContainer.deleteBlobsIgnoringIfNotExists(randomFrom(operationPurposesThatRetryOnDelete()), blobsToDelete.iterator())
);
The idea here was to close the `S3Service` (mimicking the repository being closed) during the retries and observe that they were aborted, but it wasn't obvious why that would work, and indeed it doesn't work in the test; perhaps something has been stubbed that changes the behaviour?
The `S3Service#close` method releases the cached clients, but I still seem to be able to create a new one after that's occurred, and the comments appear to indicate that's by design.
Perhaps we could set a flag when the `S3BlobStore` is closed and check that instead?
Hm ok I see, bit weird that we are so lenient here and let you acquire a client with an arbitrary name. I'm not sure we really need a test for this, at least not specifically for this context.
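As a minimal sketch of the closed-flag idea mentioned above: an `AtomicBoolean` set on close and checked before each retry. The names (`isClosed`, `throwIfClosed`) are illustrative, not taken from the PR:

```java
import java.util.concurrent.atomic.AtomicBoolean;

class S3BlobStoreClosedFlagSketch {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    // Called from the blob store's close/shutdown path
    void close() {
        closed.set(true);
    }

    // Checked before each retry so a closed repository stops retrying promptly
    void throwIfClosed() {
        if (closed.get()) {
            throw new IllegalStateException("blob store is closed, aborting delete retries");
        }
    }
}
```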
AtomicReference<AmazonS3Reference> clientReferenceHolder,
List<String> partition,
AtomicReference<Exception> aex
) {
I don't love this change in the API, but it seemed the lightest-touch way to allow the client reference to be re-acquired when a batch is retried while still allowing re-use between batches in the happy case. Open to moving to reference-per-batch if we think the extra allocations are fine.
The clients themselves are cached so perhaps reference-per-batch is OK. We'd be building the settings object for each acquisition though.
Yeah reference-per-batch should be fine.
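For illustration, a sketch of the reference-per-batch shape being discussed. `ClientReference` here is a hypothetical stand-in for Elasticsearch's ref-counted `AmazonS3Reference`, and `acquireReference` stands in for the blob store's client accessor; only the general pattern (acquire per batch, release via try-with-resources) is the point:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import java.util.List;
import java.util.function.Supplier;

class PerBatchClientSketch {
    // Hypothetical stand-in for the ref-counted client reference: acquired per batch
    // and released when the batch is done.
    interface ClientReference extends AutoCloseable {
        AmazonS3 client();

        @Override
        void close(); // release the reference without a checked exception
    }

    static void deleteInBatches(Supplier<ClientReference> acquireReference, String bucket, List<List<String>> batches) {
        for (List<String> batch : batches) {
            // Re-acquire the reference for every batch so that a closed repository
            // makes the next acquisition (and hence the whole delete) fail promptly.
            try (ClientReference reference = acquireReference.get()) {
                reference.client().deleteObjects(new DeleteObjectsRequest(bucket).withKeys(batch.toArray(new String[0])));
            }
        }
    }
}
```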
modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3BlobStore.java (resolved)
modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Repository.java (resolved)
repositoryName,
Settings.builder()
    .put(repositorySettings(repositoryName))
    .put(S3ClientSettings.MAX_RETRIES_SETTING.getConcreteSettingForNamespace("placeholder").getKey(), 0)
It seems as though putting client settings in the repository settings like this is deprecated, and/or I've done it wrong. Any advice on how to override AWS client internal retries on a per-test basis would be appreciated.
I tried making a different client with that setting in the node settings (set at the class level), but there are a lot of additional settings configured for the test client that I'd need to duplicate to make that work.
It feels like the correct way to do this would be to change the client settings in S3BlobStoreRepositoryTests to be configured for the default client, allowing them to be selectively overridden in individual tests. But that's a larger change. Perhaps there's an easier/better way?
I think it's ok in tests, we don't have a plan for removing this apparently-deprecated functionality any time soon.
Pinging @elastic/es-distributed (Team:Distributed)
DaveCTurner left a comment
Looks fine to me, I left a few tiny comments. We'll learn from experience whether this is enough to keep S3 happy or whether further adjustments are needed.
assertThat(handler.numberOfSuccessfulDeletes.get(), equalTo(0));
} finally {
    // Clear the interrupt (this seemed to leak between tests)
    Thread.interrupted();
nit: can we `assertTrue(Thread.interrupted());` instead?
| logger.warn("Aborting delete retries due to interrupt"); | ||
| } | ||
| } else { | ||
| logger.warn("Exceeded maximum delete retries, aborting"); |
Could we have some more detail in this message about what exactly we're aborting, and the retry strategy we were using?
Sorry I meant more details like the number of times we retried, and refs to the settings that can be used to influence this and/or a link to the relevant docs. Otherwise IME we eventually get support cases asking for that information.
Which reminds me, we don't document these new settings either, but we should.
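As a sketch of the kind of message being asked for; the method shape, variable names, and setting names other than `throttled_delete_retry.maximum_delay` are illustrative placeholders, not what the PR ends up using:

```java
// Illustrative only: include the retry count and point at the settings that control
// retries, so support cases have the context they need.
private void logAbandonedDeletes(String bucket, int retriesAttempted) {
    logger.warn(
        "failed to delete blobs from bucket [{}] after [{}] throttled attempts, giving up; "
            + "retry behaviour can be adjusted via the [throttled_delete_retry.*] repository settings "
            + "(see the S3 repository documentation)",
        bucket,
        retriesAttempted
    );
}
```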
    }
}

static class S3ErrorResponse {
nit: looks like a record to me?
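For example, assuming the class just wraps a response status and body (the field names here are guesses, not the actual fields in the test):

```java
import org.elasticsearch.rest.RestStatus;

// Hypothetical record form of the test helper; the fields are assumed, not taken from the PR.
record S3ErrorResponse(RestStatus status, String responseBody) {}
```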
private final Queue<S3ErrorResponse> errorStatusQueue;

S3MetricErroneousHttpHandler(HttpHandler delegate, Queue<S3ErrorResponse> errorStatusQueue) {
Naming nit: `errorStatusQueue` → `errorResponseQueue`:
S3MetricErroneousHttpHandler(HttpHandler delegate, Queue<S3ErrorResponse> errorResponseQueue) {
DaveCTurner left a comment
Looks good, just a handful of nits.
`throttled_delete_retry.maximum_delay`::
(integer) This is the upper bound on how long the delays between retries will grow to. Default is 500ms, minimum is 0ms.
Likewise
(<<time-units,time value>>) This is the upper bound on how long the delays between retries will grow to. Default is 500ms, minimum is 0ms.
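As a usage illustration, the setting is supplied as part of the repository settings. Only `throttled_delete_retry.maximum_delay` is taken from this PR; the bucket name and the wrapping class are placeholders:

```java
import org.elasticsearch.common.settings.Settings;

class RepositorySettingsExample {
    // Illustrative only: repository settings with a larger cap on the throttled-delete retry delay.
    static Settings exampleRepositorySettings() {
        return Settings.builder()
            .put("bucket", "my-bucket") // placeholder bucket name
            .put("throttled_delete_retry.maximum_delay", "5s") // setting documented above
            .build();
    }
}
```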
    }
}

public void testLinearBackoffWithLimit() {
Maybe also a test without the limit?
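Something along these lines, assuming a hypothetical `delayFor(attempt, increment, maximumDelay)` helper standing in for whatever computes the back-off (the real method and class under test are not shown in this excerpt):

```java
import org.elasticsearch.core.TimeValue;

public void testLinearBackoffWithoutLimit() {
    // With a very large cap, the delay should keep growing linearly with the attempt number.
    // delayFor(...) is a hypothetical name for the back-off calculation being tested.
    TimeValue increment = TimeValue.timeValueMillis(10);
    TimeValue effectivelyNoLimit = TimeValue.timeValueHours(1);
    for (int attempt = 1; attempt <= 100; attempt++) {
        assertEquals(increment.millis() * attempt, delayFor(attempt, increment, effectivelyNoLimit).millis());
    }
}
```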
);
static final Setting<TimeValue> RETRY_THROTTLED_DELETE_MAXIMUM_DELAY = Setting.timeSetting(
    "throttled_delete_retry.maximum_delay",
    new TimeValue(500, TimeUnit.MILLISECONDS),
500ms feels a little short for my taste, I'd expect something more like 5s here.
modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Repository.java (outdated, resolved)
modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Repository.java (outdated, resolved)
Closes ES-8562

The change is:
- When the `OperationPurpose` is `SNAPSHOT_DATA` or `SNAPSHOT_METADATA` we will retry when throttled, with a progressive back-off, up to (by default) 8 times over about 5 seconds.
- If the thread is interrupted while retrying, the retries are aborted with an `AbortedException`, which will propagate wrapped in an `IOException`.
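To make "8 times over about 5 seconds" concrete, here is an illustrative calculation; the increment value and the exact back-off shape are assumptions made for the sake of the arithmetic, not the PR's actual defaults:

```java
class BackoffArithmeticSketch {
    public static void main(String[] args) {
        // Illustrative only: assuming a linear back-off where the delay before retry n
        // is n * increment, 8 retries with an increment of ~125ms wait roughly
        // 125 + 250 + ... + 1000 = 4500ms in total, i.e. about 5 seconds.
        long incrementMillis = 125; // assumed value, not necessarily the PR's default
        long totalWaitMillis = 0;
        for (int retry = 1; retry <= 8; retry++) {
            totalWaitMillis += incrementMillis * retry;
        }
        System.out.println(totalWaitMillis + "ms"); // prints 4500ms
    }
}
```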