
Retry batch-delete items in GCS #138951

Merged
mhl-b merged 20 commits into elastic:main from mhl-b:gcs-bulk-delete-retry
Dec 24, 2025

Conversation

@mhl-b
Contributor

@mhl-b mhl-b commented Dec 3, 2025

Add retry logic for batch-delete items in GCS blob store. Algorithm
maximizes the size of every batch by merging retryable items from the
previous batch and new items. When batch results have failed items, we
first retry only a single item using the SDK client's retry strategy
(exponential-backoff). Retrying a single item should provide enough
time to back-off from throttling or temporary GCS failures. Once a
single item successfully retries, we proceed with the next batch,
combining the remaining failures and new items.

Roughly this:

  • create batch of 100 new items
  • submit batch
  • receive 100 results, with 10 retryable failures
  • retry 1 failure with a non-batched delete using the SDK retry strategy
  • (loop) create batch from 9 remaining failures and 91 new items
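The loop above can be sketched as follows. This is an illustration only, with hypothetical names (`submitBatch`, `retrySingleWithBackoff`, `failOnce`) standing in for the actual GCS SDK calls; only the control flow mirrors the description:

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the batch-merging retry loop described above.
public class BatchDeleteSketch {
    static final int BATCH_SIZE = 100;
    final Set<String> deleted = new HashSet<>();
    final Set<String> failOnce = new HashSet<>(); // simulate retryable failures on first attempt

    // Stand-in for the GCS batch call: deletes what it can and returns the
    // items that failed with a retryable error.
    List<String> submitBatch(List<String> batch) {
        List<String> failures = new ArrayList<>();
        for (String id : batch) {
            if (failOnce.remove(id)) {
                failures.add(id); // e.g. a simulated 503 on the first attempt
            } else {
                deleted.add(id);
            }
        }
        return failures;
    }

    // Stand-in for a non-batched delete; the real client applies the SDK's
    // exponential-backoff retry strategy here.
    void retrySingleWithBackoff(String id) {
        deleted.add(id);
    }

    void drainAndDelete(Deque<String> pending) {
        List<String> retryable = new ArrayList<>();
        while (pending.isEmpty() == false || retryable.isEmpty() == false) {
            // Merge leftover retryable failures with new items to fill the batch.
            List<String> batch = new ArrayList<>(retryable);
            retryable.clear();
            while (batch.size() < BATCH_SIZE && pending.isEmpty() == false) {
                batch.add(pending.removeFirst());
            }
            retryable.addAll(submitBatch(batch));
            if (retryable.isEmpty() == false) {
                // Retry exactly one failure non-batched before the next round,
                // giving the service time to back off from throttling.
                retrySingleWithBackoff(retryable.removeLast());
            }
        }
    }
}
```

Note the sketch assumes every failure is retryable; the real change throws on non-retryable errors, which also guarantees the loop terminates.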

Also extend test fixture to randomly fail individual items in batch.

fix #138364

@mhl-b
Contributor Author

mhl-b commented Dec 3, 2025

@DaveCTurner, @joshua-adams-1
Using a draft to align on the approach for retrying bulk items.

@mhl-b mhl-b requested a review from nicktindall December 3, 2025 00:08
@mhl-b
Contributor Author

mhl-b commented Dec 3, 2025

@nicktindall, I didn't have a chance to review the GCS retry refactoring PR, but I want to verify that we still keep the retry strategy for non-stream calls.

@nicktindall
Contributor

@nicktindall, I didn't have a chance to review the GCS retry refactoring PR, but I want to verify that we still keep the retry strategy for non-stream calls.

Yes, only get-blob should be affected by my changes.

Contributor

@joshua-adams-1 joshua-adams-1 left a comment


After reading #138364 this looks good to me. I'm happy to approve once the CI issues are resolved. Could we also add unit tests for the deleteBlobs function?


private static boolean isRetryErrCode(int code) {
return switch (code) {
case 408, 429, 500, 502, 503, 504 -> true;
Contributor


  1. Can we replace these raw values with HttpURLConnection like here
  2. Is it worth explaining why we can retry on this?
Contributor


I'd rather we used the names in org.elasticsearch.rest.RestStatus but yes names >> numbers here, and if there's any docs about why we should retry these codes then it'd be great to link them in a comment.
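The reviewers' suggestion, named constants instead of raw codes, could look roughly like this sketch. `java.net.HttpURLConnection` has no constant for 429 (Too Many Requests), so that one stays a literal here; in the actual codebase `org.elasticsearch.rest.RestStatus` would cover all six. This is an illustration, not the PR's code:

```java
import static java.net.HttpURLConnection.HTTP_BAD_GATEWAY;
import static java.net.HttpURLConnection.HTTP_CLIENT_TIMEOUT;
import static java.net.HttpURLConnection.HTTP_GATEWAY_TIMEOUT;
import static java.net.HttpURLConnection.HTTP_INTERNAL_ERROR;
import static java.net.HttpURLConnection.HTTP_UNAVAILABLE;

public class RetryCodes {
    // Transient conditions: timeouts, throttling, and server-side errors
    // are worth retrying; anything else (e.g. 403, 404) is permanent.
    public static boolean isRetryErrCode(int code) {
        return switch (code) {
            case HTTP_CLIENT_TIMEOUT,   // 408
                429,                    // Too Many Requests: no HttpURLConnection constant
                HTTP_INTERNAL_ERROR,    // 500
                HTTP_BAD_GATEWAY,       // 502
                HTTP_UNAVAILABLE,       // 503
                HTTP_GATEWAY_TIMEOUT    // 504
                -> true;
            default -> false;
        };
    }
}
```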

Contributor Author


if (failedItems.isEmpty() == false) {
final var retryBlobId = failedItems.getLast().blobId;
try {
client().deleteBlob(retryBlobId);
Contributor


Two questions:

  1. I assume this blocks?
  2. Does storage.delete(blobId); use an exponential back off retry strategy?
Contributor Author


Yes and yes
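As an illustration of the behaviour being confirmed here, a hand-rolled exponential-backoff wrapper looks like the sketch below. The real GCS client configures this through its own retry settings; this helper and its names are hypothetical:

```java
import java.util.function.Supplier;

public class Backoff {
    // Calls `call`, retrying on RuntimeException up to maxAttempts times,
    // doubling the wait between attempts (exponential backoff).
    public static <T> T withBackoff(Supplier<T> call, int maxAttempts, long initialDelayMs)
        throws InterruptedException {
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted; surface the last error
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // double the wait before the next attempt
            }
        }
    }
}
```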

// remaining items go the next bulk
failedItems.removeLast();
} catch (StorageException e) {
throw new IOException(
Contributor


Can we also log the other elements in failedItems? Otherwise they would fail quietly

Contributor Author

@mhl-b mhl-b Dec 19, 2025


final var retryBlobId = failedItems.getLast().blobId;
try {
client().deleteBlob(retryBlobId);
// remaining items go the next bulk
Contributor


Suggested change
// remaining items go the next bulk
// remaining items go into the next bulk
if (isRetryErrCode(errCode)) {
failedItems.add(new DeleteFailure(deleteResult.blobId, e.getCode()));
} else {
throw new IOException("Failed to process bulk delete, non-retryable error for blobId=" + deleteResult.blobId, e);
Contributor


General question: If we fail to delete N blobs, are these subsequently cleaned up?

Contributor Author


I think so; there is an occasional cleanup of dangling blobs, but I don't know exactly.

Contributor

@DaveCTurner DaveCTurner left a comment


Looks sensible to me, but needs supporting changes in GoogleCloudStorageHttpHandler to exercise the retries properly.


private static boolean isRetryErrCode(int code) {
return switch (code) {
case 408, 429, 500, 502, 503, 504 -> true;
Contributor


I'd rather we used the names in org.elasticsearch.rest.RestStatus but yes names >> numbers here, and if there's any docs about why we should retry these codes then it'd be great to link them in a comment.

@mhl-b mhl-b force-pushed the gcs-bulk-delete-retry branch from 34d646d to 4216057 on December 19, 2025 04:43
@mhl-b mhl-b added >enhancement :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. labels Dec 19, 2025
@elasticsearchmachine
Collaborator

Hi @mhl-b, I've created a changelog YAML for you.

@mhl-b mhl-b marked this pull request as ready for review December 19, 2025 05:04
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@mhl-b mhl-b changed the title from Retry bulk-delete items in GCS to Retry batch-delete items in GCS on Dec 19, 2025
@elasticsearchmachine
Collaborator

Hi @mhl-b, I've updated the changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Dec 20, 2025
Contributor

@nicktindall nicktindall left a comment


LGTM

if (isRetryErrCode(errCode)) {
batchFailures.add(new DeleteFailure(deleteResult.blobId, e.getCode()));
} else {
throw new IOException("Failed to process batch delete, non-retryable error for blobId=" + deleteResult.blobId, e);
Contributor


Is it worth accumulating all (within some limit, or summarised sensibly) of the failures in the current batch? Or are we happy to assume if they're there, they'll most likely be for the same reason?

Contributor Author


Sure, I will collect all errors; they already sit in memory.

""".replace("$code", Integer.toString(itemStatus.getStatus()));
responseText
// SDK client will try to parse error as JSON despite these headers. Adding them for HTTP spec consistency.
.append("content-type: application/json")
Contributor


SDK client will try to parse error as JSON despite these headers. Adding them for HTTP spec consistency.

Do you mean the SDK will parse the response as JSON even without these headers? despite makes it sound like the content type indicates it's something other than JSON

Contributor Author


Do you mean the SDK will parse the response as JSON even without these headers?

Yes, as far as I can trace the GCS code there is no check for the content type; it goes straight to JSON parsing. Having these headers does not hurt, and if GCS at some point decides to be strict about part headers, this code will continue to work.

Contributor Author

@mhl-b mhl-b Dec 24, 2025


I removed this comment line to avoid confusion. There is nothing to worry about.

@mhl-b mhl-b merged commit deeb06b into elastic:main Dec 24, 2025
35 checks passed
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Dec 29, 2025
mhl-b added a commit to mhl-b/elasticsearch that referenced this pull request Jan 21, 2026
@mhl-b
Contributor Author

mhl-b commented Jan 21, 2026

💚 All backports created successfully

Status Branch Result
9.3
9.2
9.1

Questions?

Please refer to the Backport tool documentation

mhl-b added a commit to mhl-b/elasticsearch that referenced this pull request Jan 21, 2026
mhl-b added a commit to mhl-b/elasticsearch that referenced this pull request Jan 21, 2026
elasticsearchmachine pushed a commit that referenced this pull request Jan 21, 2026
elasticsearchmachine pushed a commit that referenced this pull request Jan 21, 2026
elasticsearchmachine pushed a commit that referenced this pull request Jan 21, 2026

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement serverless-linked Added by automation, don't add manually Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.4.0

5 participants