
Retry batch-delete items in GCS #138951

Merged
mhl-b merged 20 commits into elastic:main from mhl-b:gcs-bulk-delete-retry
Dec 24, 2025

Conversation

@mhl-b
Contributor

@mhl-b mhl-b commented Dec 3, 2025

Add retry logic for batch-delete items in GCS blob store. Algorithm
maximizes the size of every batch by merging retryable items from the
previous batch and new items. When batch results have failed items, we
first retry only a single item using the SDK client's retry strategy
(exponential-backoff). Retrying a single item should provide enough
time to back-off from throttling or temporary GCS failures. Once a
single item successfully retries, we proceed with the next batch,
combining the remaining failures and new items.

Roughly this:

  • create batch of 100 new items
  • submit batch
  • receive 100 results, with 10 retryable failures
  • retry 1 failure with a non-batched delete using the SDK retry strategy
  • (loop) create batch from 9 remaining failures and 91 new items
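The loop above can be sketched as follows. This is an illustration only, with hypothetical names (`submitBatch`, `retrySingleWithBackoff`, `failOnce`) standing in for the actual GCS SDK calls; only the control flow mirrors the description:

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the batch-merging retry loop described above.
public class BatchDeleteSketch {
    static final int BATCH_SIZE = 100;
    final Set<String> deleted = new HashSet<>();
    final Set<String> failOnce = new HashSet<>(); // simulate retryable failures on first attempt

    // Stand-in for the GCS batch call: deletes what it can and returns the
    // items that failed with a retryable error.
    List<String> submitBatch(List<String> batch) {
        List<String> failures = new ArrayList<>();
        for (String id : batch) {
            if (failOnce.remove(id)) {
                failures.add(id); // e.g. a simulated 503 on the first attempt
            } else {
                deleted.add(id);
            }
        }
        return failures;
    }

    // Stand-in for a non-batched delete; the real client applies the SDK's
    // exponential-backoff retry strategy here.
    void retrySingleWithBackoff(String id) {
        deleted.add(id);
    }

    void drainAndDelete(Deque<String> pending) {
        List<String> retryable = new ArrayList<>();
        while (pending.isEmpty() == false || retryable.isEmpty() == false) {
            // Merge leftover retryable failures with new items to fill the batch.
            List<String> batch = new ArrayList<>(retryable);
            retryable.clear();
            while (batch.size() < BATCH_SIZE && pending.isEmpty() == false) {
                batch.add(pending.removeFirst());
            }
            retryable.addAll(submitBatch(batch));
            if (retryable.isEmpty() == false) {
                // Retry exactly one failure non-batched before the next round,
                // giving the service time to back off from throttling.
                retrySingleWithBackoff(retryable.removeLast());
            }
        }
    }
}
```

Note the sketch assumes every failure is retryable; the real change throws on non-retryable errors, which also guarantees the loop terminates.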

Also extend test fixture to randomly fail individual items in batch.

fix #138364

@mhl-b
Contributor Author

mhl-b commented Dec 3, 2025

@DaveCTurner, @joshua-adams-1
Using a draft to align on the approach for retrying bulk items.

@mhl-b mhl-b requested a review from nicktindall December 3, 2025 00:08
@mhl-b
Contributor Author

mhl-b commented Dec 3, 2025

@nicktindall, I didn't have a chance to review the GCS retry refactoring PR, but I want to verify that we still keep the retry strategy for non-stream calls.

@nicktindall
Contributor

@nicktindall, I didn't have a chance to review the GCS retry refactoring PR, but I want to verify that we still keep the retry strategy for non-stream calls.

Yes, only get-blob should be affected by my changes.

Contributor

@joshua-adams-1 joshua-adams-1 left a comment


After reading #138364 this looks good to me. I'm happy to approve once the CI issues are resolved. Could we also add unit tests for the deleteBlobs function?


private static boolean isRetryErrCode(int code) {
return switch (code) {
case 408, 429, 500, 502, 503, 504 -> true;
Contributor


  1. Can we replace these raw values with HttpURLConnection like here
  2. Is it worth explaining why we can retry on this?
Contributor


I'd rather we used the names in org.elasticsearch.rest.RestStatus but yes names >> numbers here, and if there's any docs about why we should retry these codes then it'd be great to link them in a comment.
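The reviewers' suggestion, named constants instead of raw codes, could look roughly like this sketch. `java.net.HttpURLConnection` has no constant for 429 (Too Many Requests), so that one stays a literal here; in the actual codebase `org.elasticsearch.rest.RestStatus` would cover all six. This is an illustration, not the PR's code:

```java
import static java.net.HttpURLConnection.HTTP_BAD_GATEWAY;
import static java.net.HttpURLConnection.HTTP_CLIENT_TIMEOUT;
import static java.net.HttpURLConnection.HTTP_GATEWAY_TIMEOUT;
import static java.net.HttpURLConnection.HTTP_INTERNAL_ERROR;
import static java.net.HttpURLConnection.HTTP_UNAVAILABLE;

public class RetryCodes {
    // Transient conditions: timeouts, throttling, and server-side errors
    // are worth retrying; anything else (e.g. 403, 404) is permanent.
    public static boolean isRetryErrCode(int code) {
        return switch (code) {
            case HTTP_CLIENT_TIMEOUT,   // 408
                429,                    // Too Many Requests: no HttpURLConnection constant
                HTTP_INTERNAL_ERROR,    // 500
                HTTP_BAD_GATEWAY,       // 502
                HTTP_UNAVAILABLE,       // 503
                HTTP_GATEWAY_TIMEOUT    // 504
                -> true;
            default -> false;
        };
    }
}
```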

Contributor Author


if (failedItems.isEmpty() == false) {
final var retryBlobId = failedItems.getLast().blobId;
try {
client().deleteBlob(retryBlobId);
Contributor


Two questions:

  1. I assume this blocks?
  2. Does storage.delete(blobId); use an exponential back off retry strategy?
Contributor Author


Yes and yes
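As an illustration of the behaviour being confirmed here, a hand-rolled exponential-backoff wrapper looks like the sketch below. The real GCS client configures this through its own retry settings; this helper and its names are hypothetical:

```java
import java.util.function.Supplier;

public class Backoff {
    // Calls `call`, retrying on RuntimeException up to maxAttempts times,
    // doubling the wait between attempts (exponential backoff).
    public static <T> T withBackoff(Supplier<T> call, int maxAttempts, long initialDelayMs)
        throws InterruptedException {
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted; surface the last error
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // double the wait before the next attempt
            }
        }
    }
}
```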

// remaining items go the next bulk
failedItems.removeLast();
} catch (StorageException e) {
throw new IOException(
Contributor


Can we also log the other elements in failedItems? Otherwise they would fail quietly

Contributor Author

@mhl-b mhl-b Dec 19, 2025


final var retryBlobId = failedItems.getLast().blobId;
try {
client().deleteBlob(retryBlobId);
// remaining items go the next bulk
Contributor


Suggested change
// remaining items go the next bulk
// remaining items go into the next bulk
if (isRetryErrCode(errCode)) {
failedItems.add(new DeleteFailure(deleteResult.blobId, e.getCode()));
} else {
throw new IOException("Failed to process bulk delete, non-retryable error for blobId=" + deleteResult.blobId, e);
Contributor


General question: If we fail to delete N blobs, are these subsequently cleaned up?

Contributor Author


I think so; there is an occasional cleanup of dangling blobs, but I don't know exactly.

Contributor

@DaveCTurner DaveCTurner left a comment


Looks sensible to me, but needs supporting changes in GoogleCloudStorageHttpHandler to exercise the retries properly.


private static boolean isRetryErrCode(int code) {
return switch (code) {
case 408, 429, 500, 502, 503, 504 -> true;
Contributor


I'd rather we used the names in org.elasticsearch.rest.RestStatus but yes names >> numbers here, and if there's any docs about why we should retry these codes then it'd be great to link them in a comment.

@mhl-b mhl-b force-pushed the gcs-bulk-delete-retry branch from 34d646d to 4216057 on December 19, 2025 04:43
@mhl-b mhl-b added >enhancement :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. labels Dec 19, 2025
@elasticsearchmachine
Collaborator

Hi @mhl-b, I've created a changelog YAML for you.

@mhl-b mhl-b marked this pull request as ready for review December 19, 2025 05:04
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@mhl-b mhl-b changed the title from Retry bulk-delete items in GCS to Retry batch-delete items in GCS on Dec 19, 2025
@elasticsearchmachine
Collaborator

Hi @mhl-b, I've updated the changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Dec 20, 2025
Contributor

@nicktindall nicktindall left a comment


LGTM

if (isRetryErrCode(errCode)) {
batchFailures.add(new DeleteFailure(deleteResult.blobId, e.getCode()));
} else {
throw new IOException("Failed to process batch delete, non-retryable error for blobId=" + deleteResult.blobId, e);
Contributor


Is it worth accumulating all (within some limit, or summarised sensibly) of the failures in the current batch? Or are we happy to assume if they're there, they'll most likely be for the same reason?

Contributor Author


Sure, I will collect all errors; they already sit in memory.

""".replace("$code", Integer.toString(itemStatus.getStatus()));
responseText
// SDK client will try to parse error as JSON despite these headers. Adding them for HTTP spec consistency.
.append("content-type: application/json")
Contributor


SDK client will try to parse error as JSON despite these headers. Adding them for HTTP spec consistency.

Do you mean the SDK will parse the response as JSON even without these headers? despite makes it sound like the content type indicates it's something other than JSON

Contributor Author


Do you mean the SDK will parse the response as JSON even without these headers?

Yes, as far as I can trace the GCS code there is no check for the content type; it goes straight to JSON parsing. Having these headers does not hurt, and if GCS at some point decides to be strict about part headers, this code will continue to work.

Contributor Author

@mhl-b mhl-b Dec 24, 2025


I removed this comment line to avoid confusion. There is nothing to worry about.

@mhl-b mhl-b merged commit deeb06b into elastic:main Dec 24, 2025
35 checks passed
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Dec 29, 2025
mhl-b added a commit to mhl-b/elasticsearch that referenced this pull request Jan 21, 2026
@mhl-b
Contributor Author

mhl-b commented Jan 21, 2026

💚 All backports created successfully

Status Branch Result
9.3
9.2
9.1

Questions?

Please refer to the Backport tool documentation

mhl-b added a commit to mhl-b/elasticsearch that referenced this pull request Jan 21, 2026
mhl-b added a commit to mhl-b/elasticsearch that referenced this pull request Jan 21, 2026
elasticsearchmachine pushed a commit that referenced this pull request Jan 21, 2026
elasticsearchmachine pushed a commit that referenced this pull request Jan 21, 2026
elasticsearchmachine pushed a commit that referenced this pull request Jan 21, 2026

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement serverless-linked Added by automation, don't add manually Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.4.0

5 participants