Retry batch-delete items in GCS #138951
Conversation

Reviewers: @DaveCTurner, @joshua-adams-1
@nicktindall, I didn't have a chance to review the GCS retry refactoring PR, but I want to verify that we still keep the retry strategy for non-stream calls.
Yes, only the get-blob path should be affected by my changes.
joshua-adams-1
left a comment
After reading #138364 this looks good to me. I'm happy to approve once the CI issues are resolved. Could we also add unit tests for the `deleteBlobs` function?
```java
private static boolean isRetryErrCode(int code) {
    return switch (code) {
        case 408, 429, 500, 502, 503, 504 -> true;
```
- Can we replace these raw values with `HttpURLConnection` constants, like here?
- Is it worth explaining why we can retry on these?
I'd rather we used the names in org.elasticsearch.rest.RestStatus but yes names >> numbers here, and if there's any docs about why we should retry these codes then it'd be great to link them in a comment.
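A sketch of what the named-constant version could look like, using the `java.net.HttpURLConnection` constants (there is no constant for 429 there, so it stays a named local; `org.elasticsearch.rest.RestStatus` names would work equally well). The comments on why each code is retryable are this sketch's interpretation, not text from the PR:

```java
import static java.net.HttpURLConnection.HTTP_BAD_GATEWAY;
import static java.net.HttpURLConnection.HTTP_CLIENT_TIMEOUT;
import static java.net.HttpURLConnection.HTTP_GATEWAY_TIMEOUT;
import static java.net.HttpURLConnection.HTTP_INTERNAL_ERROR;
import static java.net.HttpURLConnection.HTTP_UNAVAILABLE;

public class RetryCodes {
    // HttpURLConnection has no constant for 429 Too Many Requests.
    private static final int HTTP_TOO_MANY_REQUESTS = 429;

    // Timeouts, throttling, and transient server errors are worth retrying;
    // a blob delete is idempotent, so retrying a possibly-completed delete is safe.
    static boolean isRetryErrCode(int code) {
        return switch (code) {
            case HTTP_CLIENT_TIMEOUT,      // 408
                HTTP_TOO_MANY_REQUESTS,    // 429
                HTTP_INTERNAL_ERROR,       // 500
                HTTP_BAD_GATEWAY,          // 502
                HTTP_UNAVAILABLE,          // 503
                HTTP_GATEWAY_TIMEOUT       // 504
                -> true;
            default -> false;
        };
    }

    public static void main(String[] args) {
        System.out.println(isRetryErrCode(503) + " " + isRetryErrCode(404)); // true false
    }
}
```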
```java
if (failedItems.isEmpty() == false) {
    final var retryBlobId = failedItems.getLast().blobId;
    try {
        client().deleteBlob(retryBlobId);
```
Two questions:
- I assume this blocks?
- Does `storage.delete(blobId)` use an exponential back-off retry strategy?
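For context on what "exponential back-off" means for the single-item retry: this is not the SDK's code (the GCS client drives its retries via its own `RetrySettings`), just a plain-Java illustration of a blocking retry loop with capped, jittered exponential delays. All names and numbers here are invented for the sketch:

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffSketch {
    // Illustrative delay: doubles per attempt, capped, with full jitter.
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        long grown = initialMillis << Math.min(attempt, 30); // 2^attempt growth, overflow-guarded
        long capped = Math.min(grown, maxMillis);
        return ThreadLocalRandom.current().nextLong(capped + 1); // jitter in [0, capped]
    }

    interface Op {
        void run() throws Exception;
    }

    // Blocking retry loop: the calling thread sleeps between attempts.
    static void retryBlocking(Op op, int maxAttempts) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                op.run();
                return;
            } catch (Exception e) {
                if (attempt + 1 >= maxAttempts) throw e;
                Thread.sleep(delayMillis(attempt, 100, 5_000));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] remainingFailures = {2}; // fail twice, then succeed
        retryBlocking(() -> {
            if (remainingFailures[0]-- > 0) throw new RuntimeException("simulated 503");
        }, 5);
        System.out.println("succeeded after retries");
    }
}
```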
```java
        // remaining items go the next bulk
        failedItems.removeLast();
    } catch (StorageException e) {
        throw new IOException(
```
Can we also log the other elements in `failedItems`? Otherwise they would fail silently.
Added a list of failed items and their status codes: https://github.com/elastic/elasticsearch/pull/138951/changes#diff-56e89befa62de07551e01d35a1c0d3018ac0b49d6476af5c7bd38fd81f56181bR780
```java
final var retryBlobId = failedItems.getLast().blobId;
try {
    client().deleteBlob(retryBlobId);
    // remaining items go the next bulk
```
Suggested change:
```diff
- // remaining items go the next bulk
+ // remaining items go into the next bulk
```
```java
if (isRetryErrCode(errCode)) {
    failedItems.add(new DeleteFailure(deleteResult.blobId, e.getCode()));
} else {
    throw new IOException("Failed to process bulk delete, non-retryable error for blobId=" + deleteResult.blobId, e);
```
General question: If we fail to delete N blobs, are these subsequently cleaned up?
I think so; there is an occasional cleanup of dangling blobs, but I don't know the exact mechanism.
DaveCTurner
left a comment
Looks sensible to me, but needs supporting changes in GoogleCloudStorageHttpHandler to exercise the retries properly.
Force-pushed from 34d646d to 4216057
Hi @mhl-b, I've created a changelog YAML for you.

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Hi @mhl-b, I've updated the changelog YAML for you.
…h into gcs-bulk-delete-retry
…h into gcs-bulk-delete-retry
```java
if (isRetryErrCode(errCode)) {
    batchFailures.add(new DeleteFailure(deleteResult.blobId, e.getCode()));
} else {
    throw new IOException("Failed to process batch delete, non-retryable error for blobId=" + deleteResult.blobId, e);
```
Is it worth accumulating all (within some limit, or summarised sensibly) of the failures in the current batch? Or are we happy to assume if they're there, they'll most likely be for the same reason?
Sure, I'll collect all the errors; they already sit in memory.
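A sketch of what that aggregation could look like. `DeleteFailure` here is a stand-in for the PR's failure record, and the `limit` cap is an invented parameter to keep the exception message bounded:

```java
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

public class FailureSummary {
    // Stand-in for the PR's failure record (blob id plus HTTP status code).
    record DeleteFailure(String blobId, int code) {}

    // Report every non-retryable failure in the batch, capped so a huge batch
    // cannot produce an unbounded exception message.
    static IOException batchDeleteException(List<DeleteFailure> failures, int limit) {
        String summary = failures.stream()
            .limit(limit)
            .map(f -> f.blobId() + " [" + f.code() + "]")
            .collect(Collectors.joining(", "));
        String suffix = failures.size() > limit ? ", ... (" + failures.size() + " failures total)" : "";
        return new IOException("Failed to process batch delete, non-retryable errors: " + summary + suffix);
    }

    public static void main(String[] args) {
        System.out.println(batchDeleteException(List.of(
            new DeleteFailure("blob-a", 403),
            new DeleteFailure("blob-b", 404),
            new DeleteFailure("blob-c", 404)
        ), 2).getMessage());
    }
}
```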
```java
""".replace("$code", Integer.toString(itemStatus.getStatus()));
responseText
    // SDK client will try to parse error as JSON despite these headers. Adding them for HTTP spec consistency.
    .append("content-type: application/json")
```
> SDK client will try to parse error as JSON despite these headers. Adding them for HTTP spec consistency.

Do you mean the SDK will parse the response as JSON even without these headers? "Despite" makes it sound like the content type indicates it's something other than JSON.
> Do you mean the SDK will parse the response as JSON even without these headers?

Yes, as far as I can trace the GCS code there is no check for the content type; it goes straight to JSON parsing. Having these headers does not hurt, and if GCS at some point decides to be strict about part headers this code will continue to work.
I removed this comment line to avoid confusion. There is nothing to worry about.
(cherry picked from commit deeb06b)
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation.
Add retry logic for batch-delete items in the GCS blob store. The algorithm
maximizes the size of every batch by merging retryable items from the
previous batch with new items. When a batch result has failed items, we
first retry only a single item using the SDK client's retry strategy
(exponential back-off). Retrying a single item should provide enough
time to back off from throttling or temporary GCS failures. Once the
single item retries successfully, we proceed with the next batch,
combining the remaining failures with new items.
Roughly this:
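A minimal sketch of that merging loop, with helper names, the `Deleter` interface, and the tiny batch size all invented for illustration (the PR's actual code differs):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class BatchDeleteSketch {
    static final int BATCH_SIZE = 4; // the real batch limit is much larger

    interface Deleter {
        List<String> deleteBatch(List<String> ids); // returns ids that failed retryably
        void deleteSingle(String id);               // single delete with the SDK's own back-off
    }

    /** Keep every batch as full as possible by merging retryable leftovers with new ids. */
    static void deleteAll(Iterator<String> newIds, Deleter deleter) {
        ArrayDeque<String> pending = new ArrayDeque<>();
        while (newIds.hasNext() || pending.isEmpty() == false) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            while (batch.size() < BATCH_SIZE && pending.isEmpty() == false) {
                batch.add(pending.poll());
            }
            while (batch.size() < BATCH_SIZE && newIds.hasNext()) {
                batch.add(newIds.next());
            }
            List<String> failed = deleter.deleteBatch(batch);
            if (failed.isEmpty() == false) {
                // Retry one failed item via the SDK's exponential back-off; this
                // buys time for throttling or a transient outage to clear.
                deleter.deleteSingle(failed.get(failed.size() - 1));
                // Remaining failures go into the next batch.
                pending.addAll(failed.subList(0, failed.size() - 1));
            }
        }
    }

    /** Demo: "b" and "e" fail once each, then everything ends up deleted. */
    static List<String> demo() {
        List<String> deleted = new ArrayList<>();
        Set<String> failOnce = new HashSet<>(List.of("b", "e"));
        Deleter fake = new Deleter() {
            public List<String> deleteBatch(List<String> ids) {
                List<String> failed = new ArrayList<>();
                for (String id : ids) {
                    if (failOnce.remove(id)) failed.add(id);
                    else deleted.add(id);
                }
                return failed;
            }
            public void deleteSingle(String id) {
                deleted.add(id);
            }
        };
        deleteAll(List.of("a", "b", "c", "d", "e", "f").iterator(), fake);
        return deleted;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // all six ids end up deleted
    }
}
```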
Also extends the test fixture to randomly fail individual items in a batch.
Fixes #138364