
Use common retry logic for GCS #138553

Merged

nicktindall merged 44 commits into elastic:main from nicktindall:use_common_retry_logic_gcs
Dec 12, 2025
Conversation

@nicktindall (Contributor) commented Nov 25, 2025

This is the second step in breaking up and merging #136663.

I chose to do GCS next because it introduces safe-resume (where we remember a version of the blob we were downloading so we can request specifically that one when we resume). This will mean less refactoring than if we'd done Azure first.

I didn't implement that logic for S3, although it's trivial. I will do that in a subsequent change.
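The safe-resume idea described above can be sketched as follows. This is a hypothetical illustration, not the actual Elasticsearch or GCS SDK API (`SafeResumeState` and its methods are made up for this sketch): record the blob generation observed on the first successful open, then pin every resume to that generation so a concurrent overwrite fails loudly instead of being silently spliced into the stream.

```java
// Hypothetical sketch of "safe-resume": remember the blob generation seen on
// the first successful open, then verify it on every resume so a concurrent
// overwrite is detected rather than mixed into the download.
final class SafeResumeState {
    private Long generation; // null until the first successful open

    // Record the generation returned by the first open; verify it on resumes.
    void onOpened(long observedGeneration) {
        if (generation == null) {
            generation = observedGeneration;
        } else if (generation != observedGeneration) {
            throw new IllegalStateException(
                "blob changed during download: expected generation " + generation
                    + " but server returned " + observedGeneration);
        }
    }

    // Generation to request when resuming, or null ("latest") for the first open.
    Long generationToRequest() {
        return generation;
    }
}
```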

.setRetryDelayMultiplier(options.getRetrySettings().getRetryDelayMultiplier())
.setMaxRetryDelay(Duration.ofSeconds(1L))
.setMaxAttempts(0)
.setJittered(false)
@nicktindall (Contributor, Author) commented Nov 25, 2025:
This test originally configured retries to be time-based (i.e. no limit on the attempts, just keep retrying for some amount of time). I changed it to just make the retry intervals small and depend on the configured retry limits because we don't support time-based retries anymore.
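The attempt-bounded behaviour described above can be sketched like this. It is a minimal illustration of the shape of the settings in the snippet (an exponential delay multiplier, a capped maximum delay, no jitter, and a hard attempt limit instead of a time budget); `Backoff` is a hypothetical helper, not the SDK's `RetrySettings` implementation.

```java
// Illustrative attempt-bounded exponential backoff: delay grows by a
// multiplier each attempt, is capped at a maximum, and retrying stops once
// the attempt limit is reached (no time-based budget).
final class Backoff {
    // Delay before the given attempt (1-based), capped at maxMillis.
    static long delayMillis(int attempt, long initialMillis, double multiplier, long maxMillis) {
        double delay = initialMillis * Math.pow(multiplier, attempt - 1);
        return Math.min((long) delay, maxMillis);
    }

    // Retry only while we have attempts left; attemptsMade counts tries so far.
    static boolean shouldRetry(int attemptsMade, int maxAttempts) {
        return attemptsMade < maxAttempts;
    }
}
```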

container.writeBlob(randomPurpose(), blobKey, new BytesArray(initialValue), true);

try (InputStream inputStream = container.readBlob(randomPurpose(), blobKey)) {
try (InputStream inputStream = container.readBlob(randomRetryingPurpose(), blobKey)) {
@nicktindall (Contributor, Author) commented Nov 25, 2025:
We have to be careful where we use randomPurpose() now, because some purposes no longer retry (e.g. REPOSITORY_ANALYSIS).

@Override
public long getMeaningfulProgressSize() {
return Math.max(1L, GoogleCloudStorageBlobStore.SDK_DEFAULT_CHUNK_SIZE / 100L);
}
@nicktindall (Contributor, Author) commented:
The choice of this value is somewhat arbitrary; I'm open to suggestions on whether we should make this consistent across CSPs or use some other value here. The SDK default chunk size is 16 MB, so this is roughly 160 KB.
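For reference, the arithmetic behind the comment above, assuming a 16 MiB chunk: one hundredth of 16,777,216 bytes is 167,772 bytes, i.e. roughly the "160KB" quoted (which is the round decimal figure). The constant here is local to the sketch, not the real GoogleCloudStorageBlobStore field.

```java
// Sketch of the "meaningful progress" threshold: 1% of the SDK default chunk
// size, with a floor of 1 byte. SDK_DEFAULT_CHUNK_SIZE is assumed here.
final class ProgressSize {
    static final long SDK_DEFAULT_CHUNK_SIZE = 16L * 1024 * 1024; // 16 MiB

    static long meaningfulProgressSize() {
        return Math.max(1L, SDK_DEFAULT_CHUNK_SIZE / 100L);
    }
}
```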

@nicktindall added the :Distributed/Snapshot/Restore, >enhancement, and >non-issue labels on Nov 25, 2025
PREFIX,
"max_retries",
(key) -> Setting.intSetting(key, 5, 0, Setting.Property.NodeScope)
);
@nicktindall (Contributor, Author) commented:
We never configured the number of retries for GCS previously. The default settings allowed 6 attempts (i.e. 5 retries).
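A quick check of the retries-vs-attempts convention mentioned above (an illustrative helper, not the actual setting code): a max_retries of 5 means 6 attempts in total, the initial attempt plus five retries.

```java
// Retries vs attempts: "retries" excludes the initial attempt, so the total
// number of attempts is always one more than the retry count.
final class Attempts {
    static int totalAttempts(int maxRetries) {
        return 1 + maxRetries;
    }
}
```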

private long currentOffset;
private boolean closed;
private Long lastGeneration;
private static final StorageRetryStrategy STORAGE_RETRY_STRATEGY = GoogleCloudStorageService.createStorageRetryStrategy();
@nicktindall (Contributor, Author) commented:
The one we use is stateless; we might need to rethink this lifecycle if we switch to one that is not. As far as I could see, you can't get it out of the client or StorageOptions.

Member commented:
I don't know how important this is. If necessary, I think we can store the original strategy object on the MeteredStorage object so that we can get it in this class?

@nicktindall (Contributor, Author) commented:
I went back to using BaseService.EXCEPTION_HANDLER for the outer retries during open. This is consistent with the existing behaviour (more permissive retries on the inner loop, slightly less permissive on the outer).

}
return n;
} catch (IOException e) {
throw StorageException.translate(e);
@nicktindall (Contributor, Author) commented Nov 26, 2025:
We translate these for consistency with the existing implementation. As before, we retry anything when reading, but when something goes wrong the translation might add some more context to the stack trace.
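The translation pattern described above can be sketched as follows. `TranslatedStorageException` is a hypothetical stand-in for the SDK's `StorageException.translate`; the point is that the low-level `IOException` is wrapped so the stack trace carries blob-store context while the original exception is preserved as the cause.

```java
// Sketch of exception translation: wrap the IOException to add context to the
// stack trace, keeping the original exception reachable via getCause().
final class TranslatedStorageException extends RuntimeException {
    TranslatedStorageException(java.io.IOException cause) {
        super("blob read failed: " + cause.getMessage(), cause);
    }
}
```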

…c_gcs

# Conflicts:
#	modules/repository-gcs/src/internalClusterTest/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageBlobStoreRepositoryTests.java
#	modules/repository-gcs/src/test/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageBlobContainerRetriesTests.java
#	server/src/test/java/org/elasticsearch/common/blobstore/RetryingInputStreamTests.java
@nicktindall nicktindall requested a review from ywangd December 1, 2025 22:51
@ywangd (Member) left a comment:

I have only minor comments. I don't think I'll find anything, but I'd like some time to read it through again. Thanks!

Comment on lines 219 to 220
@Override
public boolean equals(Object o) {
Member commented:
I think we should update both equals and hashCode?

@nicktindall (Contributor, Author) commented:
Good call, added in 735a66e

Comment on lines 118 to 119
} catch (NoSuchFileException | RequestedRangeNotSatisfiedException e) {
throw e;
Member commented:
We probably missed it last time? Both S3 and GCS add failures as suppressed for these two exceptions as well. If so, I think we should preserve the behaviour.

@nicktindall (Contributor, Author) commented:
Oops, added back in 2f2ce0c

private <T extends Exception> long maybeLogAndComputeRetryDelay(String action, T e) throws T {
if (shouldRetry(attempt) == false) {
private <T extends Exception> long maybeLogAndComputeRetryDelay(StreamAction action, T e) throws T {
if (blobStoreServices.isRetryableException(action, e) == false || shouldRetry(attempt) == false) {
Member commented:
Nit: can we move the blobStoreServices.isRetryableException check into shouldRetry as well?

@nicktindall (Contributor, Author) commented:
Good call, moved in 0687495
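The refactor discussed above can be sketched like this: fold the "is this exception retryable at all?" check into a single gate alongside the attempt limit, so callers only consult one predicate. `RetryGate` and its parameters are illustrative, not the actual BlobStoreServices or shouldRetry signatures.

```java
// Sketch of the combined retry gate: a non-retryable exception short-circuits
// before the attempt count is even consulted.
final class RetryGate {
    static boolean shouldRetry(int attempt, int maxAttempts, boolean retryableException) {
        return retryableException && attempt <= maxAttempts;
    }
}
```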

Comment on lines 411 to 416
public static boolean willRetry(OperationPurpose purpose) {
return purpose != OperationPurpose.REPOSITORY_ANALYSIS;
}

public static boolean willRetryForever(OperationPurpose purpose) {
return purpose == OperationPurpose.INDICES;
Member commented:
These methods are used in tests only. So I think we should either move them to test code or somehow use them in shouldRetry to justify them being here?

@nicktindall (Contributor, Author) commented:
Actually they're only used in one place, and it's a natural home for this logic; I moved them there in d444f89.

Comment on lines 238 to 243
if (position < 0L) {
throw new IllegalArgumentException("position must be non-negative");
}
if (length < 0) {
throw new IllegalArgumentException("length must be non-negative");
}
Member commented:
I think these are already handled in production code?

@nicktindall (Contributor, Author) commented:
Yeah, I just duplicated them here because I didn't want to lose that validation (we don't call super here; I overrode the method so I could also override getRetryDelayInMillis). I changed them to assertions in 2f7cd06

If I've understood correctly?

Member commented:
I meant that the same checks are done in RetryingInputStream's constructor which is invoked in the else branch. That said, it's not called when length == 0. Assertions are fine.

Comment on lines 1462 to 1469
protected OperationPurpose randomRetryingPurpose() {
return randomValueOtherThan(OperationPurpose.REPOSITORY_ANALYSIS, BlobStoreTestUtil::randomPurpose);
return BlobStoreTestUtil.randomRetryingPurpose();
}

@Override
protected OperationPurpose randomFiniteRetryingPurpose() {
return randomValueOtherThanMany(
purpose -> purpose == OperationPurpose.REPOSITORY_ANALYSIS || purpose == OperationPurpose.INDICES,
BlobStoreTestUtil::randomPurpose
);
return BlobStoreTestUtil.randomFiniteRetryingPurpose();
}
Member commented:
Can't we have these methods defined in AbstractBlobContainerRetriesTestCase? Same for GoogleCloudStorageBlobContainerRetriesTests?

@nicktindall (Contributor, Author) commented:
They're now also used in classes that aren't descendants of AbstractBlobContainerRetriesTestCase, so that reuse is easier if they're available as static utilities.

The old pattern was to define them in AbstractBlobContainerRetriesTestCase and override them in the subclasses, because the operation-purpose-specific logic was unique to S3. Once this and the Azure implementation are done, the behaviour will be consistent and we can drop the inheritance and overrides and just use the static methods everywhere.

Comment on lines +200 to +201
if (getAttempts() > 1) {
assertEquals(eTag, version);
Member commented:
Is this branch not tested since maxRetries is 0?

@nicktindall (Contributor, Author) commented:
It was, because I had used OperationPurpose.INDICES which retries infinitely regardless of the configured max retries. I changed it to use a random retrying purpose and configured max retries accordingly. It confused even me when I looked at it again :)

see 52f8659

@nicktindall nicktindall requested a review from ywangd December 10, 2025 07:31
@ywangd (Member) left a comment:
I have two more questions.

if (offset > 0 || start > 0 || end < Long.MAX_VALUE - 1) {
assert start + offset <= end : "requesting beyond end, start = " + start + " offset=" + offset + " end=" + end;
}
// noinspection TryWithIdenticalCatches
Member commented:
TIL

Comment on lines 120 to 124
} catch (RuntimeException e) {
if (attempt == 1) {
blobStoreServices.onRetryStarted("open");
}
final long delayInMillis = maybeLogAndComputeRetryDelay("opening", e);
delayBeforeRetry(delayInMillis);
retryOrAbortOnOpen(e);
} catch (IOException e) {
retryOrAbortOnOpen(e);
}
Member commented:
It seems impossible for GoogleCloudStorageRetryingInputStream to throw IOException on opening, other than NoSuchFileException and RequestedRangeNotSatisfiedException, which are handled separately. S3RetryingInputStream currently does not retry on IOException. If so, do we need this separate handling of IOException? Or is it for future Azure changes?

@nicktindall (Contributor, Author) commented:
Yeah, the Azure implementation does throw IOException if we migrate it as-is, but looking at it, it's really just used as a catch-all. I'd be keen to minimise the differences until the dust has settled on these changes.

Also, as long as BlobStoreServices#getInputStream throws IOException, static analysis will require us to keep it.

Perhaps after Azure, we can change those exceptions to be more specific?

@ywangd (Member) left a comment:

LGTM

@nicktindall nicktindall merged commit c0e1c56 into elastic:main Dec 12, 2025
34 checks passed
@nicktindall nicktindall deleted the use_common_retry_logic_gcs branch December 12, 2025 07:05

Labels

:Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
>enhancement
Team:Distributed Coordination (obsolete meta label for the Distributed Coordination team; please do not use)
v9.3.0

4 participants