Skip to content

Add a new setting for s3 API call timeout#138072

Merged
elasticsearchmachine merged 17 commits intoelastic:mainfrom
ywangd:s3-write-hanging-fix
Dec 4, 2025
Merged

Add a new setting for s3 API call timeout#138072
elasticsearchmachine merged 17 commits intoelastic:mainfrom
ywangd:s3-write-hanging-fix

Conversation

@ywangd
Copy link
Member

@ywangd ywangd commented Nov 14, 2025

This PR adds a new setting to configure AWS SDK API call timeout (https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/timeouts.html). Defaults to -1, i.e. no timeout.

@ywangd ywangd added >non-issue :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.3.0 labels Nov 14, 2025
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this'd be better as an internal-cluster test akin to e.g. S3BlobStoreRepositoryTests or S3BlobContainerRetriesTests. That way we can call writeBlob directly, rather than having to construct a shard of the right sort of size to trigger such a call during the snapshot, and also we can cancel the sleep when the request times out.

@ywangd ywangd marked this pull request as ready for review November 28, 2025 07:34
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Nov 28, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Copy link
Member Author

ywangd commented Nov 28, 2025

Replaced the REST test with an internal cluster test as suggested in 3c1a680

Also added a new api_call_timeout setting. It defaults to 0, i.e. disabled, so that it is a no-op for existing clusters. We can look into enabling it separately.

@ywangd ywangd requested a review from DaveCTurner November 28, 2025 07:37
@ywangd ywangd changed the title Test for S3 upload time unresponsiveness Nov 28, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @ywangd, I've created a changelog YAML for you.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions/nits but otherwise looks like a sensible approach to me.

new BytesArray(randomBytes((int) ByteSizeValue.ofMb(10).getBytes())),
randomBoolean()
);
fail("should have timed out");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use expectThrows() here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This to ensure the writeBlob cannot succeed. So there is no exception to expect if the code reaches here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the whole try {...; fail();} catch (...) construct, not just this one line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes pushed 6afe3d0

} catch (IOException e) {
final var cause = ExceptionsHelper.unwrap(e, ApiCallTimeoutException.class);
assertNotNull(cause);
assertThat(cause.getMessage(), containsString("Client execution did not complete before the specified timeout configuration"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we assert something about how these outcomes are captured in the metrics? Should we add something specific to S3RepositoriesMetrics to track this case separately from other exceptions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately it seems there is no unambiguous metric for this exception. First of all, the error types available in the metric are coarse grained. The relevant one should be ClientTimeout but it does not tell exactly which client side timeout it is. Secondly, this seems to be a bug in the SDK, the error type reported in this case is Other instead of ClientTimeout because the underlying exception is an InterruptedException which gets translated into Other.

I remember the sdk v1 allows access to request and response objects at metric reporting time which would be helpful in this case. But that is no longer possible.

With what we have now, I think what we can do is to add an additional metric label for error type when recording number of errors. Though it does not tell us definitely about the exact error, it at least gives us some hints on what have happened by correlating with the API call duration. If AWS fixes the error type for API call timeout, it should also benefit from it automatically. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes seems like a bug indeed, the code you link explicitly mentions ApiCallTimeoutException and ApiCallAttemptTimeoutException but there is some missing wrapping somewhere it seems.

And yes I think breaking down the metrics by error type would be helpful. There's only 5 of them.

But in the context of this change could we just say how it behaves today with respect to the existing metrics? That way, if things do change in future we'll at least be aware of the new behaviour.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed 28d5ffd Please let me know if it matches your suggestion. Thanks!

headerDecodedContentLength
);
try {
final var released = latch.await(60, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use safeAwait()? At least, I'd like us to be asserting that the latch is released in a reasonably timely manner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to safeAwait in ed85e84

I didn't use safeAwait because the test ends faster in failure scenario without it. In failure scenario, safeAwait will kill the server and leave the client hanging till the test timeout in 30m. But I guess we are more interested in the success path and ensure things are released in time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah interesting.

It used to be that this HTTP server would close the connection on an AssertionError here, see #68967, but this in fact only happens if we don't call httpServer.setExecutor() and instead just run the handling on the network thread. Now that we're forking this work to ESMockAPIBasedRepositoryIntegTestCase#executorService we are indeed not handling assertion errors in the handler correctly. We do at least bubble the assertion error up to the top level so that it fails the test now, unlike the inline-execution case where the AssertionError is swallowed, but we should also find some way to abort the in-flight requests too.

A separate and broader concern than this issue however.

@ywangd ywangd requested a review from DaveCTurner December 2, 2025 05:40
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (one comment but it's optional, no need for another review either way)

}
clientOverrideConfiguration.retryStrategy(retryStrategyBuilder.build());
final long apiCallTimeoutMillis = clientSettings.apiCallTimeout.millis();
if (apiCallTimeoutMillis > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: would prefer to pass zero down to the SDK verbatim, even though it rejects it today. But I don't feel strongly about it in this particular case, it's a tiny thing, just more of a general practice for distinguishing "don't pass the value along" vs "pass along a zero".

Suggested change
if (apiCallTimeoutMillis > 0) {
if (apiCallTimeoutMillis >= 0) {
@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Dec 4, 2025
@elasticsearchmachine elasticsearchmachine merged commit 2473604 into elastic:main Dec 4, 2025
34 checks passed
@ywangd ywangd deleted the s3-write-hanging-fix branch December 4, 2025 08:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.3.0

3 participants