Add a new setting for s3 API call timeout by ywangd · Pull Request #138072 · elastic/elasticsearch

ywangd · 2025-11-14T06:22:44Z

This PR adds a new setting to configure the AWS SDK API call timeout (https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/timeouts.html). It defaults to -1, i.e. no timeout.

DaveCTurner

I think this'd be better as an internal-cluster test akin to e.g. S3BlobStoreRepositoryTests or S3BlobContainerRetriesTests. That way we can call writeBlob directly, rather than having to construct a shard of the right sort of size to trigger such a call during the snapshot, and also we can cancel the sleep when the request times out.

elasticsearchmachine · 2025-11-28T07:34:24Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

ywangd · 2025-11-28T07:37:19Z

Replaced the REST test with an internal cluster test, as suggested, in 3c1a680.

Also added a new api_call_timeout setting. It defaults to 0, i.e. disabled, so that it is a no-op for existing clusters. We can look into enabling it separately.

elasticsearchmachine · 2025-11-28T07:38:30Z

Hi @ywangd, I've created a changelog YAML for you.

DaveCTurner

Some suggestions/nits but otherwise looks like a sensible approach to me.

modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3ClientSettings.java

DaveCTurner · 2025-11-28T09:00:47Z

...nalClusterTest/java/org/elasticsearch/repositories/s3/S3BlobStoreRepositoryTimeoutTests.java

+                new BytesArray(randomBytes((int) ByteSizeValue.ofMb(10).getBytes())),
+                randomBoolean()
+            );
+            fail("should have timed out");


Could we use expectThrows() here?

This is to ensure that writeBlob cannot succeed, so there is no exception to expect if the code reaches here?

I mean the whole try {...; fail();} catch (...) construct, not just this one line.

Ah yes, pushed 6afe3d0.
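To illustrate the point about replacing the whole try {...; fail();} catch (...) construct, here is a minimal, dependency-free sketch of how an expectThrows-style helper behaves. The class ExpectThrowsSketch, the ThrowingRunnable interface, and writeBlobThatTimesOut are hypothetical names made up for this example; this is not the test framework's actual implementation nor the code in the PR.

```java
import java.io.IOException;

/** Hypothetical sketch of an expectThrows-style helper; names are illustrative only. */
final class ExpectThrowsSketch {

    /** A block of code that may throw any exception. */
    @FunctionalInterface
    interface ThrowingRunnable {
        void run() throws Throwable;
    }

    /**
     * Runs the block and returns the thrown exception if it has the expected type.
     * Fails with an AssertionError if nothing is thrown or the type does not match.
     */
    static <T extends Throwable> T expectThrows(Class<T> expectedType, ThrowingRunnable block) {
        try {
            block.run();
        } catch (Throwable t) {
            if (expectedType.isInstance(t)) {
                return expectedType.cast(t);
            }
            throw new AssertionError("expected " + expectedType.getSimpleName() + " but got " + t, t);
        }
        // Reached only when the block completed normally, which replaces the explicit fail() call.
        throw new AssertionError("expected " + expectedType.getSimpleName() + " but nothing was thrown");
    }

    /** Stand-in for a write that always fails (assumption for the example). */
    static void writeBlobThatTimesOut() throws IOException {
        throw new IOException("upload failed", new RuntimeException("api call timed out"));
    }

    public static void main(String[] args) {
        IOException e = expectThrows(IOException.class, ExpectThrowsSketch::writeBlobThatTimesOut);
        // The returned exception can then be inspected, e.g. by unwrapping its cause.
        System.out.println(e.getCause().getMessage());
    }
}
```

The returned exception takes the place of the catch block, so follow-up assertions on the cause read as straight-line code.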

DaveCTurner · 2025-11-28T09:02:40Z

...nalClusterTest/java/org/elasticsearch/repositories/s3/S3BlobStoreRepositoryTimeoutTests.java

+        } catch (IOException e) {
+            final var cause = ExceptionsHelper.unwrap(e, ApiCallTimeoutException.class);
+            assertNotNull(cause);
+            assertThat(cause.getMessage(), containsString("Client execution did not complete before the specified timeout configuration"));


Could we assert something about how these outcomes are captured in the metrics? Should we add something specific to S3RepositoriesMetrics to track this case separately from other exceptions?

Unfortunately, it seems there is no unambiguous metric for this exception. First of all, the error types available in the metric are coarse-grained. The relevant one should be ClientTimeout, but it does not tell us exactly which client-side timeout it is. Secondly, and this seems to be a bug in the SDK, the error type reported in this case is Other instead of ClientTimeout, because the underlying exception is an InterruptedException, which gets translated into Other.

I remember that SDK v1 allowed access to the request and response objects at metric-reporting time, which would be helpful in this case. But that is no longer possible.

With what we have now, I think what we can do is add an additional metric label for the error type when recording the number of errors. Though it does not tell us definitively what the exact error was, it at least gives us some hints about what happened when correlated with the API call duration. If AWS fixes the error type for the API call timeout, we should also benefit from that automatically. What do you think?

Yes, it does seem like a bug: the code you link explicitly mentions ApiCallTimeoutException and ApiCallAttemptTimeoutException, but it seems some wrapping is missing somewhere.

And yes, I think breaking down the metrics by error type would be helpful. There are only 5 of them.

But in the context of this change could we just say how it behaves today with respect to the existing metrics? That way, if things do change in future we'll at least be aware of the new behaviour.

Pushed 28d5ffd. Please let me know if it matches your suggestion. Thanks!
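The idea of labelling the error count by error type can be sketched roughly as follows. This is a self-contained illustration only: the class ErrorTypeCounterSketch and its methods are hypothetical and do not reflect S3RepositoriesMetrics or the change pushed in 28d5ffd. The label strings "Other" and "ClientTimeout" are the ones named in the discussion above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch: an error counter broken down by an error-type label. */
final class ErrorTypeCounterSketch {

    // One counter per error-type label; LongAdder keeps concurrent increments cheap.
    private final Map<String, LongAdder> errorsByType = new ConcurrentHashMap<>();

    /** Records one failed API call under the given error-type label. */
    void recordError(String errorType) {
        errorsByType.computeIfAbsent(errorType, k -> new LongAdder()).increment();
    }

    /** Returns the number of errors recorded for the label, or 0 if none were recorded. */
    long count(String errorType) {
        LongAdder adder = errorsByType.get(errorType);
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        ErrorTypeCounterSketch metrics = new ErrorTypeCounterSketch();
        // Per the thread, an API call timeout is reported as "Other" today,
        // so a test can pin down that behaviour and notice if it changes later.
        metrics.recordError("Other");
        System.out.println("Other=" + metrics.count("Other")
            + ", ClientTimeout=" + metrics.count("ClientTimeout"));
    }
}
```

Asserting on the label that is observed today, rather than the one that would be ideal, is what makes a future SDK fix visible as a test change.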

DaveCTurner · 2025-11-28T09:04:00Z

...nalClusterTest/java/org/elasticsearch/repositories/s3/S3BlobStoreRepositoryTimeoutTests.java

+                    headerDecodedContentLength
+                );
+                try {
+                    final var released = latch.await(60, TimeUnit.SECONDS);


Could we use safeAwait()? At least, I'd like us to be asserting that the latch is released in a reasonably timely manner.

I changed it to safeAwait in ed85e84.

I didn't use safeAwait originally because the test ends faster in the failure scenario without it. In the failure scenario, safeAwait will kill the server and leave the client hanging until the test times out after 30m. But I guess we are more interested in the success path and in ensuring that things are released in time.

Ah interesting.

It used to be that this HTTP server would close the connection on an AssertionError here, see #68967, but this in fact only happens if we don't call httpServer.setExecutor() and instead just run the handling on the network thread. Now that we're forking this work to ESMockAPIBasedRepositoryIntegTestCase#executorService we are indeed not handling assertion errors in the handler correctly. We do at least bubble the assertion error up to the top level so that it fails the test now, unlike the inline-execution case where the AssertionError is swallowed, but we should also find some way to abort the in-flight requests too.

A separate and broader concern than this issue however.
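For readers unfamiliar with the pattern discussed in this thread, the difference between ignoring the boolean result of latch.await(...) and asserting on it can be sketched as below. BoundedAwaitSketch and awaitOrFail are hypothetical names for this example; this is an assumption-laden illustration of the idea, not the actual safeAwait helper from the Elasticsearch test framework.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a bounded latch wait that fails loudly on timeout. */
final class BoundedAwaitSketch {

    /**
     * Waits for the latch and throws an AssertionError if it is not released in time
     * or if the waiting thread is interrupted.
     */
    static void awaitOrFail(CountDownLatch latch, long timeout, TimeUnit unit) {
        try {
            if (!latch.await(timeout, unit)) {
                throw new AssertionError("latch was not released within " + timeout + " " + unit);
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag before failing so callers can still observe it.
            Thread.currentThread().interrupt();
            throw new AssertionError("interrupted while waiting for latch", e);
        }
    }

    public static void main(String[] args) {
        CountDownLatch released = new CountDownLatch(1);
        // Simulates the client side giving up and releasing the blocked handler.
        new Thread(released::countDown).start();
        awaitOrFail(released, 10, TimeUnit.SECONDS);
        System.out.println("latch released in time");
    }
}
```

As the thread notes, throwing from inside a forked HTTP handler fails the test but may leave the in-flight request hanging, so this kind of assertion favours the success path.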

DaveCTurner

LGTM (one comment but it's optional, no need for another review either way)

DaveCTurner · 2025-12-02T10:31:05Z

modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Service.java

        }
        clientOverrideConfiguration.retryStrategy(retryStrategyBuilder.build());
+        final long apiCallTimeoutMillis = clientSettings.apiCallTimeout.millis();
+        if (apiCallTimeoutMillis > 0) {


Nit: would prefer to pass zero down to the SDK verbatim, even though it rejects it today. But I don't feel strongly about it in this particular case, it's a tiny thing, just more of a general practice for distinguishing "don't pass the value along" vs "pass along a zero".

Suggested change

-        if (apiCallTimeoutMillis > 0) {
+        if (apiCallTimeoutMillis >= 0) {
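The distinction between "don't pass the value along" and "pass along a zero" can be made explicit with an Optional, as in the sketch below. TimeoutPassThroughSketch and toSdkTimeout are hypothetical names for illustration and are not part of S3Service. The sketch assumes, as in this PR, that a negative value means the setting is unset; what the SDK does with a zero duration is left to the SDK (the comment above notes that it rejects it today).

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical sketch: decide whether a configured timeout is forwarded to the SDK. */
final class TimeoutPassThroughSketch {

    /**
     * Returns the timeout to forward verbatim, or empty when the setting is unset.
     * Negative means "not configured"; zero and positive values are passed along as-is.
     */
    static Optional<Duration> toSdkTimeout(long apiCallTimeoutMillis) {
        if (apiCallTimeoutMillis >= 0) {
            return Optional.of(Duration.ofMillis(apiCallTimeoutMillis));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // -1: nothing is forwarded; 0: forwarded verbatim so the SDK decides what it means.
        for (long millis : new long[] { -1L, 0L, 30_000L }) {
            System.out.println(millis + " -> " + toSdkTimeout(millis));
        }
    }
}
```

Keeping zero on the "forward it" side means any validation error comes from the SDK itself rather than being silently masked by the caller.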

Add a test to demostrate the issue

a92d132

ywangd added the >non-issue, :Distributed/Snapshot/Restore, and v9.3.0 labels Nov 14, 2025

ywangd and others added 2 commits November 14, 2025 17:52

forbidden API

e36b2bd

[CI] Auto commit changes from spotless

02101a0

DaveCTurner reviewed Nov 18, 2025


ywangd added 3 commits November 28, 2025 17:15

Merge remote-tracking branch 'origin/main' into s3-write-hanging-fix

3153b52

Replace REST test with internal cluster test

3c1a680

add new apiCallTimoutSetting

a123c80

ywangd marked this pull request as ready for review November 28, 2025 07:34

elasticsearchmachine added the Team:Distributed Coordination (obsolete) label Nov 28, 2025

ywangd requested a review from DaveCTurner November 28, 2025 07:37

ywangd changed the title from ~~Test for S3 upload time unresponsiveness~~ to Add a new setting for s3 API call timeout Nov 28, 2025

ywangd added >enhancement and removed >non-issue labels Nov 28, 2025

ywangd and others added 2 commits November 28, 2025 18:38

Update docs/changelog/138072.yaml

c38f377

Merge branch 'main' into s3-write-hanging-fix

e8745de

DaveCTurner reviewed Nov 28, 2025


ywangd added 7 commits December 1, 2025 10:03

Default to -1. Keep TimeValue

4e3c012

safeAwait

ed85e84

test skip

ad4b7b9

Merge remote-tracking branch 'origin/main' into s3-write-hanging-fix

cd6bff2

Merge remote-tracking branch 'origin/main' into s3-write-hanging-fix

4383f82

expectThrows

6afe3d0

error type

28d5ffd

ywangd requested a review from DaveCTurner December 2, 2025 05:40

DaveCTurner approved these changes Dec 2, 2025


ywangd added 2 commits December 4, 2025 17:55

Merge remote-tracking branch 'origin/main' into s3-write-hanging-fix

4260982

pass 0

14956f9

ywangd added the auto-merge-without-approval label Dec 4, 2025

elasticsearchmachine merged commit 2473604 into elastic:main Dec 4, 2025
34 checks passed

ywangd deleted the s3-write-hanging-fix branch December 4, 2025 08:20

elasticsearchmachine merged 17 commits into elastic:main from ywangd:s3-write-hanging-fix
