Add a new setting for s3 API call timeout#138072
Add a new setting for s3 API call timeout#138072elasticsearchmachine merged 17 commits intoelastic:mainfrom
Conversation
DaveCTurner
left a comment
There was a problem hiding this comment.
I think this'd be better as an internal-cluster test akin to e.g. S3BlobStoreRepositoryTests or S3BlobContainerRetriesTests. That way we can call writeBlob directly, rather than having to construct a shard of the right sort of size to trigger such a call during the snapshot, and also we can cancel the sleep when the request times out.
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
|
Replaced the REST test with an internal cluster test as suggested in 3c1a680 Also added a new |
|
Hi @ywangd, I've created a changelog YAML for you. |
DaveCTurner
left a comment
There was a problem hiding this comment.
Some suggestions/nits but otherwise looks like a sensible approach to me.
modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3ClientSettings.java
Outdated
Show resolved
Hide resolved
modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3ClientSettings.java
Outdated
Show resolved
Hide resolved
| new BytesArray(randomBytes((int) ByteSizeValue.ofMb(10).getBytes())), | ||
| randomBoolean() | ||
| ); | ||
| fail("should have timed out"); |
There was a problem hiding this comment.
Could we use expectThrows() here?
There was a problem hiding this comment.
This to ensure the writeBlob cannot succeed. So there is no exception to expect if the code reaches here?
There was a problem hiding this comment.
I mean the whole try {...; fail();} catch (...) construct, not just this one line.
| } catch (IOException e) { | ||
| final var cause = ExceptionsHelper.unwrap(e, ApiCallTimeoutException.class); | ||
| assertNotNull(cause); | ||
| assertThat(cause.getMessage(), containsString("Client execution did not complete before the specified timeout configuration")); |
There was a problem hiding this comment.
Could we assert something about how these outcomes are captured in the metrics? Should we add something specific to S3RepositoriesMetrics to track this case separately from other exceptions?
There was a problem hiding this comment.
Unfortunately it seems there is no unambiguous metric for this exception. First of all, the error types available in the metric are coarse grained. The relevant one should be ClientTimeout but it does not tell exactly which client side timeout it is. Secondly, this seems to be a bug in the SDK, the error type reported in this case is Other instead of ClientTimeout because the underlying exception is an InterruptedException which gets translated into Other.
I remember the sdk v1 allows access to request and response objects at metric reporting time which would be helpful in this case. But that is no longer possible.
With what we have now, I think what we can do is to add an additional metric label for error type when recording number of errors. Though it does not tell us definitely about the exact error, it at least gives us some hints on what have happened by correlating with the API call duration. If AWS fixes the error type for API call timeout, it should also benefit from it automatically. What do you think?
There was a problem hiding this comment.
Yes seems like a bug indeed, the code you link explicitly mentions ApiCallTimeoutException and ApiCallAttemptTimeoutException but there is some missing wrapping somewhere it seems.
And yes I think breaking down the metrics by error type would be helpful. There's only 5 of them.
But in the context of this change could we just say how it behaves today with respect to the existing metrics? That way, if things do change in future we'll at least be aware of the new behaviour.
There was a problem hiding this comment.
Pushed 28d5ffd Please let me know if it matches your suggestion. Thanks!
| headerDecodedContentLength | ||
| ); | ||
| try { | ||
| final var released = latch.await(60, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Could we use safeAwait()? At least, I'd like us to be asserting that the latch is released in a reasonably timely manner.
There was a problem hiding this comment.
I changed it to safeAwait in ed85e84
I didn't use safeAwait because the test ends faster in failure scenario without it. In failure scenario, safeAwait will kill the server and leave the client hanging till the test timeout in 30m. But I guess we are more interested in the success path and ensure things are released in time.
There was a problem hiding this comment.
Ah interesting.
It used to be that this HTTP server would close the connection on an AssertionError here, see #68967, but this in fact only happens if we don't call httpServer.setExecutor() and instead just run the handling on the network thread. Now that we're forking this work to ESMockAPIBasedRepositoryIntegTestCase#executorService we are indeed not handling assertion errors in the handler correctly. We do at least bubble the assertion error up to the top level so that it fails the test now, unlike the inline-execution case where the AssertionError is swallowed, but we should also find some way to abort the in-flight requests too.
A separate and broader concern than this issue however.
DaveCTurner
left a comment
There was a problem hiding this comment.
LGTM (one comment but it's optional, no need for another review either way)
| } | ||
| clientOverrideConfiguration.retryStrategy(retryStrategyBuilder.build()); | ||
| final long apiCallTimeoutMillis = clientSettings.apiCallTimeout.millis(); | ||
| if (apiCallTimeoutMillis > 0) { |
There was a problem hiding this comment.
Nit: would prefer to pass zero down to the SDK verbatim, even though it rejects it today. But I don't feel strongly about it in this particular case, it's a tiny thing, just more of a general practice for distinguishing "don't pass the value along" vs "pass along a zero".
| if (apiCallTimeoutMillis > 0) { | |
| if (apiCallTimeoutMillis >= 0) { |
This PR adds a new setting to configure AWS SDK API call timeout (https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/timeouts.html). Defaults to -1, i.e. no timeout.