Retry when the server can't be resolved #123852
Conversation
To prevent an OpenCensus thread leak you need to override GoogleCloudStoragePlugin.close() to shut down the tracer's export component. In main code it does nothing (it's a no-op); in tests it shuts down the thread that polls the traces queue.
@Override
public void close() throws IOException {
    Tracing.getExportComponent().shutdown();
}
protected StorageRetryStrategy getRetryStrategy() {
    return new RetryOnNetworkOutageRetryStrategy(StorageRetryStrategy.getLegacyStorageRetryStrategy());
Why not use StorageRetryStrategy.getDefaultStorageRetryStrategy()? getLegacyStorageRetryStrategy is deprecated and will be removed soon.
I think that's worth doing as a separate change; it comes with more churn due to some resulting test breaks (I saw a comment about it somewhere). The behaviour for unknown host is consistent between the two implementations.
import java.util.concurrent.CancellationException;
import java.util.function.Function;

public class DelegatingStorageRetryStrategy<T> implements StorageRetryStrategy {
Quite a convoluted approach to decorating the StorageRetryStrategy, but necessary to allow wrapping of the standard ones (they are final and package-private).
Not too bad; it's easy to read later when you decorate LegacyStorageRetryStrategy.
Looking at this again, the delegator seems like overkill. I would rather write a specialized class that composes the default strategy and enhances it with our own bits of logic. I can't imagine why we would need multiple retry strategies. Basically, remove the layer of indirection; it would be easier to read.
if (randomBoolean()) {
    logger.info("Failing due to connection refused");
    endpointUrlOverride = "http://127.0.0.1:"
        + randomValueOtherThan(httpServer.getAddress().getPort(), () -> randomIntBetween(49152, 65535));
I wonder if this will randomly select another open port one day and cause a flap. Not sure what the best approach is here.
We use 127.0.0.1:1 in other places where we need a port that definitely won't be open, see e.g. DiscoveryEc2AvailabilityZoneAttributeNoImdsIT.
Thanks for this. I ended up using a different approach, but that is useful info in case we need to go back to it.
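As an aside, the reason 127.0.0.1:1 works as a "definitely closed" endpoint can be seen with a tiny standalone sketch (not from the PR; plain java.net, no Elasticsearch code): port 1 (tcpmux) is privileged and essentially never bound, so the connect attempt fails fast rather than hanging.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ClosedPortSketch {
    public static void main(String[] args) {
        // Connecting to 127.0.0.1:1 is expected to fail quickly with a
        // ConnectException on any machine where nothing listens on port 1.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("127.0.0.1", 1), 1000);
            System.out.println("unexpectedly connected");
        } catch (IOException e) {
            System.out.println("connect failed: " + e.getClass().getSimpleName());
        }
    }
}
```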
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
Hi @nicktindall, I've created a changelog YAML for you. |
I think we need to pair this with DNS caching removal for the unsuccessful resolutions, |
...ory-gcs/src/main/java/org/elasticsearch/repositories/gcs/DelegatingStorageRetryStrategy.java
public void testShouldRetryOnNetworkOutage() {
    final int maxRetries = randomIntBetween(3, 5);

    if (randomBoolean()) {
Could we have two test cases, one for each branch, rather than picking randomly like this? They can both call the same underlying implementation.
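A sketch of what the reviewer is suggesting (the method names and the shared helper are illustrative, not from the PR): two deterministic test methods that each exercise one failure mode, delegating to one parameterised implementation instead of branching on randomBoolean().

```java
// Illustrative only: in the real suite these would be test methods on the
// blob-container test class, not static methods with a main().
public class TwoTestCasesSketch {
    static void runNetworkOutageTest(boolean connectionRefused) {
        String mode = connectionRefused ? "connection refused" : "unknown host";
        System.out.println("testing retry on " + mode);
        // ... set up endpointUrlOverride for the chosen failure mode,
        // then assert the expected number of retries happened.
    }

    static void testShouldRetryOnConnectionRefused() { runNetworkOutageTest(true); }

    static void testShouldRetryOnUnknownHost() { runNetworkOutageTest(false); }

    public static void main(String[] args) {
        testShouldRetryOnConnectionRefused();
        testShouldRetryOnUnknownHost();
    }
}
```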
BlobContainer blobContainer = createBlobContainer(maxRetries, null, null, null, null);
try {
    blobContainer.listBlobs(randomPurpose());
    fail("Should have thrown an exception");
I'd prefer expectThrows over doing this manually.
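For illustration, here is a minimal self-contained stand-in for the expectThrows helper (the real one comes from the test framework; this sketch only shows the shape of the pattern the reviewer prefers over a manual try/fail):

```java
public class ExpectThrowsSketch {
    // Simplified stand-in for the test framework's expectThrows: run the
    // action, assert it throws the expected type, and return the exception.
    static <T extends Throwable> T expectThrows(Class<T> expected, Runnable action) {
        try {
            action.run();
        } catch (Throwable t) {
            if (expected.isInstance(t)) {
                return expected.cast(t);
            }
            throw new AssertionError("unexpected exception type: " + t, t);
        }
        throw new AssertionError("expected " + expected.getSimpleName() + " but nothing was thrown");
    }

    public static void main(String[] args) {
        IllegalStateException e = expectThrows(IllegalStateException.class, () -> {
            throw new IllegalStateException("boom");
        });
        System.out.println(e.getMessage());
    }
}
```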
…ories/gcs/DelegatingStorageRetryStrategy.java
Co-authored-by: Yang Wang <ywangd@gmail.com>
This is a good point. This being a JVM-wide setting, I'd be inclined not to do anything with it specifically to serve the Google client, but I'm open to doing so (assuming we don't already somewhere else for something else). Perhaps it's something we should set in our serverless/ECH configs. The default value of 10 would be OK with the default retry config (which we seem to use). That will retry with intervals of 1, 2, 4, 8, 16 and 32 seconds if I'm reading it right, so we should attempt to re-resolve at the 8/16/32 attempts. Perhaps it's worth logging if the value is set to
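The arithmetic in that comment can be checked with a short standalone sketch (my own, not from the PR): assuming the client's default exponential backoff (1s initial delay, multiplier 2) and the JVM's default negative DNS cache TTL of 10 seconds (networkaddress.cache.negative.ttl), the cumulative wait first exceeds the TTL at the fourth retry, so the attempts after the 8/16/32-second waits are the ones that can hit DNS again.

```java
public class BackoffVsNegativeTtl {
    public static void main(String[] args) {
        final long negativeTtlSeconds = 10; // JVM default negative DNS cache TTL
        long delay = 1;   // seconds; assumed initial retry delay
        long elapsed = 0; // cumulative wait across retries
        for (int attempt = 1; attempt <= 6; attempt++) {
            elapsed += delay;
            System.out.println("retry " + attempt + ": waited " + delay
                + "s (total " + elapsed + "s), negative cache expired: "
                + (elapsed >= negativeTtlSeconds));
            delay *= 2;
        }
    }
}
```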
mhl-b left a comment
LGTM. I still think we don't need DelegatingResultRetryAlgorithm; it can be a concrete class that adds the UnknownHost retry.
What do you think about the new approach? It would be trivial to in-line that lambda, but it means the decoration logic would be hidden in the massive amount of boilerplate plumbing necessary to do the decoration. The way it is now we can change the underlying implementation easily (which I think we know we need to do) and also have the small amount of
Not a big deal at all. Feel free to merge. I had something like this in mind:

record CustomRetryStrategy(StorageRetryStrategy base) implements StorageRetryStrategy {
    static final CustomRetryStrategy INSTANCE = new CustomRetryStrategy(StorageRetryStrategy.getLegacyStorageRetryStrategy());

    @Override
    public ResultRetryAlgorithm<?> getIdempotentHandler() {
        return new WithDNSRetry<>(base.getIdempotentHandler());
    }

    @Override
    public ResultRetryAlgorithm<?> getNonidempotentHandler() {
        return new WithDNSRetry<>(base.getNonidempotentHandler());
    }
}

record WithDNSRetry<T>(ResultRetryAlgorithm<T> baseAlgo) implements ResultRetryAlgorithm<T> {
    @Override
    public TimedAttemptSettings createNextAttempt(Throwable prevThrowable, T prevResponse, TimedAttemptSettings prevSettings) {
        return baseAlgo.createNextAttempt(prevThrowable, prevResponse, prevSettings);
    }

    @Override
    public boolean shouldRetry(Throwable prevThrowable, T prevResponse) throws CancellationException {
        if (ExceptionsHelper.unwrap(prevThrowable, UnknownHostException.class) != null) {
            return true;
        }
        return baseAlgo.shouldRetry(prevThrowable, prevResponse);
    }
}

final StorageOptions.Builder storageOptionsBuilder = StorageOptions.newBuilder()
    .setStorageRetryStrategy(CustomRetryStrategy.INSTANCE)
    .setTransportOptions(httpTransportOptions)
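The heart of that suggestion, unwrapping the cause chain to find an UnknownHostException, can be demonstrated standalone. This sketch uses a hand-rolled unwrap (the real ExceptionsHelper.unwrap in Elasticsearch also guards against cause-chain cycles, which is omitted here for brevity):

```java
import java.net.UnknownHostException;

public class UnwrapSketch {
    // Simplified stand-in for ExceptionsHelper.unwrap: walk the cause chain
    // looking for an instance of the given throwable type.
    static Throwable unwrap(Throwable t, Class<? extends Throwable> type) {
        while (t != null) {
            if (type.isInstance(t)) {
                return t;
            }
            t = t.getCause();
        }
        return null;
    }

    static boolean shouldRetry(Throwable t) {
        return unwrap(t, UnknownHostException.class) != null;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException(
            new java.io.IOException(new UnknownHostException("storage.googleapis.com")));
        System.out.println(shouldRetry(wrapped));                        // deeply nested cause found
        System.out.println(shouldRetry(new RuntimeException("other"))); // no matching cause
    }
}
```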
The stack trace from ES-10574 seemed to indicate the root cause was an UnknownHostException (see here). This change means we'll use the configured number of retries when an exception caused by an UnknownHostException occurs. Previously there would have been zero retries when such an error occurred. This was quite tricky to test, but we got there in the end.

I don't know what the other CSPs do under these circumstances. I think it'd be tricky to promote this test to the AbstractBlobContainerRetriesTestCase, because the means of counting the requests would be different for each client impl. But perhaps worthwhile if we want this behaviour to be consistent.

Relates ES-10574