Skip to content

Retry when the server can't be resolved#123852

Merged
nicktindall merged 21 commits intoelastic:mainfrom
nicktindall:ES-10574_retry_on_unknown_host
Mar 13, 2025
Merged

Retry when the server can't be resolved#123852
nicktindall merged 21 commits intoelastic:mainfrom
nicktindall:ES-10574_retry_on_unknown_host

Conversation

@nicktindall
Copy link
Contributor

@nicktindall nicktindall commented Mar 3, 2025

The stack trace from ES-10574 seemed to indicate the root case was an UnknownHostException (see here)

This change means we'll use the configured number of retries when an exception caused by an UnknownHostException occurs. Previously there would have been zero retries when such an error occurred.

This was quite tricky to test, but we got there in the end.

I don't know what the other CSPs do under these circumstances, I think it'd be tricky to promote this test to the AbstractBlobContainerRetriesTestCase, because the means of counting the requests would be different for each client impl. But perhaps worthwhile if we want this behaviour to be consistent.

Relates ES-10574

Copy link
Contributor

@mhl-b mhl-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To prevent opencensus thread leak you need to override GoogleCloudStoragePlugin.close() to shutdown tracer's export component. In main code it doesn't do anything, it's NOOP, in testing it will shutdown thread that poll traces queue.

    @Override
    public void close() throws IOException {
        Tracing.getExportComponent().shutdown();
    }
}

protected StorageRetryStrategy getRetryStrategy() {
return new RetryOnNetworkOutageRetryStrategy(StorageRetryStrategy.getLegacyStorageRetryStrategy());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to use StorageRetryStrategy.getDefaultStorageRetryStrategy()? getLegacyStorageRetryStrategy is deprecated and will be removed soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's worth doing as a separate change, I think it comes with more changes due to some test breaks that result (I saw a comment about it somewhere). The behaviour for unknown host is consistent between the two implementations.

import java.util.concurrent.CancellationException;
import java.util.function.Function;

public class DelegatingStorageRetryStrategy<T> implements StorageRetryStrategy {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a convoluted approach to decorating the StorageRetryStrategy, but necessary to allow wrapping of the standard ones (final, package-private)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too bad, it's easy to read later when you decorate LegacyStorageRetryStrategy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking again into this, the delegator seems overkill. I would rather do specialized class that compose default strategy and enhance with our own bits of logic. I cant imagine why we would need multiple retry strategies. Basically remove layer of indirection. Would be easier to read.

if (randomBoolean()) {
logger.info("Failing due to connection refused");
endpointUrlOverride = "http://127.0.0.1:"
+ randomValueOtherThan(httpServer.getAddress().getPort(), () -> randomIntBetween(49152, 65535));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this will randomly select another open port one day and cause a flap. Not sure what the best approach is here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use 127.0.0.1:1 in other places where we need a port that definitely won't be open, see e.g. DiscoveryEc2AvailabilityZoneAttributeNoImdsIT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in a3d37d5

@nicktindall nicktindall marked this pull request as ready for review March 4, 2025 02:00
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Mar 4, 2025
@nicktindall
Copy link
Contributor Author

To prevent opencensus thread leak you need to override GoogleCloudStoragePlugin.close() to shutdown tracer's export component. In main code it doesn't do anything, it's NOOP, in testing it will shutdown thread that poll traces queue.

Thanks for this. I ended up using a different approach, but that is useful info in case we need to go back to it.

@nicktindall nicktindall added >enhancement :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Mar 4, 2025
@elasticsearchmachine elasticsearchmachine added Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. and removed needs:triage Requires assignment of a team area label labels Mar 4, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Copy link
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

@nicktindall nicktindall requested a review from DaveCTurner March 4, 2025 04:03
@mhl-b
Copy link
Contributor

mhl-b commented Mar 4, 2025

I think we need to pair this with DNS caching removal for the unsuccessful resolutions, networkaddress.cache.negative.ttl=0 link

public void testShouldRetryOnNetworkOutage() {
final int maxRetries = randomIntBetween(3, 5);

if (randomBoolean()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have two test cases, one for each branch, rather than picking randomly like this? They can both call the same underlying implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6159881

BlobContainer blobContainer = createBlobContainer(maxRetries, null, null, null, null);
try {
blobContainer.listBlobs(randomPurpose());
fail("Should have thrown an exception");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer expectThrows over doing this manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 8b2e376

@nicktindall
Copy link
Contributor Author

nicktindall commented Mar 4, 2025

I think we need to pair this with DNS caching removal for the unsuccessful resolutions, networkaddress.cache.negative.ttl=0 link

This is a good point. This being a JVM-wide setting I'd be inclined to not do anything with it specifically to serve the Google client, but open to doing so (assuming we don't already somewhere else for something else). Perhaps its something we should set in our serverless/ECH configs.

The default value of 10 would be OK with the default retry config (which we seem to use). That will retry with intervals 1, 2, 4, 8, 16, 32 seconds if I'm reading it right, so we should attempt to re-resolve at the 8/16/32 attempts.

Perhaps its worth logging if the value is set to -1?, but you'd have to limit that or it could get very noisy I imagine.

Copy link
Contributor

@mhl-b mhl-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I still think we dont need DelegatingResultRetryAlgorithm, can be concrete class that adds UnknownHost retry.

@nicktindall
Copy link
Contributor Author

LGTM. I still think we dont need DelegatingResultRetryAlgorithm, can be concrete class that adds UnknownHost retry.

What do you think about the new approach, it would be trivial to in-line that lambda, but it means the decoration logic would be hidden in the massive amount of boilerplate plumbing necessary to do the decoration. The way it is now we can change the underlying implementation easily (which I think we know we need to do) and also have the small amount of shouldRetry decoration logic right there next to it.

@mhl-b
Copy link
Contributor

mhl-b commented Mar 13, 2025

Not a big deal at all. Feel free to merge.

I had something like this in mind.

    record CustomRetryStrategy(StorageRetryStrategy base) implements StorageRetryStrategy {
        static final CustomRetryStrategy INSTANCE = new CustomRetryStrategy(StorageRetryStrategy.getLegacyStorageRetryStrategy());

        @Override
        public ResultRetryAlgorithm<?> getIdempotentHandler() {
            return new WithDNSRetry<>(base.getIdempotentHandler());
        }

        @Override
        public ResultRetryAlgorithm<?> getNonidempotentHandler() {
            return new WithDNSRetry<>(base.getNonidempotentHandler());
        }
    }

    record WithDNSRetry<T>(ResultRetryAlgorithm<T> baseAlgo) implements ResultRetryAlgorithm<T> {

        @Override
        public TimedAttemptSettings createNextAttempt(Throwable prevThrowable, T prevResponse, TimedAttemptSettings prevSettings) {
            return baseAlgo.createNextAttempt(prevThrowable, prevResponse, prevSettings);
        }

        @Override
        public boolean shouldRetry(Throwable prevThrowable, T prevResponse) throws CancellationException {
            if (ExceptionsHelper.unwrap(prevThrowable, UnknownHostException.class) != null) {
                return true;
            }
            return baseAlgo.shouldRetry(prevThrowable, prevResponse);
        }
    }
        final StorageOptions.Builder storageOptionsBuilder = StorageOptions.newBuilder()
            .setStorageRetryStrategy(CustomRetryStrategy.INSTANCE)
            .setTransportOptions(httpTransportOptions)
@nicktindall nicktindall merged commit 74d61a4 into elastic:main Mar 13, 2025
17 checks passed
@nicktindall nicktindall deleted the ES-10574_retry_on_unknown_host branch March 13, 2025 01:38
albertzaharovits pushed a commit to albertzaharovits/elasticsearch that referenced this pull request Mar 13, 2025
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Mar 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.1.0

5 participants