Retry when the server can't be resolved by nicktindall · Pull Request #123852 · elastic/elasticsearch

nicktindall · 2025-03-03T08:10:27Z

The stack trace from ES-10574 seemed to indicate the root cause was an UnknownHostException.

This change means we'll use the configured number of retries when an exception caused by an UnknownHostException occurs. Previously there would have been zero retries when such an error occurred.

This was quite tricky to test, but we got there in the end.

I don't know what the other CSPs do under these circumstances. I think it'd be tricky to promote this test to AbstractBlobContainerRetriesTestCase, because the means of counting requests would differ for each client implementation. It may be worthwhile, though, if we want this behaviour to be consistent.

Relates ES-10574
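
As a rough illustration of what "an exception caused by an UnknownHostException" means in practice, the sketch below walks an exception's cause chain looking for that type. The class and method names are hypothetical and the host name is a placeholder; this is not the code added by this PR.

```java
import java.io.IOException;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not part of the PR.
final class UnknownHostCauseCheck {

    /** Returns true if t, or any exception in its cause chain, is an UnknownHostException. */
    static boolean causedByUnknownHost(Throwable t) {
        int depth = 0;
        // The depth bound guards against pathological (cyclic) cause chains.
        for (Throwable c = t; c != null && depth < 10; c = c.getCause(), depth++) {
            if (c instanceof UnknownHostException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Client libraries typically wrap the low-level failure several layers deep.
        Throwable wrapped = new RuntimeException(new IOException(new UnknownHostException("storage.example.invalid")));
        System.out.println(causedByUnknownHost(wrapped));                            // true
        System.out.println(causedByUnknownHost(new IOException("connection reset"))); // false
    }
}
```

A retry predicate built on a check like this can return true before consulting the library's default decision, which is the shape of the change discussed in the review below.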

mhl-b

To prevent the OpenCensus thread leak, you need to override GoogleCloudStoragePlugin.close() to shut down the tracer's export component. In production code this does nothing (it's a no-op); in tests it shuts down the thread that polls the trace queue.

    @Override
    public void close() throws IOException {
        Tracing.getExportComponent().shutdown();
    }

mhl-b · 2025-03-03T23:58:25Z

...pository-gcs/src/main/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageService.java

    }

+    protected StorageRetryStrategy getRetryStrategy() {
+        return new RetryOnNetworkOutageRetryStrategy(StorageRetryStrategy.getLegacyStorageRetryStrategy());


Why not use StorageRetryStrategy.getDefaultStorageRetryStrategy()? getLegacyStorageRetryStrategy is deprecated and will be removed soon.

I think that's worth doing as a separate change. It comes with more changes because of the test breakages that result (I saw a comment about it somewhere). The behaviour for unknown hosts is consistent between the two implementations.

nicktindall · 2025-03-04T01:26:13Z

...ory-gcs/src/main/java/org/elasticsearch/repositories/gcs/DelegatingStorageRetryStrategy.java

+import java.util.concurrent.CancellationException;
+import java.util.function.Function;
+
+public class DelegatingStorageRetryStrategy<T> implements StorageRetryStrategy {


Quite a convoluted approach to decorating the StorageRetryStrategy, but necessary to allow wrapping of the standard ones (final, package-private)

Not too bad, it's easy to read later when you decorate LegacyStorageRetryStrategy

Looking at this again, the delegator seems like overkill. I would rather write a specialized class that composes the default strategy and enhances it with our own bits of logic. I can't imagine why we would need multiple retry strategies. Basically, remove a layer of indirection; it would be easier to read.

nicktindall · 2025-03-04T01:53:09Z

...est/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageBlobContainerRetriesTests.java

+        if (randomBoolean()) {
+            logger.info("Failing due to connection refused");
+            endpointUrlOverride = "http://127.0.0.1:"
+                + randomValueOtherThan(httpServer.getAddress().getPort(), () -> randomIntBetween(49152, 65535));


I wonder if this will randomly select another open port one day and cause a flap. Not sure what the best approach is here.

We use 127.0.0.1:1 in other places where we need a port that definitely won't be open, see e.g. DiscoveryEc2AvailabilityZoneAttributeNoImdsIT.

Addressed in a3d37d5
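
To illustrate the 127.0.0.1:1 suggestion, here is a small standalone probe. It assumes that nothing is listening on TCP port 1 on the local machine, which is normally the case but is not guaranteed; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical probe for illustration only; not part of the PR's test code.
final class ClosedPortProbe {

    /** Attempts a TCP connection and returns the failure, or null if the connection succeeded. */
    static IOException tryConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return null;
        } catch (IOException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        // Assumption: port 1 is closed locally, so the connect is refused immediately.
        IOException failure = tryConnect("127.0.0.1", 1, 1000);
        System.out.println(failure == null ? "connected" : failure.getClass().getName());
    }
}
```

Compared with picking a random high port, a fixed low port removes the chance of landing on a port that some other process happens to have open.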

nicktindall · 2025-03-04T02:08:24Z

> To prevent the OpenCensus thread leak, you need to override GoogleCloudStoragePlugin.close() to shut down the tracer's export component. In production code this does nothing (it's a no-op); in tests it shuts down the thread that polls the trace queue.

Thanks for this. I ended up using a different approach, but that is useful info in case we need to go back to it.

elasticsearchmachine · 2025-03-04T03:52:30Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

elasticsearchmachine · 2025-03-04T03:52:30Z

Hi @nicktindall, I've created a changelog YAML for you.

mhl-b · 2025-03-04T04:49:23Z

I think we need to pair this with disabling DNS caching of unsuccessful resolutions: networkaddress.cache.negative.ttl=0.

DaveCTurner · 2025-03-04T14:47:19Z

...est/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageBlobContainerRetriesTests.java

+    public void testShouldRetryOnNetworkOutage() {
+        final int maxRetries = randomIntBetween(3, 5);
+
+        if (randomBoolean()) {


Could we have two test cases, one for each branch, rather than picking randomly like this? They can both call the same underlying implementation.

Addressed in 6159881

DaveCTurner · 2025-03-04T14:47:55Z

...est/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageBlobContainerRetriesTests.java

+        BlobContainer blobContainer = createBlobContainer(maxRetries, null, null, null, null);
+        try {
+            blobContainer.listBlobs(randomPurpose());
+            fail("Should have thrown an exception");


I'd prefer expectThrows over doing this manually.

Addressed in 8b2e376
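
For readers unfamiliar with the idiom, the sketch below shows why expectThrows reads better than a manual try/fail block. Elasticsearch tests inherit an expectThrows helper from the Lucene test framework; the version here is a simplified, hypothetical stand-in written so the example runs on its own.

```java
// Simplified stand-in for the test framework's helper; illustration only.
final class ExpectThrowsSketch {

    @FunctionalInterface
    interface ThrowingRunnable {
        void run() throws Throwable;
    }

    /** Runs the code and returns the thrown exception if it has the expected type; fails otherwise. */
    static <T extends Throwable> T expectThrows(Class<T> expectedType, ThrowingRunnable code) {
        try {
            code.run();
        } catch (Throwable t) {
            if (expectedType.isInstance(t)) {
                return expectedType.cast(t);
            }
            throw new AssertionError("Expected " + expectedType.getSimpleName() + " but got " + t, t);
        }
        throw new AssertionError("Expected " + expectedType.getSimpleName() + " but nothing was thrown");
    }

    public static void main(String[] args) {
        // The "should have thrown" failure is built into the helper, so the test body stays one statement.
        IllegalStateException e = expectThrows(IllegalStateException.class, () -> {
            throw new IllegalStateException("boom");
        });
        System.out.println(e.getMessage()); // boom
    }
}
```

The returned exception can then be inspected with ordinary assertions, for example to check its cause chain.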

nicktindall · 2025-03-04T22:41:52Z

> I think we need to pair this with disabling DNS caching of unsuccessful resolutions: networkaddress.cache.negative.ttl=0.

This is a good point. Since this is a JVM-wide setting, I'd be inclined not to change it specifically to serve the Google client, but I'm open to doing so (assuming we don't already set it somewhere else for another reason). Perhaps it's something we should set in our serverless/ECH configs.

The default value of 10 (seconds) would be OK with the default retry config (which we seem to use). If I'm reading it right, that retries at intervals of 1, 2, 4, 8, 16 and 32 seconds, so we should attempt to re-resolve on the retries that follow the 8, 16 and 32 second waits.

Perhaps it's worth logging if the value is set to -1 (cache failed lookups forever), but you'd have to limit that or it could get very noisy, I imagine.
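
The arithmetic behind the 8/16/32 claim can be checked with a small simulation. This is a sketch under simplifying assumptions that are not from the PR: the first attempt fails to resolve at t=0, every later lookup also fails, request time and backoff jitter are ignored, and a failed lookup is cached for exactly the negative TTL. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; ignores jitter and request duration.
final class NegativeDnsCacheSketch {

    /**
     * Returns the backoff delays (in seconds) after which a retry performs a fresh DNS lookup
     * rather than being answered from the negative cache.
     */
    static List<Integer> delaysThatReResolve(int[] delaysSeconds, int negativeTtlSeconds) {
        List<Integer> result = new ArrayList<>();
        if (negativeTtlSeconds < 0) {
            return result; // a negative TTL means failures are cached forever
        }
        long now = 0;
        long cachedUntil = negativeTtlSeconds; // the failure at t=0 is cached until this time
        for (int delay : delaysSeconds) {
            now += delay;
            if (now >= cachedUntil) {
                result.add(delay);
                cachedUntil = now + negativeTtlSeconds; // the fresh lookup fails and is cached again
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] delays = {1, 2, 4, 8, 16, 32};
        System.out.println(delaysThatReResolve(delays, 10)); // [8, 16, 32]
        System.out.println(delaysThatReResolve(delays, 0));  // [1, 2, 4, 8, 16, 32]
        System.out.println(delaysThatReResolve(delays, -1)); // []
    }
}
```

With a TTL of 10 the retries land at t=1, 3, 7, 15, 31 and 63 seconds, and only the last three fall outside a cached failure. The effective value can be read at runtime with `java.security.Security.getProperty("networkaddress.cache.negative.ttl")`.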

mhl-b

LGTM. I still think we don't need DelegatingResultRetryAlgorithm; it can be a concrete class that adds the UnknownHost retry.

nicktindall · 2025-03-12T23:48:07Z

> LGTM. I still think we don't need DelegatingResultRetryAlgorithm; it can be a concrete class that adds the UnknownHost retry.

What do you think about the new approach? It would be trivial to inline that lambda, but then the decoration logic would be hidden in the massive amount of boilerplate plumbing needed to do the decoration. The way it is now, we can change the underlying implementation easily (which I think we know we need to do) and also have the small amount of shouldRetry decoration logic right there next to it.
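
To make the trade-off concrete, here is a self-contained sketch of the decoration idea. The `RetryAlgorithm` and `RetryStrategy` interfaces are hypothetical stand-ins for the client library's types, simplified to a single-argument `shouldRetry`; this is an approximation of the pattern, not the code merged in this PR.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.function.UnaryOperator;

// Illustration only: all types here are simplified, hypothetical stand-ins.
final class DecoratedRetrySketch {

    /** Stand-in for the library's per-request retry decision. */
    interface RetryAlgorithm {
        boolean shouldRetry(Throwable previousThrowable);
    }

    /** Stand-in for a strategy that exposes one handler per request kind. */
    interface RetryStrategy {
        RetryAlgorithm idempotentHandler();
        RetryAlgorithm nonIdempotentHandler();
    }

    /** The plumbing: applies the same decorator to every handler of the delegate. */
    record DelegatingRetryStrategy(RetryStrategy delegate, UnaryOperator<RetryAlgorithm> decorator) implements RetryStrategy {
        @Override
        public RetryAlgorithm idempotentHandler() {
            return decorator.apply(delegate.idempotentHandler());
        }

        @Override
        public RetryAlgorithm nonIdempotentHandler() {
            return decorator.apply(delegate.nonIdempotentHandler());
        }
    }

    static boolean causedByUnknownHost(Throwable t) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < 10; c = c.getCause(), depth++) {
            if (c instanceof UnknownHostException) {
                return true;
            }
        }
        return false;
    }

    /** The decoration logic sits in one lambda, next to the choice of base strategy. */
    static RetryStrategy withUnknownHostRetry(RetryStrategy base) {
        return new DelegatingRetryStrategy(base, algo -> t -> causedByUnknownHost(t) || algo.shouldRetry(t));
    }

    /** A base strategy that never retries, used to show the effect of the decoration. */
    static RetryStrategy neverRetry() {
        return new RetryStrategy() {
            @Override
            public RetryAlgorithm idempotentHandler() {
                return t -> false;
            }

            @Override
            public RetryAlgorithm nonIdempotentHandler() {
                return t -> false;
            }
        };
    }

    public static void main(String[] args) {
        RetryStrategy strategy = withUnknownHostRetry(neverRetry());
        Throwable dnsFailure = new IOException(new UnknownHostException("storage.example.invalid"));
        System.out.println(strategy.idempotentHandler().shouldRetry(dnsFailure));                  // true
        System.out.println(strategy.idempotentHandler().shouldRetry(new IOException("refused"))); // false
    }
}
```

Swapping the base strategy then only touches the argument to `withUnknownHostRetry`, while the alternative proposed in the next comment folds the same logic into two concrete classes.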

mhl-b · 2025-03-13T01:25:21Z

Not a big deal at all. Feel free to merge.

I had something like this in mind.

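    import com.google.api.gax.retrying.ResultRetryAlgorithm;
    import com.google.api.gax.retrying.TimedAttemptSettings;
    import com.google.cloud.storage.StorageRetryStrategy;
    import org.elasticsearch.ExceptionsHelper;

    import java.net.UnknownHostException;
    import java.util.concurrent.CancellationException;

    // Composes the library's strategy and wraps both of its handlers with the DNS retry.
    record CustomRetryStrategy(StorageRetryStrategy base) implements StorageRetryStrategy {
        static final CustomRetryStrategy INSTANCE = new CustomRetryStrategy(StorageRetryStrategy.getLegacyStorageRetryStrategy());

        @Override
        public ResultRetryAlgorithm<?> getIdempotentHandler() {
            return new WithDNSRetry<>(base.getIdempotentHandler());
        }

        @Override
        public ResultRetryAlgorithm<?> getNonidempotentHandler() {
            return new WithDNSRetry<>(base.getNonidempotentHandler());
        }
    }

    // Delegates everything to the base algorithm, except that DNS failures are always retried.
    record WithDNSRetry<T>(ResultRetryAlgorithm<T> baseAlgo) implements ResultRetryAlgorithm<T> {

        @Override
        public TimedAttemptSettings createNextAttempt(Throwable prevThrowable, T prevResponse, TimedAttemptSettings prevSettings) {
            return baseAlgo.createNextAttempt(prevThrowable, prevResponse, prevSettings);
        }

        @Override
        public boolean shouldRetry(Throwable prevThrowable, T prevResponse) throws CancellationException {
            // Retry if an UnknownHostException appears anywhere in the cause chain.
            if (ExceptionsHelper.unwrap(prevThrowable, UnknownHostException.class) != null) {
                return true;
            }
            return baseAlgo.shouldRetry(prevThrowable, prevResponse);
        }
    }

    // Wiring it into the client options (excerpt from the existing builder chain):
    final StorageOptions.Builder storageOptionsBuilder = StorageOptions.newBuilder()
        .setStorageRetryStrategy(CustomRetryStrategy.INSTANCE)
        .setTransportOptions(httpTransportOptions)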

Retry when the server can't be resolved

7347385

elasticsearchmachine added the v9.1.0 label Mar 3, 2025

nicktindall requested a review from mhl-b March 3, 2025 22:02

Merge remote-tracking branch 'origin/main' into ES-10574_retry_on_unk…

55127ec

…nown_host

mhl-b reviewed Mar 4, 2025

nicktindall added 4 commits March 4, 2025 12:17

Count requests using interceptor

614b763

Factor out DelegatingStorageRetryStrategy

0967e26

Remove opencensus remnants

456d916

Fix spotless

166c8bc

nicktindall added 2 commits March 4, 2025 12:29

Remove unnecessary @before

4df0a4c

Tidy/document

5b46d60

nicktindall added 2 commits March 4, 2025 12:54

Tidy

ab29e14

Merge branch 'main' into ES-10574_retry_on_unknown_host

24a108d

nicktindall marked this pull request as ready for review March 4, 2025 02:00

elasticsearchmachine added the needs:triage label Mar 4, 2025

nicktindall added the >enhancement and :Distributed/Snapshot/Restore labels Mar 4, 2025

elasticsearchmachine added the Team:Distributed Coordination (obsolete) label and removed the needs:triage label Mar 4, 2025

Update docs/changelog/123852.yaml

0cccf7e

nicktindall requested a review from DaveCTurner March 4, 2025 04:03

Merge branch 'main' into ES-10574_retry_on_unknown_host

64258ad

ywangd reviewed Mar 4, 2025

DaveCTurner reviewed Mar 4, 2025

nicktindall and others added 5 commits March 5, 2025 09:17

Update modules/repository-gcs/src/main/java/org/elasticsearch/reposit…

87398cc

…ories/gcs/DelegatingStorageRetryStrategy.java Co-authored-by: Yang Wang <ywangd@gmail.com>

Merge remote-tracking branch 'origin/main' into ES-10574_retry_on_unk…

e4df9b2

…nown_host

Use port 1 for connection refused test

a3d37d5

Split unresolvable and connection refused into separate test cases

6159881

Use expectThrows

8b2e376

nicktindall added 2 commits March 5, 2025 09:46

Make summary more specific

54d7090

Merge branch 'main' into ES-10574_retry_on_unknown_host

004319f

mhl-b approved these changes Mar 12, 2025

nicktindall added 2 commits March 13, 2025 10:35

Try and simplify shouldRetry override

904e045

Merge remote-tracking branch 'origin/main' into ES-10574_retry_on_unk…

1a77ba8

…nown_host

nicktindall merged commit 74d61a4 into elastic:main Mar 13, 2025
17 checks passed

nicktindall deleted the ES-10574_retry_on_unknown_host branch March 13, 2025 01:38

albertzaharovits pushed a commit to albertzaharovits/elasticsearch that referenced this pull request Mar 13, 2025

Retry when the server can't be resolved (elastic#123852)

a9f0779

jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Mar 13, 2025

Retry when the server can't be resolved (elastic#123852)

3681095

nicktindall merged 21 commits into elastic:main from nicktindall:ES-10574_retry_on_unknown_host

5 participants
