Retry on TLS handshake errors in ES client#6767
Merged
ycombinator merged 3 commits intoelastic:mainfrom Apr 8, 2026
Merged
Conversation
Contributor
|
This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
|
michel-laterman
approved these changes
Apr 8, 2026
Contributor
michel-laterman
left a comment
There was a problem hiding this comment.
LGTM. We should consider replaceing isTLSHandshakeError with a more generic call in the future
When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
errors.As already returns false on a nil error, so the explicit early-out is unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
928c3c8 to
73859a4
Compare
Contributor
|
@Mergifyio backport 8.19 9.3 9.4 |
Contributor
✅ Backports have been createdDetails
Cherry-pick of 52f9cbe has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
|
mergify Bot
pushed a commit
that referenced
this pull request
Apr 8, 2026
* Retry on TLS handshake errors in ES client When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Regenerate changelog fragment via elastic-agent-changelog-tool Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Drop redundant nil guard in isTLSHandshakeError errors.As already returns false on a nil error, so the explicit early-out is unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 52f9cbe) # Conflicts: # internal/pkg/es/client.go
6 tasks
mergify Bot
pushed a commit
that referenced
this pull request
Apr 8, 2026
* Retry on TLS handshake errors in ES client When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Regenerate changelog fragment via elastic-agent-changelog-tool Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Drop redundant nil guard in isTLSHandshakeError errors.As already returns false on a nil error, so the explicit early-out is unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 52f9cbe)
6 tasks
mergify Bot
pushed a commit
that referenced
this pull request
Apr 8, 2026
* Retry on TLS handshake errors in ES client When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Regenerate changelog fragment via elastic-agent-changelog-tool Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Drop redundant nil guard in isTLSHandshakeError errors.As already returns false on a nil error, so the explicit early-out is unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 52f9cbe)
6 tasks
ycombinator
added a commit
that referenced
this pull request
Apr 8, 2026
The 8.19 es/client.go has none of the applyDefaultOptions retry infrastructure from main, so the cherry-pick left conflict markers around imports, applyDefaultOptions, and the new helpers. Drop the applyDefaultOptions block (depends on backoff/syscall/WithRetryOnStatus which don't exist on this branch), keep WithRetryOnErrs, WithRetryOnTLSHandshakeError, and isTLSHandshakeError, and wire WithRetryOnTLSHandshakeError into the elasticsearchOptions slices in server/fleet.go and bulk/engine.go so the fix actually takes effect on this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ycombinator
added a commit
that referenced
this pull request
Apr 8, 2026
* Retry on TLS handshake errors in ES client When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. * Regenerate changelog fragment via elastic-agent-changelog-tool * Drop redundant nil guard in isTLSHandshakeError errors.As already returns false on a nil error, so the explicit early-out is unnecessary. --------- (cherry picked from commit 52f9cbe) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ycombinator
added a commit
that referenced
this pull request
Apr 8, 2026
* Retry on TLS handshake errors in ES client When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. * Regenerate changelog fragment via elastic-agent-changelog-tool * Drop redundant nil guard in isTLSHandshakeError errors.As already returns false on a nil error, so the explicit early-out is unnecessary. --------- (cherry picked from commit 52f9cbe) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ycombinator
added a commit
that referenced
this pull request
Apr 8, 2026
) * Retry on TLS handshake errors in ES client (#6767) * Retry on TLS handshake errors in ES client When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Regenerate changelog fragment via elastic-agent-changelog-tool Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Drop redundant nil guard in isTLSHandshakeError errors.As already returns false on a nil error, so the explicit early-out is unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 52f9cbe) # Conflicts: # internal/pkg/es/client.go * Resolve 8.19 backport conflicts for #6767 The 8.19 es/client.go has none of the applyDefaultOptions retry infrastructure from main, so the cherry-pick left conflict markers around imports, applyDefaultOptions, and the new helpers. Drop the applyDefaultOptions block (depends on backoff/syscall/WithRetryOnStatus which don't exist on this branch), keep WithRetryOnErrs, WithRetryOnTLSHandshakeError, and isTLSHandshakeError, and wire WithRetryOnTLSHandshakeError into the elasticsearchOptions slices in server/fleet.go and bulk/engine.go so the fix actually takes effect on this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What is the problem this PR solves?
When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs (or a temporarily untrusted host coexists with a trusted one), a TLS handshake failure against one host aborts the request outright — even though the underlying connection pool's dead-host failover would otherwise route the next attempt to a healthy host.
The pool already marks the failed host dead via
OnFailureunconditionally on any transport error (elastic-transport-go/v8.8.0 elastictransport.go#L420), butPerform()only loops to the next host ifRetryOnErrorreturns true (elastic-transport-go/v8.8.0 elastictransport.go#L424). The existing fleet-server classifier only coveredECONNREFUSED/ECONNRESET(internal/pkg/es/client.go#L42), so cert verification errors leaked out and — for the startup data migrations ininternal/pkg/dl/migration.go, which callclient.UpdateByQueryonce with no per-host retry of their own — crashed the process in a supervisor restart loop.This was hit in the field on a Fleet Server 9.2.3 deployment whose policy carried two ES output hosts but whose trust store only contained the ECK-injected CA for one of them; the migration kept landing on the untrusted host via round-robin.
How does this PR solve the problem?
Adds a new
WithRetryOnTLSHandshakeErrorConfigOption, applied insideapplyDefaultOptions, which extends theRetryOnErrorclassifier to return true for*tls.CertificateVerificationErrorunwrapped viaerrors.As(covering wrapping by*url.Error,fmt.Errorf, etc.).The option composes with any previously-installed
RetryOnErrorpredicate via OR semantics —WithRetryOnErrs(ECONNREFUSED, ECONNRESET)is still honored — so it only widens the set of retried errors and never clobbers an existing classifier.With this in place, on a TLS handshake error against host A:
OnFailure(existing behavior).RetryOnErrornow returns true, soPerform()loops.pool.Next()is called again; A is no longer in the live list, so the request goes to B.This is a no-op when
ssl.verification_mode: noneis configured, since that path setsInsecureSkipVerify=trueandcrypto/tlsnever produces a*tls.CertificateVerificationError. The option does not bypass or override that setting.It is also bounded by the existing
MaxRetriessetting (5), so a single-host pool with a permanently bad cert still terminates — at most one extra retry attempt before giving up.How to test this PR locally
Unit tests cover both pieces:
TestIsTLSHandshakeError— table-driven, exercises the unwrap chain: nil, unrelated error,syscall.ECONNREFUSED, direct*tls.CertificateVerificationError, wrapped viafmt.Errorf %w, wrapped via*url.Error(mirrors hownet/httpactually surfaces it), and doubly wrapped (fmt.Errorfaround*url.Error).TestWithRetryOnTLSHandshakeError— verifies (a) the option works standalone, and (b) when layered on top ofWithRetryOnErrs(syscall.ECONNREFUSED), both predicates fire (OR semantics), and neither matchesECONNRESET.End-to-end manual reproduction: configure a Fleet Server policy with two ES output hosts where only one CA is in the trust store, and observe that startup migrations no longer crash the process when the untrusted host is picked first by round-robin.
Design Checklist
Checklist
./changelog/fragmentsusing the changelog tool