Retry on TLS handshake errors in ES client by ycombinator · Pull Request #6767 · elastic/fleet-server

ycombinator · 2026-04-07T19:58:27Z

What is the problem this PR solves?

When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs (or a temporarily untrusted host coexists with a trusted one), a TLS handshake failure against one host aborts the request outright — even though the underlying connection pool's dead-host failover would otherwise route the next attempt to a healthy host.

The pool already marks the failed host dead via OnFailure unconditionally on any transport error (elastic-transport-go/v8.8.0 elastictransport.go#L420), but Perform() only loops to the next host if RetryOnError returns true (elastic-transport-go/v8.8.0 elastictransport.go#L424). The existing fleet-server classifier only covered ECONNREFUSED/ECONNRESET (internal/pkg/es/client.go#L42), so cert verification errors leaked out and — for the startup data migrations in internal/pkg/dl/migration.go, which call client.UpdateByQuery once with no per-host retry of their own — crashed the process in a supervisor restart loop.

This was hit in the field on a Fleet Server 9.2.3 deployment whose policy carried two ES output hosts but whose trust store only contained the ECK-injected CA for one of them; the migration kept landing on the untrusted host via round-robin.

How does this PR solve the problem?

Adds a new WithRetryOnTLSHandshakeError ConfigOption, applied inside applyDefaultOptions, which extends the RetryOnError classifier to return true for *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error, fmt.Errorf, etc.).

The option composes with any previously-installed RetryOnError predicate via OR semantics — WithRetryOnErrs(ECONNREFUSED, ECONNRESET) is still honored — so it only widens the set of retried errors and never clobbers an existing classifier.

With this in place, on a TLS handshake error against host A:

The pool marks A dead via OnFailure (existing behavior).
RetryOnError now returns true, so Perform() loops.
pool.Next() is called again; A is no longer in the live list, so the request goes to B.

This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify=true and crypto/tls never produces a *tls.CertificateVerificationError. The option does not bypass or override that setting.

It is also bounded by the existing MaxRetries setting (5), so a single-host pool with a permanently bad cert still terminates — at most one extra retry attempt before giving up.

How to test this PR locally

Unit tests cover both pieces:

go test ./internal/pkg/es/ -run 'TestIsTLSHandshakeError|TestWithRetryOnTLSHandshakeError' -v

TestIsTLSHandshakeError — table-driven, exercises the unwrap chain: nil, unrelated error, syscall.ECONNREFUSED, direct *tls.CertificateVerificationError, wrapped via fmt.Errorf %w, wrapped via *url.Error (mirrors how net/http actually surfaces it), and doubly wrapped (fmt.Errorf around *url.Error).
TestWithRetryOnTLSHandshakeError — verifies (a) the option works standalone, and (b) when layered on top of WithRetryOnErrs(syscall.ECONNREFUSED), both predicates fire (OR semantics), and neither matches ECONNRESET.

End-to-end manual reproduction: configure a Fleet Server policy with two ES output hosts where only one CA is in the trust store, and observe that startup migrations no longer crash the process when the untrusted host is picked first by round-robin.

Design Checklist

I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

I have commented my code, particularly in hard-to-understand areas
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool

mergify · 2026-04-07T19:59:22Z

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit
backport-active-all is the label that automatically backports to all active branches.
backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

michel-laterman

LGTM. We should consider replaceing isTLSHandshakeError with a more generic call in the future

When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

errors.As already returns false on a nil error, so the explicit early-out is unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-04-08T17:59:05Z

@Mergifyio backport 8.19 9.3 9.4

mergify · 2026-04-08T17:59:19Z

backport 8.19 9.3 9.4

✅ Backports have been created

Details

#6780 [8.19](backport #6767) Retry on TLS handshake errors in ES client has been created for branch 8.19 but encountered conflicts

Cherry-pick of 52f9cbe has failed:

On branch mergify/bp/8.19/pr-6767
Your branch is up to date with 'origin/8.19'.

You are currently cherry-picking commit 52f9cbe.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   changelog/fragments/1775592233-retry-on-tls-handshake-errors-in-es-client.yaml
	modified:   internal/pkg/es/client_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   internal/pkg/es/client.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

#6781 [9.3](backport #6767) Retry on TLS handshake errors in ES client has been created for branch 9.3
#6782 [9.4](backport #6767) Retry on TLS handshake errors in ES client has been created for branch 9.4

* Retry on TLS handshake errors in ES client When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Regenerate changelog fragment via elastic-agent-changelog-tool Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Drop redundant nil guard in isTLSHandshakeError errors.As already returns false on a nil error, so the explicit early-out is unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 52f9cbe) # Conflicts: # internal/pkg/es/client.go

* Retry on TLS handshake errors in ES client When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Regenerate changelog fragment via elastic-agent-changelog-tool Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Drop redundant nil guard in isTLSHandshakeError errors.As already returns false on a nil error, so the explicit early-out is unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 52f9cbe)

The 8.19 es/client.go has none of the applyDefaultOptions retry infrastructure from main, so the cherry-pick left conflict markers around imports, applyDefaultOptions, and the new helpers. Drop the applyDefaultOptions block (depends on backoff/syscall/WithRetryOnStatus which don't exist on this branch), keep WithRetryOnErrs, WithRetryOnTLSHandshakeError, and isTLSHandshakeError, and wire WithRetryOnTLSHandshakeError into the elasticsearchOptions slices in server/fleet.go and bulk/engine.go so the fix actually takes effect on this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Retry on TLS handshake errors in ES client When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. * Regenerate changelog fragment via elastic-agent-changelog-tool * Drop redundant nil guard in isTLSHandshakeError errors.As already returns false on a nil error, so the explicit early-out is unnecessary. --------- (cherry picked from commit 52f9cbe) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

) * Retry on TLS handshake errors in ES client (#6767) * Retry on TLS handshake errors in ES client When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs, a TLS handshake failure against one host aborts the request even though the connection pool's dead-host failover would otherwise route the next attempt to a healthy host. The pool already marks the failed host dead via OnFailure unconditionally, but Perform() only loops to the next host if RetryOnError returns true, and the existing classifier in fleet-server only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked out and (for startup data migrations against a multi-host output) crashed the process in a restart loop. Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions via OR semantics so it widens the set of retried errors without clobbering WithRetryOnErrs. The new predicate matches *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error / fmt.Errorf). This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify and never produces a *tls.CertificateVerificationError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Regenerate changelog fragment via elastic-agent-changelog-tool Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Drop redundant nil guard in isTLSHandshakeError errors.As already returns false on a nil error, so the explicit early-out is unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 52f9cbe) # Conflicts: # internal/pkg/es/client.go * Resolve 8.19 backport conflicts for #6767 The 8.19 es/client.go has none of the applyDefaultOptions retry infrastructure from main, so the cherry-pick left conflict markers around imports, applyDefaultOptions, and the new helpers. Drop the applyDefaultOptions block (depends on backoff/syscall/WithRetryOnStatus which don't exist on this branch), keep WithRetryOnErrs, WithRetryOnTLSHandshakeError, and isTLSHandshakeError, and wire WithRetryOnTLSHandshakeError into the elasticsearchOptions slices in server/fleet.go and bulk/engine.go so the fix actually takes effect on this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

ycombinator requested a review from a team as a code owner April 7, 2026 19:58

ycombinator requested review from blakerouse and michel-laterman April 7, 2026 19:58

ycombinator added bug Something isn't working backport-active-all Automated backport with mergify to all the active branches Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Apr 7, 2026

michel-laterman approved these changes Apr 8, 2026

View reviewed changes

ycombinator and others added 3 commits April 8, 2026 09:48

Regenerate changelog fragment via elastic-agent-changelog-tool

da881bb

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Drop redundant nil guard in isTLSHandshakeError

73859a4

errors.As already returns false on a nil error, so the explicit early-out is unnecessary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ycombinator force-pushed the retry-on-tls-errors branch from 928c3c8 to 73859a4 Compare April 8, 2026 16:48

ycombinator enabled auto-merge (squash) April 8, 2026 17:43

ycombinator merged commit 52f9cbe into elastic:main Apr 8, 2026
10 checks passed

mergify Bot mentioned this pull request Apr 8, 2026

[8.19](backport #6767) Retry on TLS handshake errors in ES client #6780

Merged

6 tasks

mergify Bot mentioned this pull request Apr 8, 2026

[9.3](backport #6767) Retry on TLS handshake errors in ES client #6781

Merged

6 tasks

mergify Bot mentioned this pull request Apr 8, 2026

[9.4](backport #6767) Retry on TLS handshake errors in ES client #6782

Merged

6 tasks

ycombinator deleted the retry-on-tls-errors branch April 8, 2026 18:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry on TLS handshake errors in ES client#6767

Retry on TLS handshake errors in ES client#6767
ycombinator merged 3 commits intoelastic:mainfrom
ycombinator:retry-on-tls-errors

ycombinator commented Apr 7, 2026 •

edited

Loading

mergify Bot commented Apr 7, 2026

michel-laterman left a comment

Uh oh!

github-actions Bot commented Apr 8, 2026

mergify Bot commented Apr 8, 2026 •

edited

Loading

Labels

2 participants

Conversation

ycombinator commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What is the problem this PR solves?

How does this PR solve the problem?

How to test this PR locally

Design Checklist

Checklist

mergify Bot commented Apr 7, 2026

michel-laterman left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 8, 2026

mergify Bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Backports have been created

Labels

2 participants

ycombinator commented Apr 7, 2026 •

edited

Loading

mergify Bot commented Apr 8, 2026 •

edited

Loading