Skip to content

Retry on TLS handshake errors in ES client#6767

Merged
ycombinator merged 3 commits intoelastic:mainfrom
ycombinator:retry-on-tls-errors
Apr 8, 2026
Merged

Retry on TLS handshake errors in ES client#6767
ycombinator merged 3 commits intoelastic:mainfrom
ycombinator:retry-on-tls-errors

Conversation

@ycombinator
Copy link
Copy Markdown
Contributor

@ycombinator ycombinator commented Apr 7, 2026

What is the problem this PR solves?

When the Elasticsearch output is configured with multiple hosts whose certificates chain to different CAs (or a temporarily untrusted host coexists with a trusted one), a TLS handshake failure against one host aborts the request outright — even though the underlying connection pool's dead-host failover would otherwise route the next attempt to a healthy host.

The pool already marks the failed host dead via OnFailure unconditionally on any transport error (elastic-transport-go/v8.8.0 elastictransport.go#L420), but Perform() only loops to the next host if RetryOnError returns true (elastic-transport-go/v8.8.0 elastictransport.go#L424). The existing fleet-server classifier only covered ECONNREFUSED/ECONNRESET (internal/pkg/es/client.go#L42), so cert verification errors leaked out and — for the startup data migrations in internal/pkg/dl/migration.go, which call client.UpdateByQuery once with no per-host retry of their own — crashed the process in a supervisor restart loop.

This was hit in the field on a Fleet Server 9.2.3 deployment whose policy carried two ES output hosts but whose trust store only contained the ECK-injected CA for one of them; the migration kept landing on the untrusted host via round-robin.

How does this PR solve the problem?

Adds a new WithRetryOnTLSHandshakeError ConfigOption, applied inside applyDefaultOptions, which extends the RetryOnError classifier to return true for *tls.CertificateVerificationError unwrapped via errors.As (covering wrapping by *url.Error, fmt.Errorf, etc.).

The option composes with any previously-installed RetryOnError predicate via OR semantics — WithRetryOnErrs(ECONNREFUSED, ECONNRESET) is still honored — so it only widens the set of retried errors and never clobbers an existing classifier.

With this in place, on a TLS handshake error against host A:

  1. The pool marks A dead via OnFailure (existing behavior).
  2. RetryOnError now returns true, so Perform() loops.
  3. pool.Next() is called again; A is no longer in the live list, so the request goes to B.

This is a no-op when ssl.verification_mode: none is configured, since that path sets InsecureSkipVerify=true and crypto/tls never produces a *tls.CertificateVerificationError. The option does not bypass or override that setting.

It is also bounded by the existing MaxRetries setting (5), so a single-host pool with a permanently bad cert still terminates — at most one extra retry attempt before giving up.

How to test this PR locally

Unit tests cover both pieces:

go test ./internal/pkg/es/ -run 'TestIsTLSHandshakeError|TestWithRetryOnTLSHandshakeError' -v
  • TestIsTLSHandshakeError — table-driven, exercises the unwrap chain: nil, unrelated error, syscall.ECONNREFUSED, direct *tls.CertificateVerificationError, wrapped via fmt.Errorf %w, wrapped via *url.Error (mirrors how net/http actually surfaces it), and doubly wrapped (fmt.Errorf around *url.Error).
  • TestWithRetryOnTLSHandshakeError — verifies (a) the option works standalone, and (b) when layered on top of WithRetryOnErrs(syscall.ECONNREFUSED), both predicates fire (OR semantics), and neither matches ECONNRESET.

End-to-end manual reproduction: configure a Fleet Server policy with two ES output hosts where only one CA is in the trust store, and observe that startup migrations no longer crash the process when the untrusted host is picked first by round-robin.

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
@ycombinator ycombinator requested a review from a team as a code owner April 7, 2026 19:58
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 7, 2026

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.
@ycombinator ycombinator added bug Something isn't working backport-active-all Automated backport with mergify to all the active branches Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Apr 7, 2026
Copy link
Copy Markdown
Contributor

@michel-laterman michel-laterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We should consider replaceing isTLSHandshakeError with a more generic call in the future

ycombinator and others added 3 commits April 8, 2026 09:48
When the Elasticsearch output is configured with multiple hosts whose
certificates chain to different CAs, a TLS handshake failure against
one host aborts the request even though the connection pool's
dead-host failover would otherwise route the next attempt to a healthy
host. The pool already marks the failed host dead via OnFailure
unconditionally, but Perform() only loops to the next host if
RetryOnError returns true, and the existing classifier in fleet-server
only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked
out and (for startup data migrations against a multi-host output)
crashed the process in a restart loop.

Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions
via OR semantics so it widens the set of retried errors without
clobbering WithRetryOnErrs. The new predicate matches
*tls.CertificateVerificationError unwrapped via errors.As (covering
wrapping by *url.Error / fmt.Errorf).

This is a no-op when ssl.verification_mode: none is configured, since
that path sets InsecureSkipVerify and never produces a
*tls.CertificateVerificationError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
errors.As already returns false on a nil error, so the explicit
early-out is unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ycombinator ycombinator force-pushed the retry-on-tls-errors branch from 928c3c8 to 73859a4 Compare April 8, 2026 16:48
@ycombinator ycombinator enabled auto-merge (squash) April 8, 2026 17:43
@ycombinator ycombinator merged commit 52f9cbe into elastic:main Apr 8, 2026
10 checks passed
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 8, 2026

@Mergifyio backport 8.19 9.3 9.4

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 8, 2026

backport 8.19 9.3 9.4

✅ Backports have been created

Details

Cherry-pick of 52f9cbe has failed:

On branch mergify/bp/8.19/pr-6767
Your branch is up to date with 'origin/8.19'.

You are currently cherry-picking commit 52f9cbe.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   changelog/fragments/1775592233-retry-on-tls-handshake-errors-in-es-client.yaml
	modified:   internal/pkg/es/client_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   internal/pkg/es/client.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

mergify Bot pushed a commit that referenced this pull request Apr 8, 2026
* Retry on TLS handshake errors in ES client

When the Elasticsearch output is configured with multiple hosts whose
certificates chain to different CAs, a TLS handshake failure against
one host aborts the request even though the connection pool's
dead-host failover would otherwise route the next attempt to a healthy
host. The pool already marks the failed host dead via OnFailure
unconditionally, but Perform() only loops to the next host if
RetryOnError returns true, and the existing classifier in fleet-server
only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked
out and (for startup data migrations against a multi-host output)
crashed the process in a restart loop.

Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions
via OR semantics so it widens the set of retried errors without
clobbering WithRetryOnErrs. The new predicate matches
*tls.CertificateVerificationError unwrapped via errors.As (covering
wrapping by *url.Error / fmt.Errorf).

This is a no-op when ssl.verification_mode: none is configured, since
that path sets InsecureSkipVerify and never produces a
*tls.CertificateVerificationError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Regenerate changelog fragment via elastic-agent-changelog-tool

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Drop redundant nil guard in isTLSHandshakeError

errors.As already returns false on a nil error, so the explicit
early-out is unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 52f9cbe)

# Conflicts:
#	internal/pkg/es/client.go
mergify Bot pushed a commit that referenced this pull request Apr 8, 2026
* Retry on TLS handshake errors in ES client

When the Elasticsearch output is configured with multiple hosts whose
certificates chain to different CAs, a TLS handshake failure against
one host aborts the request even though the connection pool's
dead-host failover would otherwise route the next attempt to a healthy
host. The pool already marks the failed host dead via OnFailure
unconditionally, but Perform() only loops to the next host if
RetryOnError returns true, and the existing classifier in fleet-server
only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked
out and (for startup data migrations against a multi-host output)
crashed the process in a restart loop.

Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions
via OR semantics so it widens the set of retried errors without
clobbering WithRetryOnErrs. The new predicate matches
*tls.CertificateVerificationError unwrapped via errors.As (covering
wrapping by *url.Error / fmt.Errorf).

This is a no-op when ssl.verification_mode: none is configured, since
that path sets InsecureSkipVerify and never produces a
*tls.CertificateVerificationError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Regenerate changelog fragment via elastic-agent-changelog-tool

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Drop redundant nil guard in isTLSHandshakeError

errors.As already returns false on a nil error, so the explicit
early-out is unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 52f9cbe)
mergify Bot pushed a commit that referenced this pull request Apr 8, 2026
* Retry on TLS handshake errors in ES client

When the Elasticsearch output is configured with multiple hosts whose
certificates chain to different CAs, a TLS handshake failure against
one host aborts the request even though the connection pool's
dead-host failover would otherwise route the next attempt to a healthy
host. The pool already marks the failed host dead via OnFailure
unconditionally, but Perform() only loops to the next host if
RetryOnError returns true, and the existing classifier in fleet-server
only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked
out and (for startup data migrations against a multi-host output)
crashed the process in a restart loop.

Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions
via OR semantics so it widens the set of retried errors without
clobbering WithRetryOnErrs. The new predicate matches
*tls.CertificateVerificationError unwrapped via errors.As (covering
wrapping by *url.Error / fmt.Errorf).

This is a no-op when ssl.verification_mode: none is configured, since
that path sets InsecureSkipVerify and never produces a
*tls.CertificateVerificationError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Regenerate changelog fragment via elastic-agent-changelog-tool

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Drop redundant nil guard in isTLSHandshakeError

errors.As already returns false on a nil error, so the explicit
early-out is unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 52f9cbe)
@ycombinator ycombinator deleted the retry-on-tls-errors branch April 8, 2026 18:11
ycombinator added a commit that referenced this pull request Apr 8, 2026
The 8.19 es/client.go has none of the applyDefaultOptions retry
infrastructure from main, so the cherry-pick left conflict markers
around imports, applyDefaultOptions, and the new helpers. Drop the
applyDefaultOptions block (depends on backoff/syscall/WithRetryOnStatus
which don't exist on this branch), keep WithRetryOnErrs,
WithRetryOnTLSHandshakeError, and isTLSHandshakeError, and wire
WithRetryOnTLSHandshakeError into the elasticsearchOptions slices in
server/fleet.go and bulk/engine.go so the fix actually takes effect on
this branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ycombinator added a commit that referenced this pull request Apr 8, 2026
* Retry on TLS handshake errors in ES client

When the Elasticsearch output is configured with multiple hosts whose
certificates chain to different CAs, a TLS handshake failure against
one host aborts the request even though the connection pool's
dead-host failover would otherwise route the next attempt to a healthy
host. The pool already marks the failed host dead via OnFailure
unconditionally, but Perform() only loops to the next host if
RetryOnError returns true, and the existing classifier in fleet-server
only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked
out and (for startup data migrations against a multi-host output)
crashed the process in a restart loop.

Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions
via OR semantics so it widens the set of retried errors without
clobbering WithRetryOnErrs. The new predicate matches
*tls.CertificateVerificationError unwrapped via errors.As (covering
wrapping by *url.Error / fmt.Errorf).

This is a no-op when ssl.verification_mode: none is configured, since
that path sets InsecureSkipVerify and never produces a
*tls.CertificateVerificationError.



* Regenerate changelog fragment via elastic-agent-changelog-tool



* Drop redundant nil guard in isTLSHandshakeError

errors.As already returns false on a nil error, so the explicit
early-out is unnecessary.



---------


(cherry picked from commit 52f9cbe)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ycombinator added a commit that referenced this pull request Apr 8, 2026
* Retry on TLS handshake errors in ES client

When the Elasticsearch output is configured with multiple hosts whose
certificates chain to different CAs, a TLS handshake failure against
one host aborts the request even though the connection pool's
dead-host failover would otherwise route the next attempt to a healthy
host. The pool already marks the failed host dead via OnFailure
unconditionally, but Perform() only loops to the next host if
RetryOnError returns true, and the existing classifier in fleet-server
only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked
out and (for startup data migrations against a multi-host output)
crashed the process in a restart loop.

Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions
via OR semantics so it widens the set of retried errors without
clobbering WithRetryOnErrs. The new predicate matches
*tls.CertificateVerificationError unwrapped via errors.As (covering
wrapping by *url.Error / fmt.Errorf).

This is a no-op when ssl.verification_mode: none is configured, since
that path sets InsecureSkipVerify and never produces a
*tls.CertificateVerificationError.



* Regenerate changelog fragment via elastic-agent-changelog-tool



* Drop redundant nil guard in isTLSHandshakeError

errors.As already returns false on a nil error, so the explicit
early-out is unnecessary.



---------


(cherry picked from commit 52f9cbe)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ycombinator added a commit that referenced this pull request Apr 8, 2026
)

* Retry on TLS handshake errors in ES client (#6767)

* Retry on TLS handshake errors in ES client

When the Elasticsearch output is configured with multiple hosts whose
certificates chain to different CAs, a TLS handshake failure against
one host aborts the request even though the connection pool's
dead-host failover would otherwise route the next attempt to a healthy
host. The pool already marks the failed host dead via OnFailure
unconditionally, but Perform() only loops to the next host if
RetryOnError returns true, and the existing classifier in fleet-server
only covered ECONNREFUSED/ECONNRESET. Cert verification errors leaked
out and (for startup data migrations against a multi-host output)
crashed the process in a restart loop.

Add WithRetryOnTLSHandshakeError, composed into applyDefaultOptions
via OR semantics so it widens the set of retried errors without
clobbering WithRetryOnErrs. The new predicate matches
*tls.CertificateVerificationError unwrapped via errors.As (covering
wrapping by *url.Error / fmt.Errorf).

This is a no-op when ssl.verification_mode: none is configured, since
that path sets InsecureSkipVerify and never produces a
*tls.CertificateVerificationError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Regenerate changelog fragment via elastic-agent-changelog-tool

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Drop redundant nil guard in isTLSHandshakeError

errors.As already returns false on a nil error, so the explicit
early-out is unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 52f9cbe)

# Conflicts:
#	internal/pkg/es/client.go

* Resolve 8.19 backport conflicts for #6767

The 8.19 es/client.go has none of the applyDefaultOptions retry
infrastructure from main, so the cherry-pick left conflict markers
around imports, applyDefaultOptions, and the new helpers. Drop the
applyDefaultOptions block (depends on backoff/syscall/WithRetryOnStatus
which don't exist on this branch), keep WithRetryOnErrs,
WithRetryOnTLSHandshakeError, and isTLSHandshakeError, and wire
WithRetryOnTLSHandshakeError into the elasticsearchOptions slices in
server/fleet.go and bulk/engine.go so the fix actually takes effect on
this branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-active-all Automated backport with mergify to all the active branches bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

2 participants