Skip to content

Detect stale CA in certificate chain and trigger certificates reissuance#9197

Merged
pkoutsovasilis merged 9 commits intoelastic:mainfrom
pkoutsovasilis:fix/ca_rotation_same_private
Mar 9, 2026
Merged

Detect stale CA in certificate chain and trigger certificates reissuance#9197
pkoutsovasilis merged 9 commits intoelastic:mainfrom
pkoutsovasilis:fix/ca_rotation_same_private

Conversation

@pkoutsovasilis
Copy link
Copy Markdown
Contributor

@pkoutsovasilis pkoutsovasilis commented Mar 4, 2026

Overview

This PR updates certificate reconciliation to detect CA rotations where the CA is reissued with the same private key. In this scenario, x509.Verify() alone still succeeds (valid signature), but the CA certificate embedded in the chain becomes stale.

Changes

  • Transport cert reconciliation: Update shouldIssueNewCertificate() to extract both leaf and CA certs from the chain and trigger reissuance when the CA cert in the chain differs from the current CA.
  • HTTP cert reconciliation: Update getHTTPCertificate() to invalidate and reissue when the CA cert in the chain differs from the current CA.
@pkoutsovasilis pkoutsovasilis self-assigned this Mar 4, 2026
@prodsecmachine
Copy link
Copy Markdown
Collaborator

prodsecmachine commented Mar 4, 2026

Snyk checks have passed. No issues have been found so far.

Status Scanner Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@botelastic botelastic Bot added the triage label Mar 4, 2026
@pkoutsovasilis pkoutsovasilis force-pushed the fix/ca_rotation_same_private branch 2 times, most recently from 26d29a3 to 0311668 Compare March 5, 2026 06:53
@pkoutsovasilis pkoutsovasilis requested a review from Copilot March 5, 2026 06:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates certificate reconciliation to detect CA rotations where the CA is reissued with the same private key (a case where x509.Verify() alone may still succeed), and adds regression tests to ensure leaf certificates are reissued when the CA certificate itself changes.

Changes:

  • Update transport cert reconciliation to extract both leaf and CA certs from the chain and reissue when the CA cert in the chain differs from the current CA.
  • Update HTTP cert reconciliation to invalidate/reissue when the CA cert in the chain differs from the current CA.
  • Add unit tests covering CA rotation with the same private key.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/controller/elasticsearch/certificates/transport/reconcile.go Switches to new extraction helper returning leaf+CA (leaf used for rotation timing).
pkg/controller/elasticsearch/certificates/transport/pod_secret.go Adds CA-in-chain equality check and replaces leaf-only extraction with leaf+CA extraction helper.
pkg/controller/elasticsearch/certificates/transport/pod_secret_test.go Adds a regression unit test for CA rotation with the same private key.
pkg/controller/common/certificates/http_reconcile.go Adds CA-in-chain equality check when deciding whether an existing HTTP cert is valid.
pkg/controller/common/certificates/http_reconcile_test.go Updates/strengthens internal HTTP cert reconcile test expectations for CA rotation with same key.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread pkg/controller/elasticsearch/certificates/transport/pod_secret.go Outdated
Comment thread pkg/controller/common/certificates/http_reconcile.go Outdated
Comment thread pkg/controller/common/certificates/http_reconcile.go
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@pkoutsovasilis pkoutsovasilis added >bug Something isn't working v3.4.0 labels Mar 5, 2026
@botelastic botelastic Bot removed the triage label Mar 5, 2026
@pkoutsovasilis pkoutsovasilis linked an issue Mar 5, 2026 that may be closed by this pull request
@pkoutsovasilis pkoutsovasilis changed the title [wip] add unit-test for CA rotation using the same private key Mar 5, 2026
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review March 5, 2026 07:18
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner March 5, 2026 07:18
@pebrc
Copy link
Copy Markdown
Collaborator

pebrc commented Mar 5, 2026

Thanks for the fix — the root cause analysis (Go 1.25 changing SKI computation from SHA-1 to truncated SHA-256 breaking Java's AKI→SKI PKIX validation) is clear and the approach is solid.

One thing I wanted to raise: have you considered pinning the Subject Key Identifier (SKI) during CA renewal instead of (or in addition to) triggering full leaf cert reissuance?

In NewSelfSignedCA, the certificate template doesn't set SubjectKeyId, so Go computes it automatically — which is where the cross-version breakage comes from. But x509.CreateCertificate uses SubjectKeyId as-is if it's set on the template. So renewCAFromExisting could carry forward the old CA's SKI:

// In CABuilderOptions:
SubjectKeyId []byte

// In NewSelfSignedCA, on the template:
SubjectKeyId: options.SubjectKeyId,

// In renewCAFromExisting:
SubjectKeyId: oldCA.Cert.SubjectKeyId,

This would keep the SKI stable across Go versions, so existing leaf certs' AKI→SKI chains remain valid and no reissuance is needed. You would obviously also have to verify that SKI has not changed when checking certificates for validity.

The reason I bring this up is the transient disruption window during hot-reload. With the current PR, when the CA rotates and all leaf certs are reissued simultaneously, there's a window (up to kubelet sync interval, ~1 minute) where some pods have reloaded the new certs and others haven't. During that window, pods that have reloaded (new leaf cert with AKI→new_SKI, trusting new CA) and pods that haven't (old leaf cert with AKI→old_SKI, trusting old CA) cannot validate each other's certs in either direction — both A→B and B→A fail Java's PKIX check. With the byte-swap approach at least one direction still worked during that window.

Pinning the SKI avoids this entirely since all existing certs remain valid throughout the transition.

The tradeoff is that the byte-swap leaves a technically imprecise state (leaf signed by old CA, chain contains new CA), but that state is valid from both Go's and Java's perspective as long as AKI→SKI chains correctly and the key hasn't changed — which SKI pinning guarantees.

Thoughts?

@pkoutsovasilis
Copy link
Copy Markdown
Contributor Author

Thanks @pebrc, these are good points to consider.

On SKI pinning

I looked at the code and renewCAFromExisting only applies to ECK-managed self-signed CAs. For custom CA secrets (provided via spec.transport.tls.certificate or spec.http.tls.certificate), ECK doesn't manage rotation at all - it just uses whatever CA the user provides. If a user rotates their custom CA (even with the same private key - although unlikely), we have no control over the SKI. The fix of this PR handles both scenarios uniformly: whether the CA changed because ECK rotated it, or because the user updated their custom CA secret, we detect the mismatch and reissue. AFAICT SKI pinning would only address the ECK-managed case.

On relying on validator behaviour

This is my main concern with the byte-swap approach. We'd be leaving a state where the leaf cert's issuer (old CA) differs from the CA cert in the chain (new CA), relying on validators to only check cryptographic signatures rather than certificate identity. While the SKI pinning would work for Go and Java today, should we bet on it?! (maybe I am being too defensive here)

On the transient disruption window

This is a valid concern only SKI pinning solves it

All the above said, maybe a solution is a hybrid approach? SKI pinning for ECK-generated certificate in addition to the CA matching checks for user provided secrets. Thoughts?

@pebrc
Copy link
Copy Markdown
Collaborator

pebrc commented Mar 5, 2026

While the SKI pinning would work for Go and Java today, should we bet on it?! (maybe I am being too defensive here)

I think your concern is fair but AKI/SKI chaining is defined in the RFC so it is not an implemenation quirk but something we can build upon? I would expect any conformant PKIX implementation to accept it if the identifiers match.

So in my mind the minimal fix is maybe the extended validation with AKI/SKI comparison.

SKI pinning is not strictly necessary as the extended validation would already cover the bug (but maybe cheap IIUC it's 3 loc ), wdyt?

If we want to keep the full comparison we should probably remove the old byte-swap approach completely.

@pkoutsovasilis
Copy link
Copy Markdown
Contributor Author

Thanks @pebrc , I've updated the PR to incorporate your suggestions:

  1. SKI Pinning - renewCAFromExisting now passes the existing CA's SubjectKeyId to preserve SKI stability during ECK-managed CA renewals
  2. SKI Comparison - Added CAMatch function that compares Subject Key Identifiers to validate certificate chains. Falls back to full certificate equality if SKI is not available
  3. Removed byte-swap - Removed the code that updated the CA in the certificate chain without reissuing the leaf certificate

AFAICT this approach:

  • Avoids the transient disruption window for ECK-managed CA rotations (SKI remains stable)
  • Triggers reissuance only when SKI doesn't match (custom CA rotation where SKI changed)
  • Applies consistently to both transport and HTTP certificates

wdyt?

@pkoutsovasilis pkoutsovasilis force-pushed the fix/ca_rotation_same_private branch from 9c890a3 to 41d63ae Compare March 5, 2026 15:04
Comment thread pkg/controller/elasticsearch/certificates/transport/pod_secret.go
@pebrc
Copy link
Copy Markdown
Collaborator

pebrc commented Mar 5, 2026

One concern with the CAMatch check: it compares the CA cert in the chain's SKI against the current CA's SKI. This answers "was the CA in the chain updated?" but not "is the leaf cert compatible with the current CA?"

These diverge in the pre-existing byte-swapped state — i.e. clusters already affected by the original bug:

  1. Old ECK code (pre-PR) ran on Go 1.25, CA was renewed, byte-swap executed
  2. State is now: tls.crt = [leaf (AKI=old_SKI), new_CA (SKI=new_SKI)], ca.crt = [new_CA]
  3. Java PKIX is already broken (leaf's AKI doesn't match trust anchor's SKI)

When the PR is deployed on this cluster:

  • caCertInChain = new_CA (placed there by the old byte-swap)
  • CAMatch(new_CA, new_CA)true (literally the same cert)
  • x509.Verify passes (Go is lenient)
  • No reissuance → the broken state persists until the leaf cert naturally expires

The fix would be to check the leaf cert's AuthorityKeyId against the current CA's SubjectKeyId instead of (or in addition to) comparing the two CA certs:

if len(certificate.AuthorityKeyId) > 0 && len(ca.Cert.SubjectKeyId) > 0 {
    if !bytes.Equal(certificate.AuthorityKeyId, ca.Cert.SubjectKeyId) {
        // AKI→SKI chain is broken, leaf was signed by a CA with a different SKI
        return nil
    }
}

This directly checks whether the leaf cert is compatible with the current CA. In the byte-swapped state, certificate.AuthorityKeyId (old_SKI) ≠ ca.Cert.SubjectKeyId (new_SKI) → reissuance triggered → broken state remediated immediately.

The CAMatch check could still be useful as a secondary safeguard (e.g. detecting when the chain CA cert itself is stale before it gets byte-swapped), but the AKI→SKI check on the leaf cert is the one that catches the actual incompatibility that breaks Java's PKIX validation.

So the recommendation would be: AKI→SKI check on the leaf cert (primary detection) + SKI pinning (prevention for future rotations). The CAMatch CA-to-CA comparison can be dropped or kept as defense-in-depth — it doesn't hurt, but it doesn't catch the critical case on its own.

@pkoutsovasilis
Copy link
Copy Markdown
Contributor Author

Thanks for catching that @pebrc - I had zoned out and was clearly focusing only on comparing the two CAs rather than what actually matters: whether the leaf cert is compatible with the current CA. Back on track now.

I've updated the implementation to follow your recommendation:

  • Primary detection (AKI→SKI check on leaf cert)
  • Replaced CAMatch with CertIsSignedByCA(leafCert, currentCA) which:
    1. Verifies the cryptographic signature via CheckSignatureFrom()
    2. Validates that leafCert.AuthorityKeyId matches currentCA.SubjectKeyId

This catches the pre-existing byte-swapped state immediately - leaf's AKI (old_SKI) ≠ current CA's SKI (new_SKI) → reissuance triggered.

  • Prevention (SKI pinning): renewCAFromExisting() continues to pass the existing CA's SubjectKeyId to keep SKI stable across ECK-managed rotations.
  • Removed:
    • The CA-to-CA comparison (CAMatch)
    • CA extraction from the certificate chain - we now validate the leaf directly against the current CA

IMO the last approach is cleaner and directly addresses the incompatibility that breaks PKIX validation, wdyt?

@pkoutsovasilis
Copy link
Copy Markdown
Contributor Author

buildkite test this -f p=gke,t=TestCustom.CA. -m s=9.3.0,s=8.19.0

Copy link
Copy Markdown
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t spend a lot of time reviewing this PR end-to-end, so I may be off here, so bear with me 🙂

While using an LLM to better understand the change set and edge cases, I ended up writing this test:

func Test_ensureInternalSelfSignedCertificateSecretContents_UpdatesCAChainWhenLeafIsReused(t *testing.T) {
	// Build a rotated CA that keeps the same private key + same SKI.
	// In this setup the existing leaf can still be considered valid by AKI->SKI checks,
	// so reconcile may decide to reuse the leaf certificate bytes.
	rotatedCA, err := NewSelfSignedCA(CABuilderOptions{
		Subject:      pkix.Name{CommonName: "test-common-name"},
		PrivateKey:   testRSAPrivateKey,
		SubjectKeyID: testCA.Cert.SubjectKeyId,
	})
	require.NoError(t, err)
	require.False(t, testCA.Cert.Equal(rotatedCA.Cert), "rotated CA should be a different certificate")
	assert.Equal(t, testCA.Cert.SubjectKeyId, rotatedCA.Cert.SubjectKeyId, "rotated CA should preserve SKI")

	privateKey, err := EncodePEMPrivateKey(testRSAPrivateKey)
	require.NoError(t, err)

	// Seed the secret with OLD CA material:
	// - ca.crt points to old CA
	// - tls.crt contains a leaf signed by old CA and old CA in chain
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      InternalCertsSecretName(esv1.ESNamer, testES.Name),
			Namespace: testES.Namespace,
		},
		Data: map[string][]byte{
			KeyFileName:  privateKey,
			CAFileName:   EncodePEMCert(testCA.Cert.Raw),
			CertFileName: pemTLS, // leaf signed by testCA with testCA in chain
		},
	}

	// Reconcile against the rotated CA.
	// Expected behavior:
	// - leaf may be reused (same key + same SKI path is still valid),
	// - but CA material in secret should still be refreshed to rotated CA.
	changed, err := ensureInternalSelfSignedCertificateSecretContents(
		context.Background(),
		secret,
		k8s.ExtractNamespacedName(&testES),
		esv1.ESNamer,
		testES.Spec.HTTP.TLS,
		nil,
		[]corev1.Service{testSvc},
		rotatedCA,
		RotationParams{
			Validity:     DefaultCertValidity,
			RotateBefore: DefaultRotateBefore,
		},
	)
	require.NoError(t, err)

	// ca.crt and tls.crt chain should be rewritten with the current CA.
	assert.True(t, changed, "CA data should be refreshed even when leaf certificate is reused")
	assert.Equal(t, EncodePEMCert(rotatedCA.Cert.Raw), secret.Data[CAFileName], "ca.crt should be updated to rotated CA")

	chain, err := ParsePEMCerts(secret.Data[CertFileName])
	require.NoError(t, err)
	require.GreaterOrEqual(t, len(chain), 2, "tls.crt should include leaf + CA chain")
	assert.True(t, chain[1].Equal(rotatedCA.Cert), "tls.crt chain should include rotated CA")
}

With current PR code, this test fails. Is that expected behaviour, or does it point to a gap in the "same SKI" rotation path?

Comment thread pkg/controller/common/certificates/http_reconcile.go
@pkoutsovasilis
Copy link
Copy Markdown
Contributor Author

thanks for catching that @barkbay! I re-introduced the byte-swapping and added your test in bde2b83 which is now passes.

Evidently the boundaries of re-issuing or just updating the chain were blurry in my head

Copy link
Copy Markdown
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just a few nits and testing improvements.

Comment thread pkg/controller/common/certificates/ca.go Outdated
Comment thread pkg/controller/common/certificates/ca_reconcile_test.go
Comment thread pkg/controller/common/certificates/ca.go Outdated
Comment thread pkg/controller/common/certificates/http_reconcile.go Outdated
Comment thread pkg/controller/elasticsearch/certificates/transport/pod_secret.go Outdated
Comment thread test/e2e/es/certs_test.go Outdated
Comment thread pkg/controller/common/certificates/ca_reconcile_test.go Outdated
Comment thread pkg/controller/common/certificates/ca_reconcile.go
@pkoutsovasilis pkoutsovasilis requested a review from pebrc March 9, 2026 09:01
Copy link
Copy Markdown
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread pkg/controller/common/certificates/ca_reconcile.go
@pkoutsovasilis pkoutsovasilis merged commit ffcb643 into elastic:main Mar 9, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug Something isn't working v3.4.0

6 participants