Detect stale CA in certificate chain and trigger certificates reissuance#9197
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
26d29a3 to
0311668
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates certificate reconciliation to detect CA rotations where the CA is reissued with the same private key (a case where x509.Verify() alone may still succeed), and adds regression tests to ensure leaf certificates are reissued when the CA certificate itself changes.
Changes:
- Update transport cert reconciliation to extract both leaf and CA certs from the chain and reissue when the CA cert in the chain differs from the current CA.
- Update HTTP cert reconciliation to invalidate/reissue when the CA cert in the chain differs from the current CA.
- Add unit tests covering CA rotation with the same private key.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/controller/elasticsearch/certificates/transport/reconcile.go | Switches to new extraction helper returning leaf+CA (leaf used for rotation timing). |
| pkg/controller/elasticsearch/certificates/transport/pod_secret.go | Adds CA-in-chain equality check and replaces leaf-only extraction with leaf+CA extraction helper. |
| pkg/controller/elasticsearch/certificates/transport/pod_secret_test.go | Adds a regression unit test for CA rotation with the same private key. |
| pkg/controller/common/certificates/http_reconcile.go | Adds CA-in-chain equality check when deciding whether an existing HTTP cert is valid. |
| pkg/controller/common/certificates/http_reconcile_test.go | Updates/strengthens internal HTTP cert reconcile test expectations for CA rotation with same key. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
0311668 to
57ddfb8
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
|
Thanks for the fix — the root cause analysis (Go 1.25 changing SKI computation from SHA-1 to truncated SHA-256 breaking Java's AKI→SKI PKIX validation) is clear and the approach is solid. One thing I wanted to raise: have you considered pinning the Subject Key Identifier (SKI) during CA renewal instead of (or in addition to) triggering full leaf cert reissuance? In // In CABuilderOptions:
SubjectKeyId []byte
// In NewSelfSignedCA, on the template:
SubjectKeyId: options.SubjectKeyId,
// In renewCAFromExisting:
SubjectKeyId: oldCA.Cert.SubjectKeyId,This would keep the SKI stable across Go versions, so existing leaf certs' AKI→SKI chains remain valid and no reissuance is needed. You would obviously also have to verify that SKI has not changed when checking certificates for validity. The reason I bring this up is the transient disruption window during hot-reload. With the current PR, when the CA rotates and all leaf certs are reissued simultaneously, there's a window (up to kubelet sync interval, ~1 minute) where some pods have reloaded the new certs and others haven't. During that window, pods that have reloaded (new leaf cert with AKI→new_SKI, trusting new CA) and pods that haven't (old leaf cert with AKI→old_SKI, trusting old CA) cannot validate each other's certs in either direction — both A→B and B→A fail Java's PKIX check. With the byte-swap approach at least one direction still worked during that window. Pinning the SKI avoids this entirely since all existing certs remain valid throughout the transition. The tradeoff is that the byte-swap leaves a technically imprecise state (leaf signed by old CA, chain contains new CA), but that state is valid from both Go's and Java's perspective as long as AKI→SKI chains correctly and the key hasn't changed — which SKI pinning guarantees. Thoughts? |
|
Thanks @pebrc, these are good points to consider. On SKI pinningI looked at the code and On relying on validator behaviourThis is my main concern with the byte-swap approach. We'd be leaving a state where the leaf cert's issuer (old CA) differs from the CA cert in the chain (new CA), relying on validators to only check cryptographic signatures rather than certificate identity. While the SKI pinning would work for Go and Java today, should we bet on it?! (maybe I am being too defensive here) On the transient disruption windowThis is a valid concern only SKI pinning solves it All the above said, maybe a solution is a hybrid approach? SKI pinning for ECK-generated certificate in addition to the CA matching checks for user provided secrets. Thoughts? |
I think your concern is fair but AKI/SKI chaining is defined in the RFC so it is not an implemenation quirk but something we can build upon? I would expect any conformant PKIX implementation to accept it if the identifiers match. So in my mind the minimal fix is maybe the extended validation with AKI/SKI comparison. SKI pinning is not strictly necessary as the extended validation would already cover the bug (but maybe cheap IIUC it's 3 loc ), wdyt? If we want to keep the full comparison we should probably remove the old byte-swap approach completely. |
|
Thanks @pebrc , I've updated the PR to incorporate your suggestions:
AFAICT this approach:
wdyt? |
…son to validate certificate chains
9c890a3 to
41d63ae
Compare
|
One concern with the These diverge in the pre-existing byte-swapped state — i.e. clusters already affected by the original bug:
When the PR is deployed on this cluster:
The fix would be to check the leaf cert's if len(certificate.AuthorityKeyId) > 0 && len(ca.Cert.SubjectKeyId) > 0 {
if !bytes.Equal(certificate.AuthorityKeyId, ca.Cert.SubjectKeyId) {
// AKI→SKI chain is broken, leaf was signed by a CA with a different SKI
return nil
}
}This directly checks whether the leaf cert is compatible with the current CA. In the byte-swapped state, The So the recommendation would be: AKI→SKI check on the leaf cert (primary detection) + SKI pinning (prevention for future rotations). The |
…ring CA certificates in chain
|
Thanks for catching that @pebrc - I had zoned out and was clearly focusing only on comparing the two CAs rather than what actually matters: whether the leaf cert is compatible with the current CA. Back on track now. I've updated the implementation to follow your recommendation:
This catches the pre-existing byte-swapped state immediately - leaf's AKI (old_SKI) ≠ current CA's SKI (new_SKI) → reissuance triggered.
IMO the last approach is cleaner and directly addresses the incompatibility that breaks PKIX validation, wdyt? |
|
buildkite test this -f p=gke,t=TestCustom.CA. -m s=9.3.0,s=8.19.0 |
barkbay
left a comment
There was a problem hiding this comment.
I didn’t spend a lot of time reviewing this PR end-to-end, so I may be off here, so bear with me 🙂
While using an LLM to better understand the change set and edge cases, I ended up writing this test:
func Test_ensureInternalSelfSignedCertificateSecretContents_UpdatesCAChainWhenLeafIsReused(t *testing.T) {
// Build a rotated CA that keeps the same private key + same SKI.
// In this setup the existing leaf can still be considered valid by AKI->SKI checks,
// so reconcile may decide to reuse the leaf certificate bytes.
rotatedCA, err := NewSelfSignedCA(CABuilderOptions{
Subject: pkix.Name{CommonName: "test-common-name"},
PrivateKey: testRSAPrivateKey,
SubjectKeyID: testCA.Cert.SubjectKeyId,
})
require.NoError(t, err)
require.False(t, testCA.Cert.Equal(rotatedCA.Cert), "rotated CA should be a different certificate")
assert.Equal(t, testCA.Cert.SubjectKeyId, rotatedCA.Cert.SubjectKeyId, "rotated CA should preserve SKI")
privateKey, err := EncodePEMPrivateKey(testRSAPrivateKey)
require.NoError(t, err)
// Seed the secret with OLD CA material:
// - ca.crt points to old CA
// - tls.crt contains a leaf signed by old CA and old CA in chain
secret := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{
Name: InternalCertsSecretName(esv1.ESNamer, testES.Name),
Namespace: testES.Namespace,
},
Data: map[string][]byte{
KeyFileName: privateKey,
CAFileName: EncodePEMCert(testCA.Cert.Raw),
CertFileName: pemTLS, // leaf signed by testCA with testCA in chain
},
}
// Reconcile against the rotated CA.
// Expected behavior:
// - leaf may be reused (same key + same SKI path is still valid),
// - but CA material in secret should still be refreshed to rotated CA.
changed, err := ensureInternalSelfSignedCertificateSecretContents(
context.Background(),
secret,
k8s.ExtractNamespacedName(&testES),
esv1.ESNamer,
testES.Spec.HTTP.TLS,
nil,
[]corev1.Service{testSvc},
rotatedCA,
RotationParams{
Validity: DefaultCertValidity,
RotateBefore: DefaultRotateBefore,
},
)
require.NoError(t, err)
// ca.crt and tls.crt chain should be rewritten with the current CA.
assert.True(t, changed, "CA data should be refreshed even when leaf certificate is reused")
assert.Equal(t, EncodePEMCert(rotatedCA.Cert.Raw), secret.Data[CAFileName], "ca.crt should be updated to rotated CA")
chain, err := ParsePEMCerts(secret.Data[CertFileName])
require.NoError(t, err)
require.GreaterOrEqual(t, len(chain), 2, "tls.crt should include leaf + CA chain")
assert.True(t, chain[1].Equal(rotatedCA.Cert), "tls.crt chain should include rotated CA")
}With current PR code, this test fails. Is that expected behaviour, or does it point to a gap in the "same SKI" rotation path?
…tched CA rotation
pebrc
left a comment
There was a problem hiding this comment.
LGTM just a few nits and testing improvements.
Overview
This PR updates certificate reconciliation to detect CA rotations where the CA is reissued with the same private key. In this scenario,
x509.Verify()alone still succeeds (valid signature), but the CA certificate embedded in the chain becomes stale.Changes
shouldIssueNewCertificate()to extract both leaf and CA certs from the chain and trigger reissuance when the CA cert in the chain differs from the current CA.getHTTPCertificate()to invalidate and reissue when the CA cert in the chain differs from the current CA.