[k8s] Fix logical race conditions in kubernetes_secrets provider#6623
Conversation
e001b10 to
9093b52
Compare
|
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
9093b52 to
3e3788e
Compare
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
Show resolved
Hide resolved
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
Show resolved
Hide resolved
| // no existing secret in the cache thus add it | ||
| return true | ||
| } | ||
| if existing.value != apiSecretValue && !existing.apiFetchTime.After(now) { |
There was a problem hiding this comment.
could we compare (just for readability)
sd.apiFetchTime.After(existing.apiFetchTime)
this reads better, update if current value is update after existing one was
There was a problem hiding this comment.
sd.apiFetchTime.After(existing.apiFetchTime) if I go with that, and the time between sd and existing are the same (I know it is way too hard but can happen) then I am gonna lose a secret even if the value has changed. makes sense? 🙂
There was a problem hiding this comment.
then we also could reverse and do !before. but this is just a readability nit
internal/pkg/composable/providers/kubernetessecrets/expiration_cache.go
Outdated
Show resolved
Hide resolved
internal/pkg/composable/providers/kubernetessecrets/expiration_cache.go
Outdated
Show resolved
Hide resolved
swiatekm
left a comment
There was a problem hiding this comment.
Overall this looks good to me, had some questions about the implementation details.
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
Show resolved
Hide resolved
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
Outdated
Show resolved
Hide resolved
|
@cmacknz since I am not sure that this PR should be categorised as a bug fix; please advice on the backport labels we should go with 🙂 |
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
Show resolved
Hide resolved
|
This looks great, and given it's an action item from a production incident, we should backport to the branches that allow it to be deployed to guard against those incidents occurring again. Backport to the oldest branch that infra observability will deploy and any newer branches. |
|
* fix: refactor kubernetes_secrets provider to eliminate race conditions * fix: add changelog fragment and unit-tests for kubernetes_secrets provider * fix: replace RWMutex with Mutex * fix: rename newExpirationStore to newExpirationCache * fix: introduce kubernetes_secrets provider name as a const * fix: extend AddConditionally doc string to describe the case of condition is nil * fix: gosec lint (cherry picked from commit 6d4b91c)
* fix: refactor kubernetes_secrets provider to eliminate race conditions * fix: add changelog fragment and unit-tests for kubernetes_secrets provider * fix: replace RWMutex with Mutex * fix: rename newExpirationStore to newExpirationCache * fix: introduce kubernetes_secrets provider name as a const * fix: extend AddConditionally doc string to describe the case of condition is nil * fix: gosec lint (cherry picked from commit 6d4b91c) # Conflicts: # internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
* fix: refactor kubernetes_secrets provider to eliminate race conditions * fix: add changelog fragment and unit-tests for kubernetes_secrets provider * fix: replace RWMutex with Mutex * fix: rename newExpirationStore to newExpirationCache * fix: introduce kubernetes_secrets provider name as a const * fix: extend AddConditionally doc string to describe the case of condition is nil * fix: gosec lint (cherry picked from commit 6d4b91c)
* fix: refactor kubernetes_secrets provider to eliminate race conditions * fix: add changelog fragment and unit-tests for kubernetes_secrets provider * fix: replace RWMutex with Mutex * fix: rename newExpirationStore to newExpirationCache * fix: introduce kubernetes_secrets provider name as a const * fix: extend AddConditionally doc string to describe the case of condition is nil * fix: gosec lint (cherry picked from commit 6d4b91c)
…) (#6797) * fix: refactor kubernetes_secrets provider to eliminate race conditions * fix: add changelog fragment and unit-tests for kubernetes_secrets provider * fix: replace RWMutex with Mutex * fix: rename newExpirationStore to newExpirationCache * fix: introduce kubernetes_secrets provider name as a const * fix: extend AddConditionally doc string to describe the case of condition is nil * fix: gosec lint (cherry picked from commit 6d4b91c) Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
…) (#6796) * fix: refactor kubernetes_secrets provider to eliminate race conditions * fix: add changelog fragment and unit-tests for kubernetes_secrets provider * fix: replace RWMutex with Mutex * fix: rename newExpirationStore to newExpirationCache * fix: introduce kubernetes_secrets provider name as a const * fix: extend AddConditionally doc string to describe the case of condition is nil * fix: gosec lint (cherry picked from commit 6d4b91c) Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
…) (#6794) * fix: refactor kubernetes_secrets provider to eliminate race conditions * fix: add changelog fragment and unit-tests for kubernetes_secrets provider * fix: replace RWMutex with Mutex * fix: rename newExpirationStore to newExpirationCache * fix: introduce kubernetes_secrets provider name as a const * fix: extend AddConditionally doc string to describe the case of condition is nil * fix: gosec lint (cherry picked from commit 6d4b91c) Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
…s_secrets provider (#6795) * [k8s] Fix logical race conditions in kubernetes_secrets provider (#6623) * fix: refactor kubernetes_secrets provider to eliminate race conditions * fix: add changelog fragment and unit-tests for kubernetes_secrets provider * fix: replace RWMutex with Mutex * fix: rename newExpirationStore to newExpirationCache * fix: introduce kubernetes_secrets provider name as a const * fix: extend AddConditionally doc string to describe the case of condition is nil * fix: gosec lint (cherry picked from commit 6d4b91c) # Conflicts: # internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go * fix: resolve conflicts * fix: implicit memory aliasing in for loop * fix: make Run not to fail if getK8sClientFunc returns an err which is appropriate only for 8.17.x --------- Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>




What does this PR do?
This PR refactors the implementation to eliminate logical race conditions of the
kubernetes_secretsprovider alongside with brand new unit-tests.Initially, the issue seemed to stem from misuse or lack of synchronisation primitives, but after deeper analysis, it became evident that the "race" conditions were logical rather than concurrency-related. The existing implementation was structured in a way that led to inconsistencies due to overlapping responsibilities of different actors managing the secret lifecycle.
To address this, I restructured the logic while keeping in mind the constraints of the existing provider, specifically:
With this in mind, the provider behaviour is now as follows:
Cache Disabled Mode:
Cache Enabled Mode:
Cache Actor Responsibilities:
lastAccess:lastAccessto prevent premature expiration and give the fetch actor time to pick up the new value.lastAccessand let the entry "age" as it should.Fetch Actor Responsibilities:
lastAccesswhen an entry is accessed to prevent unintended expiration.Considerations:
ExpirationCachefromk8s.io/client-go/tools/cachedoes not suit our needs, as it lacks the aforementioned conditional insertion required to handle these interactions correctly.PS: as the main changes of this PR are captured by the commit a549728, I consider this PR to be aligned with the Pull Requests policy
Why is it important?
This refactor significantly improves the correctness of the
kubernetes_secretsprovider by ensuring:Checklist
./changelog/fragmentsusing the change-log toolDisruptive User Impact
This change does not introduce breaking changes but ensures that the
kubernetes_secretsprovider operates correctly in cache-enabled mode. Users relying on cache behaviour may notice improved stability in secret retrieval.How to test this PR locally
go test ./internal/pkg/composable/providers/kubernetessecrets/...Related issues