Remove race condition when accessing remote bulker map#4171
Remove race condition when accessing remote bulker map#4171michel-laterman merged 5 commits intoelastic:mainfrom
Conversation
|
This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
|
|
|
|
|
||
| // GetBulkerMap returns a copy of the BulkerMap | ||
| func (b *Bulker) GetBulkerMap() map[string]Bulk { | ||
| return b.bulkerMap |
There was a problem hiding this comment.
From what I can see, we had a flaky test because the policy self-monitor (internal/pkg/policy/self.go) was getting the bulker map in order to check remote output health; but we did not prevent our policy output preparation from creating a new bulker concurrently (internal/pkg/policy/policy_output.go).
I've changes our maps to sync.Map as we weren't properly using the mutex we had, and changed this func to return a copy instead
internal/pkg/bulk/engine.go
Outdated
| if !ok { | ||
| return nil | ||
| } | ||
| return o.(Bulk) |
There was a problem hiding this comment.
it shouldn't, we are only adding Bulkers
internal/pkg/bulk/engine.go
Outdated
| return bulker, false, nil | ||
| bulker, ok := b.bulkerMap.Load(outputName) | ||
| if ok && !hasConfigChanged { | ||
| return bulker.(Bulk), false, nil |
There was a problem hiding this comment.
I'm assuming it is guaranteed that the sync.Map will only hold bulker type, is that the case?
|
I added some benchmarks, and ran them against main for comparison. Benchmarks were ran with |
| } | ||
|
|
||
| func Benchmark_CreateAndGetBulker(b *testing.B) { | ||
| b.Skip("Crashes on remote runner") |
There was a problem hiding this comment.
I'm not sure why, but this causes issues when make benchmark is ran, but these tests can be ran individually without this
|
buildkite test this |
|
My take on this is that this is a real bug, the I don't like the use of
We don't satisfy 1 because outputs can be changed, updated, and deleted. We don't satisfy 2 because the entire point of It looks like we could solve this but just introducing a RWMutex for all the uses of bulkMap. There are several that are not mutex protected. fleet-server/internal/pkg/bulk/engine.go Lines 127 to 128 in 6b29ab4 The You could potentially rewrite the way the remote ES output healthcheck works to not need a concurrent map at all. The self monitor could listen on a channel for state updates from each attempt to interact with the remote ES update that could fail or something like that. |
|
I'm not really worried about the performance implications of this after read the code involved. I think that was just a way to see if use of sync.Map was justified. I don't think it is for non-performance reasons (interfaces, ugh). |
|
I'll reimplement the mutex and create an issue to discuss remote output health |
|
Issue: #4185 |
|
Use the remoteOutputMutex whenever accessing the bulkerMap. Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map. (cherry picked from commit 924ea07)
Use the remoteOutputMutex whenever accessing the bulkerMap. Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map. (cherry picked from commit 924ea07)
Use the remoteOutputMutex whenever accessing the bulkerMap. Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map. (cherry picked from commit 924ea07)
Use the remoteOutputMutex whenever accessing the bulkerMap. Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map. (cherry picked from commit 924ea07) Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Use the remoteOutputMutex whenever accessing the bulkerMap. Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map. (cherry picked from commit 924ea07) Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Use the remoteOutputMutex whenever accessing the bulkerMap. Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map. (cherry picked from commit 924ea07) Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>




What is the problem this PR solves?
Remove a race condition/bug that may occur when remote ES outputs are used.
How does this PR solve the problem?
Use the
remoteOutputMutexwhenever accessing thebulkerMap.Change
GetBulkerMapto return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map.Design Checklist
I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration files./changelog/fragmentsusing the changelog toolRelated issues