[libbeat] Fix a shutdown race in the memory queue#47248
Conversation
|
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
🤖 GitHub commentsExpand to view the GitHub comments
Just comment with:
|
|
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
mauri870
left a comment
There was a problem hiding this comment.
LGTM, thanks for fixing this! I tested it with the stresstest script and confirmed it's stable.
I'm increasingly of the opinion that controller logic goroutines should always have unbuffered input channels, there's often some sense that there should be a buffer for performance, "just in case," but it makes the synchronization logic so much less stable, this specific channel buffer has caused years worth of incredibly subtle races.
I don't think so either ( |
|
Sigh, the buffered channel is sticking around after all 😔 Per @leehinman's request I added a benchmark to confirm this wasn't hurting performance, and found that it actually did slow down performance by 10-15%. So I've revised the fix to keep the buffered channel but to handle shutdown more carefully in the producers so they still handle a pending response even if the queue shutdown reaches them first. (Response channels already had a one-element buffer, and after the closing channel is triggered the response buffer is guaranteed to be filled for all requests that were actually handled.) This version is more self-evidently performance-neutral, but the benchmark test confirms it and I left the it in the PR just for reference. |
|
@Mergifyio backport 8.19 9.0 9.1 9.2 |
✅ Backports have been createdDetails
|
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246. A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change. (cherry picked from commit 94eaea9)
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246. A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change. (cherry picked from commit 94eaea9)
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246. A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change. (cherry picked from commit 94eaea9) # Conflicts: # libbeat/publisher/queue/memqueue/queue_test.go
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246. A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change. (cherry picked from commit 94eaea9)
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246. A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change. (cherry picked from commit 94eaea9) Co-authored-by: Fae Charlton <fae.charlton@elastic.co>
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246. A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change. (cherry picked from commit 94eaea9) Co-authored-by: Fae Charlton <fae.charlton@elastic.co>
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246. A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change. (cherry picked from commit 94eaea9) Co-authored-by: Fae Charlton <fae.charlton@elastic.co>
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see elastic#47246. A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246.
A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.
Checklist
./changelog/fragmentsusing the changelog tool.Related issues