Skip to content

[libbeat] Fix a shutdown race in the memory queue#47248

Merged
faec merged 9 commits intoelastic:mainfrom
faec:memqueue-shutdown-race
Oct 24, 2025
Merged

[libbeat] Fix a shutdown race in the memory queue#47248
faec merged 9 commits intoelastic:mainfrom
faec:memqueue-shutdown-race

Conversation

@faec
Copy link
Contributor

@faec faec commented Oct 21, 2025

Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246.

A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool.

Related issues

@faec faec self-assigned this Oct 21, 2025
@faec faec requested a review from a team as a code owner October 21, 2025 21:18
@faec faec added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team bugfix labels Oct 21, 2025
@faec faec requested review from leehinman and mauri870 October 21, 2025 21:18
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 21, 2025
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 21, 2025
@github-actions
Copy link
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Copy link
Contributor

mergify bot commented Oct 21, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.
@faec faec added the backport-active-all Automated backport with mergify to all the active branches label Oct 21, 2025
Copy link
Contributor

@leehinman leehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love that this removes the buffered channel. I don't think this will make performance worse, but any chance you could add a benchmark test to verify?

Otherwise LGTM.

Copy link
Member

@mauri870 mauri870 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing this! I tested it with the stresstest script and confirmed it's stable.

@faec
Copy link
Contributor Author

faec commented Oct 22, 2025

I love that this removes the buffered channel.

I'm increasingly of the opinion that controller logic goroutines should always have unbuffered input channels, there's often some sense that there should be a buffer for performance, "just in case," but it makes the synchronization logic so much less stable, this specific channel buffer has caused years worth of incredibly subtle races.

I don't think this will make performance worse, but any chance you could add a benchmark test to verify?

I don't think so either (Publish already blocks on a channel for every event since the refactor last year, so the buffer seems vestigial), but I will look into confirming.

@faec
Copy link
Contributor Author

faec commented Oct 24, 2025

Sigh, the buffered channel is sticking around after all 😔

Per @leehinman's request I added a benchmark to confirm this wasn't hurting performance, and found that it actually did slow down performance by 10-15%. So I've revised the fix to keep the buffered channel but to handle shutdown more carefully in the producers so they still handle a pending response even if the queue shutdown reaches them first. (Response channels already had a one-element buffer, and after the closing channel is triggered the response buffer is guaranteed to be filled for all requests that were actually handled.) This version is more self-evidently performance-neutral, but the benchmark test confirms it and I left the it in the PR just for reference.

@faec faec enabled auto-merge (squash) October 24, 2025 21:33
@faec faec merged commit 94eaea9 into elastic:main Oct 24, 2025
208 checks passed
@github-actions
Copy link
Contributor

@Mergifyio backport 8.19 9.0 9.1 9.2

@mergify
Copy link
Contributor

mergify bot commented Oct 24, 2025

backport 8.19 9.0 9.1 9.2

✅ Backports have been created

Details
mergify bot pushed a commit that referenced this pull request Oct 24, 2025
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246.

A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.

(cherry picked from commit 94eaea9)
mergify bot pushed a commit that referenced this pull request Oct 24, 2025
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246.

A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.

(cherry picked from commit 94eaea9)
mergify bot pushed a commit that referenced this pull request Oct 24, 2025
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246.

A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.

(cherry picked from commit 94eaea9)

# Conflicts:
#	libbeat/publisher/queue/memqueue/queue_test.go
mergify bot pushed a commit that referenced this pull request Oct 24, 2025
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246.

A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.

(cherry picked from commit 94eaea9)
pierrehilbert pushed a commit that referenced this pull request Oct 27, 2025
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246.

A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.

(cherry picked from commit 94eaea9)

Co-authored-by: Fae Charlton <fae.charlton@elastic.co>
pierrehilbert pushed a commit that referenced this pull request Oct 27, 2025
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246.

A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.

(cherry picked from commit 94eaea9)

Co-authored-by: Fae Charlton <fae.charlton@elastic.co>
pierrehilbert pushed a commit that referenced this pull request Oct 27, 2025
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see #47246.

A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.

(cherry picked from commit 94eaea9)

Co-authored-by: Fae Charlton <fae.charlton@elastic.co>
andrzej-stencel pushed a commit to andrzej-stencel/beats that referenced this pull request Dec 1, 2025
Fix a race in the memory queue where an event that was reported as unpublished during queue shutdown could still be enqueued, ingested, and have its acknowledgment callback triggered. In specific circumstances this could cause a panic during Filebeat shutdown, see elastic#47246.

A deterministic test is impossible since the behavior depended on precise timing of an internal select statement with multiple valid paths, but the accompanying unit test fails 90% of the time on the old code, and passes 100% of the time with this change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-active-all Automated backport with mergify to all the active branches bugfix Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

5 participants