Skip to content

Fix spinner clear_transient race condition during rapid reruns#13849

Merged
kmcgrady merged 2 commits intodevelopfrom
fix/clear-transient
Feb 28, 2026
Merged

Fix spinner clear_transient race condition during rapid reruns#13849
kmcgrady merged 2 commits intodevelopfrom
fix/clear-transient

Conversation

@kmcgrady
Copy link
Copy Markdown
Collaborator

@kmcgrady kmcgrady commented Feb 6, 2026

Describe your changes

Fixes a frontend stale-node edge case where the first widget above a @st.fragment could disappear and be replaced by div.stEmpty after a cache miss.

Root cause: During stale-node cleanup, a TransientNode that had already been explicitly cleared in the current run (transientNodes.length === 0) could still hold a valid anchor element from a previous run. The existing stale filtering path treated that anchor as stale and removed it.

Fix: In ClearStaleNodeVisitor.visitTransientNode, return the anchor directly when the transient was cleared in the current run and has no transient children. This preserves the anchored widget instead of incorrectly dropping it.

GitHub Issue Link (if applicable)

Closes #13634

Testing Plan

  • Frontend unit test added in ClearStaleNodeVisitor.test.ts: restores anchor for transient cleared in current run
  • Targeted Vitest run:
    • yarn vitest run lib/src/render-tree/visitors/ClearStaleNodeVisitor.test.ts -t "restores anchor for transient cleared in current run"
  • Manual reproduction validated in debug session: widget no longer disappears

Contribution License Agreement

By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.

Copilot AI review requested due to automatic review settings February 6, 2026 17:36
@snyk-io
Copy link
Copy Markdown
Contributor

snyk-io Bot commented Feb 6, 2026

Snyk checks have passed. No issues have been found so far.

Status Scanner Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Feb 6, 2026

✅ PR preview is ready!

Name Link
📦 Wheel file https://core-previews.s3-us-west-2.amazonaws.com/pr-13849/streamlit-1.54.0-py3-none-any.whl
📦 @streamlit/component-v2-lib Download from artifacts
🕹️ Preview app pr-13849.streamlit.app (☁️ Deploy here if not accessible)
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race condition in st.spinner where clear_transient could be sent even when no transient was ever created, which could lead to stale-anchor behavior and incorrect element removal during rapid reruns.

Changes:

  • Track whether create_transient() actually ran and only enqueue clear_transient() when a transient was created.
  • Add a regression unit test to ensure no transient-related messages are sent when the spinner exits before its delay timer fires.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
lib/streamlit/elements/spinner.py Adds transient_created tracking so clear_transient() is only sent when needed, preventing unnecessary transient node creation.
lib/tests/streamlit/spinner_test.py Adds a regression test asserting that fast-exiting spinners do not enqueue transient messages.
Comment thread lib/tests/streamlit/spinner_test.py Outdated
Comment on lines +133 to +142
"""Test that no transient message is sent when spinner exits before timer fires.

This is a regression test for a race condition where rapid reruns would cause
unnecessary clear_transient messages to be sent, leading to stale element removal.
The spinner has a 0.5s delay before showing, so if the context exits before that,
no transient should be created or cleared.
"""
with st.spinner("quick spinner"):
# Exit immediately, before the 0.5s timer fires
time.sleep(0.1)
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test and docstring hard-code the spinner delay as "0.5s" and rely on time.sleep(0.1). Since the real delay is defined by DELAY_SECS in streamlit.elements.spinner, consider referencing that constant (and avoiding mentioning a specific numeric value in the docstring) so the test remains correct if the delay is ever adjusted.

Copilot uses AI. Check for mistakes.
@kmcgrady kmcgrady added security-assessment-completed change:bugfix PR contains bug fix implementation impact:users PR changes affect end users ai-review If applied to PR or issue will run AI review workflow labels Feb 6, 2026
@github-actions github-actions Bot removed the ai-review If applied to PR or issue will run AI review workflow label Feb 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Feb 6, 2026

Summary

This PR fixes a race condition where UI elements (like buttons) could disappear after rapidly clicking buttons that trigger full reruns. The root cause was that the spinner's finally block unconditionally sent a clear_transient message even when no transient was ever created (i.e., the spinner context exited before the 0.5s delay timer fired). These unnecessary clear messages created TransientNodes with stale scriptRunId anchors, causing legitimate elements to be incorrectly removed during clearStaleNodes.

Production code change: 1 file (lib/streamlit/elements/spinner.py), +8/-1 lines — adds a transient_created boolean flag, set inside the existing lock, and gates the clear_transient() call on it.

Test changes: 25 test files updated to reflect the corrected behavior.

Code Quality

The production fix is clean, minimal, and well-placed:

        display_message = True
        transient_created = False
        display_message_lock = threading.Lock()
        timer: threading.Timer | None = None
        try:

            def set_message() -> None:
                nonlocal transient_created
                with display_message_lock:
                    if display_message:
                        # Ignore the DeltaGenerator conveniences because Transients are special
                        enqueue_message(create_transient())
                        transient_created = True

            timer = threading.Timer(DELAY_SECS, set_message)
            add_script_run_ctx(timer)
            timer.start()
            # Yield control back to the context.
            yield
        finally:
            if timer:
                timer.cancel()
            with display_message_lock:
                display_message = False

                # Only send the clear message if we actually created a transient.
                # This prevents unnecessary clear messages during rapid reruns
                # where the timer is cancelled before firing.
                if transient_created:
                    enqueue_message(clear_transient())

Thread safety analysis: The transient_created flag is correctly synchronized. Both the timer thread (set_message) and the main thread (finally) access it under display_message_lock. All edge cases are handled:

  • Timer cancelled before firing: transient_created stays False, no clear sent. Correct.
  • Timer fires before finally: transient_created is True, clear is sent. Correct.
  • Timer and finally race: Lock ensures mutual exclusion — whichever acquires the lock first determines behavior correctly in both orderings.

The comments are clear and explain the "why" effectively.

Test Coverage

New regression test (spinner_test.py:132-149): test_spinner_no_transient_when_exiting_quickly directly tests the fix by verifying the forward message queue is empty when the spinner exits before the 0.5s timer fires. This is a well-written, targeted regression test with a good docstring explaining the race condition.

Updated existing tests: The PR comprehensively updates ~20 test_shows_cached_widget_replay_warning tests across widget test files. These tests were unintentionally relying on the old buggy behavior where a clear_transient was always sent. The fix:

  • Adds show_spinner=False to isolate the tests from spinner behavior (they test cached widget warnings, not spinners).
  • Changes queue index from -3 to -2 since no spinner message is enqueued.

This is the right approach — these tests should not have been coupled to spinner internals.

Updated spinner-specific cache tests (cache_spinner_test.py, common_cache_test.py): Tests for fast-completing cached functions now correctly assert the queue is empty. The test_spinner_with_nested_cached_functions test was simplified by removing the now-unnecessary iteration over the queue looking for empty transient elements.

No E2E test was added, which is acceptable here — the race condition is timing-dependent and an E2E test would be inherently flaky. The unit test covers the core mechanism reliably.

Backwards Compatibility

No breaking changes. This is purely an internal optimization that reduces unnecessary messages from the server to the client. No public API signatures, protobuf definitions, or observable user-facing behavior (other than the bug fix) is changed. The spinner still displays correctly when the timer fires.

Security & Risk

No security concerns. The change is a minor logic guard in a single function. Risk of regression is very low:

  • The change only skips a message that should never have been sent in the first place.
  • When the spinner does display (timer fires before context exits), the behavior is unchanged — transient_created is set to True and the clear_transient is sent as before.
  • Existing tests for the "spinner displays normally" path (test_spinner, test_spinner_for_caching, test_spinner_time, etc.) still sleep 0.7s (>0.5s delay), ensuring the timer fires and the create/clear pair is still tested.

Accessibility

No frontend changes in this PR. The fix is entirely backend (Python message-passing logic). No accessibility implications.

Recommendations

No blocking issues. Minor suggestions:

  1. Consider a positive+negative test enhancement: The new test_spinner_no_transient_when_exiting_quickly test could additionally assert that when sleeping longer than DELAY_SECS, transient messages are sent (the "positive" counterpart). However, this is already covered by existing tests (test_spinner, test_spinner_for_caching), so this is not strictly necessary.

  2. Consider adding a brief inline comment on the transient_created = False declaration explaining it's guarded by display_message_lock, for future readers scanning the code. This is a very minor nit.

Verdict

APPROVED: Clean, minimal, thread-safe fix for a real race condition bug with comprehensive test updates. The approach is sound, well-documented, and low-risk.


This is an automated AI review using opus-4.6-thinking. Please verify the feedback and use your judgment.

Copy link
Copy Markdown
Collaborator

@lukasmasuch lukasmasuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@sfc-gh-kmcgrady sfc-gh-kmcgrady marked this pull request as draft February 23, 2026 18:15
@kmcgrady kmcgrady force-pushed the fix/clear-transient branch from 51696a5 to 3cc7750 Compare February 26, 2026 18:07
@kmcgrady kmcgrady marked this pull request as ready for review February 26, 2026 18:07
@kmcgrady kmcgrady added the ai-review If applied to PR or issue will run AI review workflow label Feb 26, 2026
@github-actions github-actions Bot removed the ai-review If applied to PR or issue will run AI review workflow label Feb 26, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Summary

This PR fixes a frontend race condition (issue #13634) where the first widget above a @st.fragment could disappear and be replaced by an empty div after a cache miss and rapid reruns. The fix adds a targeted early-return in ClearStaleNodeVisitor.visitTransientNode that preserves the anchor node when a TransientNode has been explicitly cleared (zero transient children) during the current script run. A corresponding unit test is added.

Only two files are changed:

  • frontend/lib/src/render-tree/visitors/ClearStaleNodeVisitor.ts — the bug fix
  • frontend/lib/src/render-tree/visitors/ClearStaleNodeVisitor.test.ts — new test case

Reviewers agree unanimously that the fix is clean, minimal, well-placed, and ready to merge.

Code Quality

Both reviewers confirm the implementation is concise and follows existing visitor patterns. The early-return guard at lines 146-152 checks three narrow conditions:

  1. node.scriptRunId === this.currentScriptRunId — the transient was touched in this run
  2. node.transientNodes.length === 0 — transient children were cleared
  3. node.anchor — there is an anchor to restore

This correctly short-circuits before the existing logic that would recursively visit the anchor (which may carry a stale scriptRunId from a prior run) and incorrectly remove it. The comment explains the rationale well.

Agreement: Both reviewers note the logic is consistent with how ClearTransientNodesVisitor.visitTransientNode works — it also returns node.anchor directly when clearing transients, without checking the anchor's staleness. No readability or maintainability concerns.

Additional analysis (opus-4.6-thinking): The early return bypasses the recursive accept(this) call on the anchor, meaning if the anchor itself is a BlockNode with stale children, those children would not be pruned in this path. This is acceptable because the anchor is typically an ElementNode, and stale-node cleanup will run again on subsequent reruns when the anchor gets a current scriptRunId.

Test Coverage

Both reviewers agree the new unit test ("restores anchor for transient cleared in current run") correctly validates the core bug scenario and is valuable as a regression test. Existing tests for other visitTransientNode branches remain unaffected.

Agreement on residual gap: Both reviewers note there is no E2E test reproducing the real-world @st.fragment + @st.cache_resource disappearance scenario from issue #13634. Both acknowledge this is reasonable given the race-condition nature of the bug, which may be difficult to reliably reproduce in Playwright.

Non-blocking suggestions (opus-4.6-thinking):

  • A complementary negative assertion (e.g., verifying the result is not a TransientNode) could strengthen anti-regression signal, though toBe(anchor) is already quite specific.
  • A test case exercising the early-return path during a fragment run (with fragmentIdsThisRun set) would increase confidence, as the fix is not fragment-specific but fragments are the main use case.

Backwards Compatibility

No breaking changes. Both reviewers agree the behavior change is intentionally scoped to transient nodes that are from the current script run, have no transient children, and still have an anchor. The early return has strictly narrower conditions than the subsequent code path, preserving existing behavior for all other cases.

Security & Risk

No security concerns identified by either reviewer. Both agree regression risk is low because the condition is narrow and test-backed, though render-tree sequencing issues are historically subtle.

Accessibility

No accessibility-impacting changes. No DOM or ARIA changes in this PR.

Recommendations

  1. (Follow-up) Add an E2E or integration test that reproduces issue First streamlit widget vanish when used above a @st.fragment that calls a @st.cache_resource function #13634 (fragment rerun + cache miss path) to lock in behavior across full message sequencing.
  2. (Optional) Add a test case for the early-return path in a fragment-run context (passing fragmentIdsThisRun to the visitor constructor) to ensure the fix works correctly during fragment reruns.

Verdict

APPROVED: Both reviewers approve unanimously. The fix is focused, minimal, consistent with existing render-tree patterns, and sufficiently validated at unit level. No backwards compatibility, security, or accessibility concerns. Merge is recommended, with a suggestion to strengthen integration/E2E coverage as a follow-up.


Consolidated review by opus-4.6-thinking.


📋 Review by `gpt-5.3-codex-high`

Summary

No blocking issues found. The PR introduces a targeted fix in frontend/lib/src/render-tree/visitors/ClearStaleNodeVisitor.ts to preserve a transient anchor when the transient was explicitly cleared in the current run, and adds a focused unit test in frontend/lib/src/render-tree/visitors/ClearStaleNodeVisitor.test.ts for that edge case.

Code Quality

The implementation is concise and follows existing visitor patterns in the render-tree layer.

  • frontend/lib/src/render-tree/visitors/ClearStaleNodeVisitor.ts (around lines 143-152) adds a narrow guard with clear intent and comments.
  • The new branch is consistent with transient semantics in this codebase (anchor restoration when transients are cleared), and avoids broad logic changes elsewhere.
  • No maintainability or readability concerns in the touched code.

Test Coverage

Unit coverage is improved and directionally good:

  • Added regression-style unit test restores anchor for transient cleared in current run in frontend/lib/src/render-tree/visitors/ClearStaleNodeVisitor.test.ts (around lines 270-279).
  • Existing tests in the same file already cover the other visitTransientNode branches, so branch-level confidence is solid.

Residual gap:

  • There is still no integration/e2e test reproducing the real-world @st.fragment + @st.cache_resource disappearance scenario from issue #13634. The unit test is valuable, but an end-to-end guard would better protect against sequencing regressions in message-cache/fragment flows.

Backwards Compatibility

No breaking API or schema changes. Behavior change is intentionally scoped to transient nodes that:

  • are from the current script run,
  • have no transient children, and
  • still have an anchor.

This is backward compatible for consumers and aligns with expected UI persistence semantics during rerun transitions.

Security & Risk

No security concerns identified.

Regression risk appears low because the condition is narrow and test-backed, though render-tree sequencing issues are historically subtle, so the lack of e2e coverage remains the main residual risk.

Accessibility

No accessibility-impacting UI/DOM changes in this PR. No a11y concerns identified.

Recommendations

  1. Add a follow-up integration or e2e test that reproduces issue #13634 (fragment rerun + cache miss path) to lock in behavior across full message sequencing.

Verdict

APPROVED: The fix is focused, consistent with render-tree behavior, and sufficiently validated at unit level; merge is reasonable with a recommendation to strengthen integration coverage.


This is an automated AI review by gpt-5.3-codex-high. Please verify the feedback and use your judgment.

📋 Review by `opus-4.6-thinking`

Summary

This PR fixes a frontend race condition (issue #13634) where the first widget above a @st.fragment could disappear and be replaced by an empty div after a cache miss and rapid reruns. The fix is a small, targeted early-return in ClearStaleNodeVisitor.visitTransientNode that preserves the anchor node when a TransientNode has been explicitly cleared (zero transient children) during the current script run. A corresponding unit test is added.

Only two files are changed:

  • frontend/lib/src/render-tree/visitors/ClearStaleNodeVisitor.ts — the bug fix
  • frontend/lib/src/render-tree/visitors/ClearStaleNodeVisitor.test.ts — new test case

Code Quality

The fix is clean, minimal, and well-placed. The early-return guard at lines 146-152 of ClearStaleNodeVisitor.ts checks three conditions:

  1. node.scriptRunId === this.currentScriptRunId — the transient was touched in this run
  2. node.transientNodes.length === 0 — transient children were cleared
  3. node.anchor — there is an anchor to restore

This correctly short-circuits before the existing logic that would recursively visit the anchor (which may have a stale scriptRunId from a prior run) and incorrectly remove it. The comment explains the rationale well.

The logic is consistent with how ClearTransientNodesVisitor.visitTransientNode works — it also returns node.anchor directly when clearing transients, without checking the anchor's staleness.

One minor observation: the early return bypasses the recursive accept(this) call on the anchor. This means if the anchor itself is a BlockNode with stale children, those children would not be pruned in this path. However, this is acceptable because:

  • The anchor is typically an ElementNode (not a block with children).
  • The stale-node cleanup will run again on subsequent reruns when the anchor gets a current scriptRunId.

Test Coverage

The new test case ("restores anchor for transient cleared in current run") correctly validates the core bug scenario: a TransientNode from the current run with an old-run anchor and zero transient children returns the anchor directly.

Minor suggestions (non-blocking):

  • Per the frontend AGENTS.md guidance on "positive + negative assertions," the test could add a complementary assertion (e.g., expect(result).not.toBeUndefined() or expect(result).toBeInstanceOf(ElementNode)) to strengthen the anti-regression signal. However, since toBe(anchor) is quite specific, this is optional.
  • No test covers this early-return path during a fragment run (i.e., with fragmentIdsThisRun set). While the fix is not fragment-specific, a fragment-context test would increase confidence.
  • No E2E test is added. Given this is a race condition triggered by rapid reruns and cache misses, it may be difficult to reliably reproduce in Playwright. The unit test and manual verification are reasonable for this type of fix.

Backwards Compatibility

No breaking changes. The fix only adds an early-return for a specific condition that was previously handled incorrectly (anchor being dropped). Existing behavior for all other cases is preserved — the early return has strictly narrower conditions than the subsequent code path.

Security & Risk

  • Security: No security concerns. The change is purely in render-tree node filtering logic.
  • Regression risk: Low. The three conditions for the early return are narrow and specific. The existing tests for other visitTransientNode scenarios (stale anchor+transients, partial stale transients, etc.) are unaffected and continue to pass.

Accessibility

No accessibility impact. This is a render-tree data structure fix with no DOM or ARIA changes.

Recommendations

  1. (Optional) Consider adding a test case for the early-return path in a fragment-run context (i.e., passing fragmentIdsThisRun to the visitor constructor) to ensure the fix also works correctly during fragment reruns.
  2. (Optional) Add a complementary negative assertion in the test, e.g., verifying the result is not a TransientNode (to ensure unwrapping happened).

Verdict

APPROVED: Clean, minimal, and well-targeted fix for a legitimate race condition. The early-return logic is correct, consistent with existing patterns (e.g., ClearTransientNodesVisitor), and well-tested. No backwards compatibility, security, or accessibility concerns.


This is an automated AI review by opus-4.6-thinking.

@kmcgrady kmcgrady merged commit c0cf390 into develop Feb 28, 2026
55 checks passed
@kmcgrady kmcgrady deleted the fix/clear-transient branch February 28, 2026 00:47
lukasmasuch added a commit that referenced this pull request Mar 11, 2026
## Describe your changes

Fixes a regression introduced in #13849 where stale elements from
previous script runs could persist after spinner completion.

**Root cause:** When a transient node (spinner) captured an element from
a previous run as its anchor, and the transient was cleared, the anchor
was restored directly without checking if it was stale.

**The fix:** Change `return node.anchor` to `return
node.anchor.accept(this)` so the anchor is also checked for staleness
before being restored.

## GitHub Issue Link (if applicable)

Closes #14249

## Testing Plan

- [x] Unit tests: Added test case for stale anchor scenario
- [x] Manual testing: Verified fix using reproduction steps from issue

<!-- agent-metrics-start -->
<details>
<summary>Agent metrics</summary>

| Type | Name | Count |
|------|------|------:|
| subagent | fixing-pr | 1 |

</details>
<!-- agent-metrics-end -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

change:bugfix PR contains bug fix implementation impact:users PR changes affect end users

3 participants