Skip to content

Ensure initial state discovery does not block indefinitely on startup#139467

Merged
rjernst merged 7 commits intoelastic:mainfrom
rjernst:shutdown/cancel_master_wait
Dec 18, 2025
Merged

Ensure initial state discovery does not block indefinitely on startup#139467
rjernst merged 7 commits intoelastic:mainfrom
rjernst:shutdown/cancel_master_wait

Conversation

@rjernst
Copy link
Member

@rjernst rjernst commented Dec 12, 2025

If a node receives a SIGTERM while it is starting up, it should shutdown immediately. Yet while waiting for initial cluster state the main thread may be blocked. This commit adds a hook to shutdown so that initial state discovery is unblocked once a shutdown is begun.

If a node receives a SIGTERM while it is starting up, it should shutdown
immediately. Yet while waiting for initial cluster state the main thread
may be blocked. This commit adds a hook to shutdown so that initial
state discovery is unblocked once a shutdown is begun.
@rjernst rjernst requested a review from a team as a code owner December 12, 2025 18:50
@rjernst rjernst added >bug :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown auto-backport Automatically create backport pull requests when merged branch:9.2 branch:9.1 branch:8.19 labels Dec 12, 2025
@rjernst rjernst requested a review from DaveCTurner December 12, 2025 18:50
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Copy link
Collaborator

Hi @rjernst, I've created a changelog YAML for you.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a few nits but nothing needing another look.


if (shutdownService.isShuttingDown()) {
// shutdown started in the middle of startup, so bail early
logger.warn("shutdown began while waiting for initial discovery state");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just info here, it doesn't especially need attention if this happens.

transportService.addRequestHandlingBehavior("internal:discovery/request_peers", (handler, request, channel, task) -> {
logger.info("blocking peer discovery");
reachedBlock.countDown();
testDone.await();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to block here? I think we can just fall through.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, how do we ensure we will actually block when we reach the initial state latch? I guess the trick is I need to make the other node not a master so that master discovery never completes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we ensure we will actually block when we reach the initial state latch?

Blocking here doesn't really ensure that anyway, and it blocks the cluster coordination thread on the new node which is rather artificial and may even reduce the coverage of this test.

If you really want to be sure that the new node never gets past the startup block, you could configure it with a ReadinessService and then assert that this is never started. Or similarly add a ClusterPlugin and verify its onNodeStarted method never executes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the validation that I'm concerned with. I want to wait to execute the shutdown until we are blocked on the initial cluster state latch. ie I want to be sure that the countdown of the latch from the shutdown is actually what caused the startup to unblock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I'm missing something but I don't think this testDone.await() achieves that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't, sorry to conflate the two things. I was asking for suggestions on how to make the test useful as you made me realize this test isn't really doing what I intended it to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test already covers the case we want adequately (particularly if we remove testDone.await() and make INITIAL_STATE_TIMEOUT_SETTING much longer).

We could also ensure we see the shutdown began while waiting for initial discovery state log message after (and only after) the call to joiningNode.prepareForClose()

Comment on lines 719 to 724
Settings.builder()
.put(baseSettings)
.put(Environment.PATH_HOME_SETTING.getKey(), createTempDir())
.putList("cluster.initial_master_nodes", "master_node")
.put("node.name", "joining_node")
.putList(SettingsBasedSeedHostsProvider.DISCOVERY_SEED_HOSTS_SETTING.getKey(), masterAddress)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can/should we set an (unreasonably) long INITIAL_STATE_TIMEOUT_SETTING here too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put a long initial state timeout in the test node configuration now


Thread startupThread = new Thread(() -> startNode.accept(joiningNode));
startupThread.start();
reachedBlock.await();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest we impose a timeout and handle interruptions properly here:

Suggested change
reachedBlock.await();
safeAwait(reachedBlock);
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good still tho the test failure seems germane.

@rjernst rjernst added the v9.3.0 label Dec 18, 2025
@rjernst rjernst enabled auto-merge (squash) December 18, 2025 21:02
@rjernst rjernst merged commit 77519e5 into elastic:main Dec 18, 2025
35 checks passed
rjernst added a commit to rjernst/elasticsearch that referenced this pull request Dec 18, 2025
…elastic#139467)

If a node receives a SIGTERM while it is starting up, it should shutdown
immediately. Yet while waiting for initial cluster state the main thread
may be blocked. This commit adds a hook to shutdown so that initial
state discovery is unblocked once a shutdown is begun.
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
9.3
9.1
9.2
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 139467

rjernst added a commit to rjernst/elasticsearch that referenced this pull request Dec 18, 2025
…elastic#139467)

If a node receives a SIGTERM while it is starting up, it should shutdown
immediately. Yet while waiting for initial cluster state the main thread
may be blocked. This commit adds a hook to shutdown so that initial
state discovery is unblocked once a shutdown is begun.
rjernst added a commit to rjernst/elasticsearch that referenced this pull request Dec 18, 2025
…elastic#139467)

If a node receives a SIGTERM while it is starting up, it should shutdown
immediately. Yet while waiting for initial cluster state the main thread
may be blocked. This commit adds a hook to shutdown so that initial
state discovery is unblocked once a shutdown is begun.
elasticsearchmachine pushed a commit that referenced this pull request Dec 18, 2025
…#139467) (#139786)

If a node receives a SIGTERM while it is starting up, it should shutdown
immediately. Yet while waiting for initial cluster state the main thread
may be blocked. This commit adds a hook to shutdown so that initial
state discovery is unblocked once a shutdown is begun.
elasticsearchmachine pushed a commit that referenced this pull request Dec 18, 2025
…#139467) (#139787)

If a node receives a SIGTERM while it is starting up, it should shutdown
immediately. Yet while waiting for initial cluster state the main thread
may be blocked. This commit adds a hook to shutdown so that initial
state discovery is unblocked once a shutdown is begun.
elasticsearchmachine pushed a commit that referenced this pull request Dec 18, 2025
…#139467) (#139785)

If a node receives a SIGTERM while it is starting up, it should shutdown
immediately. Yet while waiting for initial cluster state the main thread
may be blocked. This commit adds a hook to shutdown so that initial
state discovery is unblocked once a shutdown is begun.
@rjernst rjernst removed the v8.19.10 label Jan 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged backport pending >bug :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown Team:Core/Infra Meta label for core/infra team v9.1.10 v9.2.4 v9.3.0 v9.4.0

3 participants