Ensure initial state discovery does not block indefinitely on startup by rjernst · Pull Request #139467 · elastic/elasticsearch

rjernst · 2025-12-12T18:50:35Z

If a node receives a SIGTERM while it is starting up, it should shut down immediately. However, the main thread may be blocked while waiting for the initial cluster state. This commit adds a shutdown hook so that initial state discovery is unblocked as soon as shutdown begins.
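To illustrate the idea in the description, here is a minimal, self-contained sketch of a startup wait that a shutdown hook can release. It is not the code in Node.java: the class StartupLatchSketch and its methods onInitialState, onShutdown, and awaitInitialState are hypothetical names chosen for this example, and the real change wires into the node's existing shutdown service.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names do not come from the Elasticsearch code base.
public class StartupLatchSketch {
    // Released either when an initial cluster state arrives or when shutdown begins.
    private final CountDownLatch initialStateLatch = new CountDownLatch(1);
    private volatile boolean shuttingDown = false;

    /** Called once an initial cluster state is known. */
    public void onInitialState() {
        initialStateLatch.countDown();
    }

    /** Called by the shutdown hook, for example on SIGTERM. */
    public void onShutdown() {
        shuttingDown = true;            // set the flag before releasing the waiter
        initialStateLatch.countDown();
    }

    /** Returns true if startup should continue, false if shutdown began while waiting. */
    public boolean awaitInitialState(long timeout, TimeUnit unit) throws InterruptedException {
        initialStateLatch.await(timeout, unit);
        return shuttingDown == false;
    }

    public static void main(String[] args) throws Exception {
        StartupLatchSketch node = new StartupLatchSketch();
        new Thread(node::onShutdown).start();   // simulate a SIGTERM arriving during startup
        boolean proceed = node.awaitInitialState(30, TimeUnit.SECONDS);
        System.out.println(proceed ? "continuing startup" : "shutdown began during startup, bailing out");
    }
}
```

Setting the flag before counting down the latch means the startup thread always observes the shutdown once it wakes up, so it can bail out early instead of continuing.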

elasticsearchmachine · 2025-12-12T18:51:00Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine · 2025-12-12T18:51:24Z

Hi @rjernst, I've created a changelog YAML for you.

DaveCTurner

LGTM, a few nits but nothing needing another look.

DaveCTurner · 2025-12-12T20:16:55Z

server/src/main/java/org/elasticsearch/node/Node.java

+
+                if (shutdownService.isShuttingDown()) {
+                    // shutdown started in the middle of startup, so bail early
+                    logger.warn("shutdown began while waiting for initial discovery state");


Maybe just info here, it doesn't especially need attention if this happens.

DaveCTurner · 2025-12-12T20:19:29Z

server/src/test/java/org/elasticsearch/node/NodeTests.java

+            transportService.addRequestHandlingBehavior("internal:discovery/request_peers", (handler, request, channel, task) -> {
+                logger.info("blocking peer discovery");
+                reachedBlock.countDown();
+                testDone.await();


Do we need to block here? I think we can just fall through.

Hrm, how do we ensure we will actually block when we reach the initial state latch? I guess the trick is I need to make the other node not a master so that master discovery never completes?

> how do we ensure we will actually block when we reach the initial state latch?

Blocking here doesn't really ensure that anyway, and it blocks the cluster coordination thread on the new node which is rather artificial and may even reduce the coverage of this test.

If you really want to be sure that the new node never gets past the startup block, you could configure it with a ReadinessService and then assert that this is never started. Or similarly add a ClusterPlugin and verify its onNodeStarted method never executes.

It's not the validation that I'm concerned with. I want to wait to execute the shutdown until we are blocked on the initial cluster state latch, i.e. I want to be sure that the countdown of the latch from the shutdown is actually what caused the startup to unblock.

Sorry if I'm missing something but I don't think this testDone.await() achieves that.

It doesn't, sorry to conflate the two things. I was asking for suggestions on how to make the test useful as you made me realize this test isn't really doing what I intended it to do.

I think this test already covers the case we want adequately (particularly if we remove testDone.await() and make INITIAL_STATE_TIMEOUT_SETTING much longer).

We could also ensure we see the "shutdown began while waiting for initial discovery state" log message after (and only after) the call to joiningNode.prepareForClose().
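The ordering concern discussed above (only trigger the shutdown once the startup thread is really blocked, and confirm that the shutdown is what released it) can be sketched outside Elasticsearch as follows. This is an assumption-laden illustration, not what the PR's test does: the class BlockedStartupCheck is hypothetical, and polling Thread.getState() is a stand-in for the log-message check suggested in the thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; not taken from NodeTests.java.
public class BlockedStartupCheck {
    /** Returns true if the startup thread was released by the simulated shutdown. */
    public static boolean shutdownUnblocksStartup() throws InterruptedException {
        CountDownLatch initialStateLatch = new CountDownLatch(1);
        AtomicBoolean shutdownRequested = new AtomicBoolean(false);
        AtomicBoolean releasedByShutdown = new AtomicBoolean(false);

        Thread startupThread = new Thread(() -> {
            try {
                // Unreasonably long timeout, so only a countDown() can release the wait.
                boolean released = initialStateLatch.await(1, TimeUnit.HOURS);
                releasedByShutdown.set(released && shutdownRequested.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        startupThread.start();

        // Wait until the thread is parked inside await() before simulating the shutdown.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
        while (startupThread.getState() != Thread.State.TIMED_WAITING) {
            if (System.nanoTime() - deadline > 0) {
                startupThread.interrupt();
                return false;
            }
            Thread.sleep(10);
        }

        shutdownRequested.set(true);
        initialStateLatch.countDown();
        startupThread.join(TimeUnit.SECONDS.toMillis(10));
        return startupThread.isAlive() == false && releasedByShutdown.get();
    }
}
```

The long timeout plays the same role as the much longer INITIAL_STATE_TIMEOUT_SETTING proposed in the review: if the wait ends quickly, it cannot have been the timeout that ended it.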

DaveCTurner · 2025-12-12T20:20:15Z

server/src/test/java/org/elasticsearch/node/NodeTests.java

+                Settings.builder()
+                    .put(baseSettings)
+                    .put(Environment.PATH_HOME_SETTING.getKey(), createTempDir())
+                    .putList("cluster.initial_master_nodes", "master_node")
+                    .put("node.name", "joining_node")
+                    .putList(SettingsBasedSeedHostsProvider.DISCOVERY_SEED_HOSTS_SETTING.getKey(), masterAddress)


Can/should we set an (unreasonably) long INITIAL_STATE_TIMEOUT_SETTING here too?

I put a long initial state timeout in the test node configuration now

DaveCTurner · 2025-12-12T20:20:45Z

server/src/test/java/org/elasticsearch/node/NodeTests.java

+
+                Thread startupThread = new Thread(() -> startNode.accept(joiningNode));
+                startupThread.start();
+                reachedBlock.await();


Suggest we impose a timeout and handle interruptions properly here:

Suggested change

-                reachedBlock.await();
+                safeAwait(reachedBlock);
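For readers unfamiliar with the pattern behind this suggestion, the sketch below shows what "impose a timeout and handle interruptions properly" can look like for a latch. It is a stand-in written for this explanation, not the test framework's safeAwait implementation: the class AwaitHelperSketch, the method awaitOrFail, and its explicit timeout parameters are assumptions.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; the real safeAwait helper may differ in its details.
public class AwaitHelperSketch {
    /** Waits for the latch, failing on timeout or interruption instead of hanging the test. */
    public static void awaitOrFail(CountDownLatch latch, long timeout, TimeUnit unit) {
        try {
            if (latch.await(timeout, unit) == false) {
                throw new AssertionError("timed out after " + timeout + " " + unit + " waiting for latch");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status for callers
            throw new AssertionError("interrupted while waiting for latch", e);
        }
    }
}
```

A bare await() can hang a test run indefinitely if the latch is never counted down, whereas a bounded wait turns that situation into an ordinary test failure.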

DaveCTurner

Still looks good, though the test failure seems germane.

…elastic#139467) If a node receives a SIGTERM while it is starting up, it should shutdown immediately. Yet while waiting for initial cluster state the main thread may be blocked. This commit adds a hook to shutdown so that initial state discovery is unblocked once a shutdown is begun.

elasticsearchmachine · 2025-12-18T22:17:22Z

💔 Backport failed

Status	Branch	Result
✅	9.3
✅	9.1
✅	9.2
❌	8.19	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 139467

…#139467) (#139786) If a node receives a SIGTERM while it is starting up, it should shutdown immediately. Yet while waiting for initial cluster state the main thread may be blocked. This commit adds a hook to shutdown so that initial state discovery is unblocked once a shutdown is begun.

…#139467) (#139787) If a node receives a SIGTERM while it is starting up, it should shutdown immediately. Yet while waiting for initial cluster state the main thread may be blocked. This commit adds a hook to shutdown so that initial state discovery is unblocked once a shutdown is begun.

…#139467) (#139785) If a node receives a SIGTERM while it is starting up, it should shutdown immediately. Yet while waiting for initial cluster state the main thread may be blocked. This commit adds a hook to shutdown so that initial state discovery is unblocked once a shutdown is begun.

rjernst requested a review from a team as a code owner December 12, 2025 18:50

rjernst added the >bug, :Core/Infra/Node Lifecycle, auto-backport, branch:9.2, branch:9.1, and branch:8.19 labels Dec 12, 2025

rjernst requested a review from DaveCTurner December 12, 2025 18:50

elasticsearchmachine added the Team:Core/Infra, v9.3.0, v9.1.10, and v9.2.4 labels Dec 12, 2025

elasticsearchmachine added v8.19.10 and removed branch:9.2 branch:9.1 branch:8.19 labels Dec 12, 2025

Update docs/changelog/139467.yaml

a4c944a

[CI] Auto commit changes from spotless

208b209

DaveCTurner approved these changes Dec 12, 2025

rjernst added 2 commits December 15, 2025 07:25

improve test and address comments

f79b621

Merge branch 'main' into shutdown/cancel_master_wait

f2b8fcf

DaveCTurner approved these changes Dec 15, 2025

elasticsearchmachine added v9.4.0 and removed v9.3.0 labels Dec 17, 2025

rjernst added 2 commits December 18, 2025 12:58

remove unnecessary test

70078d3

Merge branch 'main' into shutdown/cancel_master_wait

37df7c6

rjernst added the v9.3.0 label Dec 18, 2025

rjernst enabled auto-merge (squash) December 18, 2025 21:02

rjernst merged commit 77519e5 into elastic:main Dec 18, 2025
35 checks passed

elasticsearchmachine added the backport pending label Dec 18, 2025

szybia mentioned this pull request Dec 22, 2025

[CI] ReindexNodeShutdownIT testReindexWithShutdown failing #139806

Closed

rjernst removed the v8.19.10 label Jan 15, 2026

rjernst merged 7 commits into elastic:main from rjernst:shutdown/cancel_master_wait

