Ensure initial state discovery does not block indefinitely on startup#139467
Ensure initial state discovery does not block indefinitely on startup#139467rjernst merged 7 commits intoelastic:mainfrom
Conversation
If a node receives a SIGTERM while it is starting up, it should shutdown immediately. Yet while waiting for initial cluster state the main thread may be blocked. This commit adds a hook to shutdown so that initial state discovery is unblocked once a shutdown is begun.
|
Pinging @elastic/es-core-infra (Team:Core/Infra) |
|
Hi @rjernst, I've created a changelog YAML for you. |
DaveCTurner
left a comment
There was a problem hiding this comment.
LGTM, a few nits but nothing needing another look.
|
|
||
| if (shutdownService.isShuttingDown()) { | ||
| // shutdown started in the middle of startup, so bail early | ||
| logger.warn("shutdown began while waiting for initial discovery state"); |
There was a problem hiding this comment.
Maybe just info here, it doesn't especially need attention if this happens.
| transportService.addRequestHandlingBehavior("internal:discovery/request_peers", (handler, request, channel, task) -> { | ||
| logger.info("blocking peer discovery"); | ||
| reachedBlock.countDown(); | ||
| testDone.await(); |
There was a problem hiding this comment.
Do we need to block here? I think we can just fall through.
There was a problem hiding this comment.
Hrm, how do we ensure we will actually block when we reach the initial state latch? I guess the trick is I need to make the other node not a master so that master discovery never completes?
There was a problem hiding this comment.
how do we ensure we will actually block when we reach the initial state latch?
Blocking here doesn't really ensure that anyway, and it blocks the cluster coordination thread on the new node which is rather artificial and may even reduce the coverage of this test.
If you really want to be sure that the new node never gets past the startup block, you could configure it with a ReadinessService and then assert that this is never started. Or similarly add a ClusterPlugin and verify its onNodeStarted method never executes.
There was a problem hiding this comment.
It's not the validation that I'm concerned with. I want to wait to execute the shutdown until we are blocked on the initial cluster state latch. ie I want to be sure that the countdown of the latch from the shutdown is actually what caused the startup to unblock.
There was a problem hiding this comment.
Sorry if I'm missing something but I don't think this testDone.await() achieves that.
There was a problem hiding this comment.
It doesn't, sorry to conflate the two things. I was asking for suggestions on how to make the test useful as you made me realize this test isn't really doing what I intended it to do.
There was a problem hiding this comment.
I think this test already covers the case we want adequately (particularly if we remove testDone.await() and make INITIAL_STATE_TIMEOUT_SETTING much longer).
We could also ensure we see the shutdown began while waiting for initial discovery state log message after (and only after) the call to joiningNode.prepareForClose()
| Settings.builder() | ||
| .put(baseSettings) | ||
| .put(Environment.PATH_HOME_SETTING.getKey(), createTempDir()) | ||
| .putList("cluster.initial_master_nodes", "master_node") | ||
| .put("node.name", "joining_node") | ||
| .putList(SettingsBasedSeedHostsProvider.DISCOVERY_SEED_HOSTS_SETTING.getKey(), masterAddress) |
There was a problem hiding this comment.
Can/should we set an (unreasonably) long INITIAL_STATE_TIMEOUT_SETTING here too?
There was a problem hiding this comment.
I put a long initial state timeout in the test node configuration now
|
|
||
| Thread startupThread = new Thread(() -> startNode.accept(joiningNode)); | ||
| startupThread.start(); | ||
| reachedBlock.await(); |
There was a problem hiding this comment.
Suggest we impose a timeout and handle interruptions properly here:
| reachedBlock.await(); | |
| safeAwait(reachedBlock); |
DaveCTurner
left a comment
There was a problem hiding this comment.
Looks good still tho the test failure seems germane.
…elastic#139467) If a node receives a SIGTERM while it is starting up, it should shutdown immediately. Yet while waiting for initial cluster state the main thread may be blocked. This commit adds a hook to shutdown so that initial state discovery is unblocked once a shutdown is begun.
💔 Backport failed
You can use sqren/backport to manually backport by running |
…elastic#139467) If a node receives a SIGTERM while it is starting up, it should shutdown immediately. Yet while waiting for initial cluster state the main thread may be blocked. This commit adds a hook to shutdown so that initial state discovery is unblocked once a shutdown is begun.
…elastic#139467) If a node receives a SIGTERM while it is starting up, it should shutdown immediately. Yet while waiting for initial cluster state the main thread may be blocked. This commit adds a hook to shutdown so that initial state discovery is unblocked once a shutdown is begun.
If a node receives a SIGTERM while it is starting up, it should shutdown immediately. Yet while waiting for initial cluster state the main thread may be blocked. This commit adds a hook to shutdown so that initial state discovery is unblocked once a shutdown is begun.