Description
Elasticsearch Version
8.18
Installed Plugins
No response
Java Version
bundled
OS Version
any
Problem Description
We have seen that during a rolling upgrade, ML anomaly detection jobs can get stuck in the OPENING state because the master node was temporarily unavailable when the job tried to update its state to OPENED. The number of jobs that can be in the OPENING state simultaneously on a single ML node is limited by the xpack.ml.node_concurrent_job_allocations setting, which defaults to 2. If 2 jobs on a node are stuck in the OPENING state, no further jobs can open on that node, so vacated jobs cannot be reopened after the cluster upgrade.
In this situation the fix is to force close the jobs stuck in OPENING or to increase the value of xpack.ml.node_concurrent_job_allocations to allow more jobs to open. This requires user intervention. A better solution would be for Elasticsearch to detect jobs that have been stuck in the OPENING state for an extended period of time and attempt to fix them with one of two alternatives:
- Close and reopen the job to fix the problem automatically.
- Stop the job or put it into the failed state, then continue opening the other jobs.
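For reference, the manual workaround available today can be sketched as follows. This is a minimal illustration and not part of any proposed fix: it only builds the REST requests (get job stats, force close, update cluster settings) without sending them. The helper names and the sample stats payload are made up for this example, and the setting is written as documented, `xpack.ml.node_concurrent_job_allocations`.

```python
# Sketch of the manual workaround: find jobs reported as "opening" and build
# the REST requests a user would send. Nothing is sent to a cluster here.
# Helper names and the sample payload below are hypothetical.

SETTING = "xpack.ml.node_concurrent_job_allocations"


def jobs_in_opening_state(stats_response):
    """Return the IDs of jobs whose state is 'opening' in a response shaped
    like the output of GET _ml/anomaly_detectors/_stats."""
    return [
        job["job_id"]
        for job in stats_response.get("jobs", [])
        if job.get("state") == "opening"
    ]


def force_close_request(job_id):
    """Build (method, path, body) for force closing a single job."""
    return ("POST", f"/_ml/anomaly_detectors/{job_id}/_close?force=true", None)


def raise_allocation_limit_request(new_limit):
    """Build (method, path, body) for raising the concurrent allocation limit."""
    return ("PUT", "/_cluster/settings", {"persistent": {SETTING: new_limit}})


# Hand-written sample data, not taken from a real cluster.
sample_stats = {
    "count": 3,
    "jobs": [
        {"job_id": "job-a", "state": "opening"},
        {"job_id": "job-b", "state": "opened"},
        {"job_id": "job-c", "state": "opening"},
    ],
}

stuck_jobs = jobs_in_opening_state(sample_stats)
requests_to_send = [force_close_request(job_id) for job_id in stuck_jobs]
requests_to_send.append(raise_allocation_limit_request(4))

for method, path, body in requests_to_send:
    print(method, path, body)
```

Either the force close requests or the settings update is enough to unblock the node; both are shown only to illustrate the two options.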
We need to design and implement a solution that requires minimal user intervention while providing user feedback about the state of the job migration.
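One possible shape for the detection step, sketched outside Elasticsearch purely to illustrate the idea: a watchdog records when each job was first observed in the OPENING state and reports the jobs that have stayed there longer than a timeout. The class name, the timeout value, and the polling model are all assumptions made for this example; the real implementation would live inside the ML plugin and may look quite different.

```python
# Hypothetical sketch: track how long each job has been observed as "opening"
# and report the ones that exceed a timeout. Timestamps are passed in
# explicitly (in seconds) so the logic is easy to test.

class OpeningWatchdog:
    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._first_seen = {}  # job_id -> time the job was first seen opening

    def observe(self, job_states, now):
        """Record one poll result.

        job_states maps job_id -> state string (for example "opening").
        Returns a sorted list of job IDs that have been opening for at
        least timeout_seconds.
        """
        # Forget jobs that have left the opening state (or disappeared).
        for job_id in list(self._first_seen):
            if job_states.get(job_id) != "opening":
                del self._first_seen[job_id]

        stuck = []
        for job_id, state in job_states.items():
            if state != "opening":
                continue
            first_seen = self._first_seen.setdefault(job_id, now)
            if now - first_seen >= self.timeout_seconds:
                stuck.append(job_id)
        return sorted(stuck)


watchdog = OpeningWatchdog(timeout_seconds=300)
first_poll = watchdog.observe({"job-a": "opening", "job-b": "opened"}, now=0)
second_poll = watchdog.observe({"job-a": "opening", "job-b": "opening"}, now=300)
print(first_poll, second_poll)
```

Jobs reported as stuck could then be handed to whichever remediation is chosen (close and reopen, or fail and move on), and the report itself could feed the user-facing feedback about the migration.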
Steps to Reproduce
Create a large number of anomaly detection jobs and perform a rolling upgrade
Logs (if relevant)
No response