[ML] Prevent AD jobs from being stuck in the "opening" state

Elasticsearch Version

8.18

Installed Plugins

No response

Java Version

bundled

OS Version

any

Problem Description

We have seen that, during a rolling upgrade, ML anomaly detection jobs can get stuck in the OPENING state because the master node was temporarily unavailable when the job tried to update its state to OPENED. The number of jobs that can be in the OPENING state simultaneously on a single ML node is limited by the xpack.ml.node_concurrent_job_allocations setting, which defaults to 2. If two jobs on a node are stuck in the OPENING state, no further jobs can start opening on that node, so vacated jobs cannot be reopened after the cluster upgrade.
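The blocking effect described above can be illustrated with a small toy model. This is a sketch only, not Elasticsearch code: `MlNode`, `try_assign`, and `confirm_opened` are hypothetical names, and the model assumes a lost state update is never retried.

```python
from dataclasses import dataclass, field


@dataclass
class MlNode:
    """Hypothetical model of one ML node (illustration only)."""
    max_opening: int = 2  # mirrors the default concurrent allocation limit
    opening: set = field(default_factory=set)
    opened: set = field(default_factory=set)

    def try_assign(self, job_id: str) -> bool:
        """Accept a job only if an OPENING slot is free."""
        if len(self.opening) >= self.max_opening:
            return False
        self.opening.add(job_id)
        return True

    def confirm_opened(self, job_id: str, master_available: bool) -> bool:
        """Move a job to OPENED; the update is lost if the master is away."""
        if job_id not in self.opening or not master_available:
            return False
        self.opening.remove(job_id)
        self.opened.add(job_id)
        return True


node = MlNode()
node.try_assign("job-a")
node.try_assign("job-b")

# The master is unavailable, so neither job leaves the OPENING state.
node.confirm_opened("job-a", master_available=False)
node.confirm_opened("job-b", master_available=False)

# Both slots stay occupied, so every later assignment is rejected.
blocked = [j for j in ("job-c", "job-d") if not node.try_assign(j)]
```

In this model `blocked` ends up containing both later jobs, which mirrors the reported symptom: the node has capacity, but its OPENING slots never free up.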

In this situation the fix is to force-close the jobs stuck in the OPENING state or to increase the value of xpack.ml.node_concurrent_job_allocations so that more jobs can open. Both require user intervention. A better solution would be for Elasticsearch to detect jobs that have been stuck in the OPENING state for an extended period of time and attempt to fix them with one of two alternatives:

Close and reopen the job to fix the problem automatically.
Stop the job or put it into the failed state, then continue opening the other jobs.
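One possible way to combine the two alternatives above is sketched below: try a close-and-reopen first, and fall back to the failed state if the job is still stuck afterwards. Everything here is an assumption made for illustration, not an existing Elasticsearch API or the agreed design: `JobTask`, `Action`, `plan_remediation`, the 600-second threshold, and the single reopen attempt are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    REOPEN = "close_and_reopen"
    FAIL = "mark_failed"


@dataclass
class JobTask:
    """Hypothetical view of a job's task state (illustration only)."""
    job_id: str
    state: str            # e.g. "opening" or "opened"
    state_since: float    # time in seconds when the job entered this state
    reopen_attempts: int = 0


def plan_remediation(tasks, now, stuck_after=600.0, max_reopen_attempts=1):
    """Decide what to do with each job; thresholds are assumed values."""
    plan = {}
    for task in tasks:
        stuck = task.state == "opening" and now - task.state_since >= stuck_after
        if not stuck:
            plan[task.job_id] = Action.NONE
        elif task.reopen_attempts < max_reopen_attempts:
            # Alternative 1: close and reopen the job automatically.
            plan[task.job_id] = Action.REOPEN
        else:
            # Alternative 2: give up on this job so that others can open.
            plan[task.job_id] = Action.FAIL
    return plan


tasks = [
    JobTask("job-a", "opening", state_since=0.0),
    JobTask("job-b", "opening", state_since=0.0, reopen_attempts=1),
    JobTask("job-c", "opening", state_since=900.0),
    JobTask("job-d", "opened", state_since=0.0),
]
plan = plan_remediation(tasks, now=1000.0)
```

With these inputs, `job-a` is planned for a reopen, `job-b` is marked failed because it has already been retried, and `job-c` and `job-d` are left alone. Returning a plan rather than acting directly would also give a natural place to surface feedback to the user about each job.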

We need to design and implement a solution that requires minimal user intervention while providing user feedback about the state of the job migration.

Steps to Reproduce

Create a large number of anomaly detection jobs and perform a rolling upgrade

Logs (if relevant)

No response

[ML] Prevent AD jobs from being stuck in the "opening" state #126148
