Skip to content

[Transform] Retry when failing to start indexer#132048

Merged
prwhelan merged 10 commits intoelastic:mainfrom
prwhelan:fix/128221
Aug 12, 2025
Merged

[Transform] Retry when failing to start indexer#132048
prwhelan merged 10 commits intoelastic:mainfrom
prwhelan:fix/128221

Conversation

@prwhelan
Copy link
Member

Bug: when the indexer state fails to save during a cluster state update, the Transform is stuck in STOPPING and cannot be restarted unless the user force stops to delete the task.

Fix: the task will continuously retry starting the indexer until the cluster state update can succeed.

Notes:

  • users can cancel the retry by force stopping the transform
  • the retry is displayed in the UI as "degraded" with a message as to why the transform is restarting
  • the transform now displays as STARTING rather than STOPPING until it successfully starts
  • the retry is audited so it displays in the Messages tab of the UI
  • the retry timer is randomly selected between 45s and 90s, this should help during rolling restarts for clusters that have a large amount of transforms

Fix #128221

Bug: when the indexer state fails to save during a cluster state update, the
Transform is stuck in STOPPING and cannot be restarted unless the user
force stops to delete the task.

Fix: the task will continuously retry starting the indexer until the
cluster state update can succeed.

Notes:
- users can cancel the retry by force stopping the transform
- the retry is displayed in the UI as "degraded" with a message as to
  why the transform is restarting
- the transform now displays as STARTING rather than STOPPING until it
  successfully starts
- the retry is audited so it displays in the Messages tab of the UI
- the retry timer is randomly selected between 45s and 90s, this should
  help during rolling restarts for clusters that have a large amount of
  transforms

Fix elastic#128221
@prwhelan prwhelan added >bug :ml/Transform Transform Team:ML Meta label for the ML team v9.2.0 labels Jul 28, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @prwhelan, I've created a changelog YAML for you.

@prwhelan prwhelan marked this pull request as ready for review August 4, 2025 12:15
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

params.getId(),
Strings.format(
"Failed while starting Transform. Automatically retrying every [%s] seconds. "
+ "To cancel retries, force stop this transform. Failure: [%s]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about including the force stop command in the message here? Or is it expected that the user would do that via a UI button (I'm imagining it being done from the dev console)?

@prwhelan prwhelan merged commit 0f65d79 into elastic:main Aug 12, 2025
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :ml/Transform Transform Team:ML Meta label for the ML team v9.2.0

3 participants