
Improve exception for trained model deployment scale up timeout #128218

Merged
dan-rubinstein merged 4 commits into elastic:main from dan-rubinstein:improve-scale-up-error
May 28, 2025

Conversation

@dan-rubinstein (Member) commented May 20, 2025

Description

Issue - https://github.com/elastic/ml-team/issues/1578
This change adds a custom exception that is thrown when a user's inference request times out while waiting for adaptive allocations to scale up.

Testing

  • Unit tests
  • Made an inference call with a short timeout to a default endpoint and confirmed the exception was thrown.
@dan-rubinstein added the >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.19.0, and v9.1.0 labels May 20, 2025
@elasticsearchmachine (Collaborator) commented:

Hi @dan-rubinstein, I've created a changelog YAML for you.

import org.elasticsearch.rest.RestStatus;

public class ModelDeploymentScaleUpException extends ElasticsearchStatusException {
    public ModelDeploymentScaleUpException(String message, RestStatus status) {
        super(message, status);
    }
}
A reviewer (Member) commented on this code:

It looks like the same message parameter is used in every case ("Timed out after [%s] waiting for trained model deployment to start. ...") and the rest status is always timeout.

Given the way it is used, let's call it ModelDeploymentTimeoutException and move the format string into the constructor:

public ModelDeploymentTimeoutException(String deploymentId, TimeValue timeout) {
    super(
        format(
            "Timed out after [%s] waiting for trained model deployment [%s] to start. "
                + "Please ensure the trained model deployment has started and try again.",
            timeout,
            deploymentId
        ),
        RestStatus.REQUEST_TIMEOUT
    );
}
@dan-rubinstein (Member Author) replied:

I agree that we can move the rest status into the constructor, as it is the same across all use cases. As for the message string, it does actually differ slightly across use cases, since we don't always have access to the deploymentId (e.g. in the interface definition). We can omit the deploymentId, but when it is available I think it helps users know which deployment to monitor. Let me know what you think is the best way to handle the message.
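One way to reconcile the two cases is a constructor overload per case: one that takes the deployment id and one that omits it. The following is a minimal standalone sketch of that idea, extending RuntimeException so the snippet compiles on its own (the real class extends ElasticsearchStatusException and uses TimeValue; the message wording here is illustrative):

```java
import java.time.Duration;

// Simplified stand-in for the proposed exception: one constructor for call
// sites that know the deployment id, one for those (e.g. interface default
// methods) that do not.
class ModelDeploymentTimeoutException extends RuntimeException {

    ModelDeploymentTimeoutException(String deploymentId, Duration timeout) {
        super(String.format(
            "Timed out after [%s] waiting for trained model deployment [%s] to start. "
                + "Use the trained model stats API to track the state of the deployment "
                + "and try again once it has started.",
            timeout, deploymentId));
    }

    ModelDeploymentTimeoutException(Duration timeout) {
        super(String.format(
            "Timed out after [%s] waiting for the trained model deployment to start. "
                + "Use the trained model stats API to track the state of the deployment "
                + "and try again once it has started.",
            timeout));
    }
}
```

Both overloads share one timeout semantics and (implicitly) one REQUEST_TIMEOUT status, so call sites only differ in whether they can name the deployment.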

@davidkyle (Member) commented:

Testing this I noticed there is a bug where the request does not timeout waiting for the model download. Starting with a clean Elasticsearch that does not have the ELSER model downloaded, this request should timeout after 5 seconds

curl --request POST 'http://localhost:9200/_inference/.elser-2-elasticsearch?timeout=5s' \
  --header "Content-Type: application/json" \
  --data '{"input": "A short timeout should occur before the model download is complete"}'

But in practice the user has to wait for the model download to complete (60 seconds in my case) before the request times out. The bug is probably in this code.

Should we still extend ElasticsearchStatusException (in case someone is catching it)?

Yes for the stated reason.

Is the RestStatus correct?

I can't think of a more appropriate status code

Are there any other pieces of information I should add?
Is the naming of the exception clear enough? Is the error message clear enough?
Should we tell the user to ensure that the model deployment has started or just tell them to wait and try again?

I thought the old message, which included the instruction "Use the trained model stats API to track the state of the deployment.", was useful. Can you include that along with the try-again instruction?

Should we just call it ModelDeploymentTimeoutException?

++

@dan-rubinstein (Member Author) commented May 22, 2025

Testing this I noticed there is a bug where the request does not timeout waiting for the model download. Starting with a clean Elasticsearch that does not have the ELSER model downloaded, this request should timeout after 5 seconds

I've noticed this bug as well, but I'm not entirely sure where the problem originates. Running with a debugger, it seems the TaskRetriever.getDownloadTaskInfo call is hit well after the timeout, so it's possible we are stuck waiting for the deployment elsewhere. From a quick test, it seems that our putModel call, which I believe starts the model download, does not use the timeout, so we wait for the download to complete before initiating the start trained model deployment request. I think this requires more investigation and the fix should be part of a separate change. I'll create an issue to investigate further.
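The suspected fix direction can be illustrated with a standalone sketch (all names here are hypothetical; the real putModel path lives in Elasticsearch's transport code): bound the wait on the download with the caller's timeout instead of blocking until the download completes.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class DownloadWithTimeout {

    // Hypothetical stand-in for the model download kicked off by putModel.
    static CompletableFuture<String> startModelDownload(Duration downloadTime) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(downloadTime.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "downloaded";
        });
    }

    // Bound the wait with the request timeout; on expiry a TimeoutException
    // surfaces to the caller instead of the request hanging for the full download.
    static String awaitDownload(CompletableFuture<String> download, Duration timeout)
            throws TimeoutException {
        try {
            return download.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With a 5s request timeout and a 60s download, the bounded wait fails fast with a timeout, which is the behavior the curl reproduction above expects.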

@dan-rubinstein dan-rubinstein marked this pull request as ready for review May 23, 2025 14:30
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/ml-core (Team:ML)

@davidkyle (Member) replied:

I think it requires more investigation and the fix should be part of a separate change. I'll create an issue to investigate this further.

++ please create the issue

@davidkyle (Member) left a review:

LGTM

@dan-rubinstein dan-rubinstein requested a review from Mikep86 May 27, 2025 13:48
@dan-rubinstein (Member Author) replied:

I think it requires more investigation and the fix should be part of a separate change. I'll create an issue to investigate this further.

++ please create the issue

Issue created

@Mikep86 (Contributor) left a review:

LGTM. Not blocking, but are there any unit tests we can add for some automated testing against regressions?

@dan-rubinstein (Member Author) commented:

@elasticmachine merge upstream

@dan-rubinstein dan-rubinstein added the auto-backport Automatically create backport pull requests when merged label May 28, 2025
@dan-rubinstein (Member Author) replied:

LGTM. Not blocking, but are there any unit tests we can add for some automated testing against regressions?

As this is within the transport layer, we currently don't have unit tests for this code, but I do think having some testing around it would make sense. I'll discuss with the team to see if this can fit into our QA pipeline, where we have similar testing for these code paths.

@dan-rubinstein dan-rubinstein merged commit 53668f7 into elastic:main May 28, 2025
18 checks passed
@elasticsearchmachine (Collaborator) commented:

💚 Backport successful

Branch 8.19 — backport successful
elasticsearchmachine pushed a commit that referenced this pull request May 28, 2025
Improve exception for trained model deployment scale up timeout (#128218) (#128579)

* Improve exception for trained model deployment scale up timeout

* Update docs/changelog/128218.yaml

* Rename exception and update exception message

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

Labels

auto-backport (Automatically create backport pull requests when merged), >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.19.0, v9.1.0

5 participants