
Improve exception for trained model deployment scale up timeout #128218

Merged
dan-rubinstein merged 4 commits into elastic:main from dan-rubinstein:improve-scale-up-error
May 28, 2025

Conversation

@dan-rubinstein (Member) commented May 20, 2025

Description

Issue - https://github.com/elastic/ml-team/issues/1578
This change adds a custom exception that is thrown when a user's inference request times out while waiting for adaptive allocations to scale up.

Testing

  • Unit tests
  • Made an inference call with a short timeout to a default endpoint and confirmed the exception was thrown.
@dan-rubinstein added the >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.19.0, and v9.1.0 labels May 20, 2025
@elasticsearchmachine (Collaborator) commented:

Hi @dan-rubinstein, I've created a changelog YAML for you.

import org.elasticsearch.rest.RestStatus;

public class ModelDeploymentScaleUpException extends ElasticsearchStatusException {
    public ModelDeploymentScaleUpException(String message, RestStatus status) {
        super(message, status);
    }
}
A reviewer (Member) commented on this code:

It looks like the same message parameter is used in every case ("Timed out after [%s] waiting for trained model deployment to start. ...") and the rest status is always timeout.

Given the way it is used, let's call it ModelDeploymentTimeoutException and move the format string into the constructor:

public ModelDeploymentTimeoutException(String deploymentId, TimeValue timeout) {
    super(
        format(
            "Timed out after [%s] waiting for trained model deployment [%s] to start. "
                + "Please ensure the trained model deployment has started and try again.",
            timeout,
            deploymentId
        ),
        RestStatus.REQUEST_TIMEOUT
    );
}
@dan-rubinstein (Member Author) replied:

I agree that we can move the rest status into the constructor, as it is the same across all use cases. As for the message string, it does actually differ slightly across use cases, since we don't always have access to the deploymentId (e.g. in the interface definition). We can omit the deploymentId, but when it is available I think it helps users know which deployment to monitor. Let me know what you think is the best way to handle the message.
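One way to reconcile the two cases is a constructor overload per case: one that takes the deployment id and one that omits it. The following is a minimal standalone sketch of that idea, extending RuntimeException so the snippet compiles on its own (the real class extends ElasticsearchStatusException and uses TimeValue; the message wording here is illustrative):

```java
import java.time.Duration;

// Simplified stand-in for the proposed exception: one constructor for call
// sites that know the deployment id, one for those (e.g. interface default
// methods) that do not.
class ModelDeploymentTimeoutException extends RuntimeException {

    ModelDeploymentTimeoutException(String deploymentId, Duration timeout) {
        super(String.format(
            "Timed out after [%s] waiting for trained model deployment [%s] to start. "
                + "Use the trained model stats API to track the state of the deployment "
                + "and try again once it has started.",
            timeout, deploymentId));
    }

    ModelDeploymentTimeoutException(Duration timeout) {
        super(String.format(
            "Timed out after [%s] waiting for the trained model deployment to start. "
                + "Use the trained model stats API to track the state of the deployment "
                + "and try again once it has started.",
            timeout));
    }
}
```

Both overloads share one timeout semantics and (implicitly) one REQUEST_TIMEOUT status, so call sites only differ in whether they can name the deployment.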

@davidkyle (Member) commented:

Testing this I noticed there is a bug where the request does not timeout waiting for the model download. Starting with a clean Elasticsearch that does not have the ELSER model downloaded, this request should timeout after 5 seconds

curl --request POST 'http://localhost:9200/_inference/.elser-2-elasticsearch?timeout=5s' \
  --header "Content-Type: application/json" \
  --data '{"input": "A short timeout should occur before the model download is complete"}'

But in practice the user has to wait for the model download to complete (60 seconds in my case) before the request times out. The bug is probably in this code.

Should we still extend ElasticsearchStatusException (in case someone is catching it)?

Yes for the stated reason.

Is the RestStatus correct?

I can't think of a more appropriate status code

Are there any other pieces of information I should add?
Is the naming of the exception clear enough? Is the error message clear enough?
Should we tell the user to ensure that the model deployment has started or just tell them to wait and try again?

I thought the old message, which included the instruction "Use the trained model stats API to track the state of the deployment.", was useful. Can you include that along with the try-again instruction?

Should we just call it ModelDeploymentTimeoutException?

++

@dan-rubinstein (Member Author) commented May 22, 2025

Testing this I noticed there is a bug where the request does not timeout waiting for the model download. Starting with a clean Elasticsearch that does not have the ELSER model downloaded, this request should timeout after 5 seconds

I've noticed this bug as well, but I'm not entirely sure where the problem originates. Running with a debugger, it seems the TaskRetriever.getDownloadTaskInfo call is hit well after the timeout, so it's possible we are stuck waiting for the deployment elsewhere. From a quick test, it seems that our putModel call, which I believe starts the model download, does not use the timeout, so we wait for the download to complete before initiating the start trained model deployment request. I think this requires more investigation and the fix should be part of a separate change. I'll create an issue to investigate further.
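The suspected fix direction can be illustrated with a standalone sketch (all names here are hypothetical; the real putModel path lives in Elasticsearch's transport code): bound the wait on the download with the caller's timeout instead of blocking until the download completes.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class DownloadWithTimeout {

    // Hypothetical stand-in for the model download kicked off by putModel.
    static CompletableFuture<String> startModelDownload(Duration downloadTime) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(downloadTime.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "downloaded";
        });
    }

    // Bound the wait with the request timeout; on expiry a TimeoutException
    // surfaces to the caller instead of the request hanging for the full download.
    static String awaitDownload(CompletableFuture<String> download, Duration timeout)
            throws TimeoutException {
        try {
            return download.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With a 5s request timeout and a 60s download, the bounded wait fails fast with a timeout, which is the behavior the curl reproduction above expects.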

@dan-rubinstein dan-rubinstein marked this pull request as ready for review May 23, 2025 14:30
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/ml-core (Team:ML)

@davidkyle (Member) replied:

I think it requires more investigation and the fix should be part of a separate change. I'll create an issue to investigate this further.

++ please create the issue

@davidkyle (Member) left a review:

LGTM

@dan-rubinstein dan-rubinstein requested a review from Mikep86 May 27, 2025 13:48
@dan-rubinstein (Member Author) replied:

I think it requires more investigation and the fix should be part of a separate change. I'll create an issue to investigate this further.

++ please create the issue

Issue created

@Mikep86 (Contributor) left a review:

LGTM. Not blocking, but are there any unit tests we can add for some automated testing against regressions?

@dan-rubinstein (Member Author) commented:

@elasticmachine merge upstream

@dan-rubinstein dan-rubinstein added the auto-backport Automatically create backport pull requests when merged label May 28, 2025
@dan-rubinstein (Member Author) replied:

LGTM. Not blocking, but are there any unit tests we can add for some automated testing against regressions?

As this is within the transport layer, we currently don't have unit tests for this code, but I do think having some testing around it would make sense. I'll discuss with the team to see if this can fit into our QA pipeline, where we have similar testing for these code paths.

@dan-rubinstein dan-rubinstein merged commit 53668f7 into elastic:main May 28, 2025
18 checks passed
@elasticsearchmachine (Collaborator) commented:

💚 Backport successful

Branch 8.19 — backport successful
elasticsearchmachine pushed a commit that referenced this pull request May 28, 2025
Improve exception for trained model deployment scale up timeout (#128218) (#128579)

* Improve exception for trained model deployment scale up timeout

* Update docs/changelog/128218.yaml

* Rename exception and update exception message

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

Labels

auto-backport (Automatically create backport pull requests when merged), >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.19.0, v9.1.0

5 participants