Improve exception for trained model deployment scale up timeout #128218
dan-rubinstein merged 4 commits into elastic:main
Conversation
Hi @dan-rubinstein, I've created a changelog YAML for you.
import org.elasticsearch.rest.RestStatus;

public class ModelDeploymentScaleUpException extends ElasticsearchStatusException {
    public ModelDeploymentScaleUpException(String message, RestStatus status) {
It looks like the same message parameter is used in every case ("Timed out after [%s] waiting for trained model deployment to start. ...") and the rest status is always timeout.
Given the way it is used, let's call it ModelDeploymentTimeoutException and move the format string into the constructor:
public ModelDeploymentTimeoutException(String deploymentId, TimeValue timeout) {
    super(
        format(
            "Timed out after [%s] waiting for trained model deployment [%s] to start. "
                + "Please ensure the trained model deployment has started and try again.",
            timeout,
            deploymentId
        ),
        RestStatus.REQUEST_TIMEOUT
    );
}
I agree that we can move the rest status into the constructor, as it is the same across all use cases. As for the message string, it does actually differ slightly across use cases, since we don't always have access to the deploymentId (e.g. in the definition of the interface here). We can omit the deploymentId, but I figured that when it is available, it helps users know which deployment to monitor. Let me know what you think is the best way to handle the message.
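One way to reconcile the two points above is a pair of constructors, one with and one without the deploymentId. The sketch below is hypothetical (the real class extends ElasticsearchStatusException and takes a TimeValue; a plain RuntimeException and String timeout are used here to keep it self-contained), with the message text taken from the review suggestion:

```java
// Hypothetical sketch: stands in for the real Elasticsearch exception class.
class ModelDeploymentTimeoutException extends RuntimeException {

    // Used when the deployment id is known at the call site.
    ModelDeploymentTimeoutException(String deploymentId, String timeout) {
        super(String.format(
            "Timed out after [%s] waiting for trained model deployment [%s] to start. "
                + "Please ensure the trained model deployment has started and try again.",
            timeout,
            deploymentId
        ));
    }

    // Used where the deployment id is not available, e.g. in an interface definition.
    ModelDeploymentTimeoutException(String timeout) {
        super(String.format(
            "Timed out after [%s] waiting for trained model deployment to start. "
                + "Please ensure the trained model deployment has started and try again.",
            timeout
        ));
    }
}
```

Callers that know the deployment id get the richer message; the interface-level code path falls back to the generic one.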
Testing this, I noticed there is a bug where the request does not time out while waiting for the model download. Starting with a clean Elasticsearch that does not have the ELSER model downloaded, this request should time out after 5 seconds. But what happens is that the user has to wait for the model download to complete (60 seconds in my case), and only then does it time out. The bug is probably in this code.
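A plausible shape for that bug (hypothetical, not the actual Elasticsearch code) is that the timer is only armed after the download completes, e.g. a blocking, unbounded wait on the download before the timed wait on deployment start. Bounding the whole chain fixes it, sketched here with plain CompletableFuture:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

class TimeoutScopeDemo {
    // Stands in for the model download; takes much longer than the request timeout.
    static String slowDownload() {
        try {
            Thread.sleep(2_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "model-ready";
    }

    // Buggy shape: downloadFuture.get() with no timeout, then a 500ms timed wait
    // on deployment start -- the caller still blocks for the full download.
    // Fixed shape below: the timeout covers the whole chain, download included.
    static String waitBounded() {
        CompletableFuture<String> f = CompletableFuture
            .supplyAsync(TimeoutScopeDemo::slowDownload)
            .orTimeout(500, TimeUnit.MILLISECONDS);
        try {
            return f.get();
        } catch (ExecutionException e) {
            // Fails fast instead of waiting out the download.
            return e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(waitBounded()); // prints "TimeoutException" after ~500ms
    }
}
```

The same principle applies whatever the real mechanism is: the deadline must be measured from when the request arrives, not from when the download finishes.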
Yes for the stated reason.
I can't think of a more appropriate status code
I thought the old message that included the instruction "Use the trained model stats API to track the state of the deployment." was useful. Can you include that along with the try-again instruction?
++
I've noticed this bug as well, but I'm not entirely sure where the problem is originating. Running with a debugger, it seems that the
Pinging @elastic/ml-core (Team:ML)
++ please create the issue
Issue created
Mikep86 left a comment
LGTM. Not blocking, but are there any unit tests we can add for some automated testing against regressions?
@elasticmachine merge upstream
As this is within the transport layer, we currently don't have unit tests for this code, but I do think having some testing around it would make sense. I'll discuss with the team to see if this can fit into our QA pipeline, where we have similar testing for these code paths.
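Even without transport-layer fixtures, the exception itself can get regression coverage by constructing it directly and asserting on the message. A sketch, with the class and message shape assumed from the review discussion rather than copied from the merged code:

```java
// Hypothetical regression test; the message format is the one proposed in review.
class ModelDeploymentTimeoutExceptionTests {

    // Stand-in builder mirroring the assumed constructor behavior.
    static RuntimeException build(String deploymentId, String timeout) {
        return new RuntimeException(String.format(
            "Timed out after [%s] waiting for trained model deployment [%s] to start. "
                + "Please ensure the trained model deployment has started and try again.",
            timeout,
            deploymentId
        ));
    }

    // Guards against the timeout or deployment id being dropped from the message.
    static void testMessageIncludesTimeoutAndDeploymentId() {
        RuntimeException e = build("my-elser-deployment", "30s");
        assert e.getMessage().contains("[30s]");
        assert e.getMessage().contains("[my-elser-deployment]");
        assert e.getMessage().contains("try again");
    }
}
```

In the real repo this would extend ESTestCase and construct the actual exception, but the assertions stay the same.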
💚 Backport successful
(#128579)
* Improve exception for trained model deployment scale up timeout
* Update docs/changelog/128218.yaml
* Rename exception and update exception message

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Description
Issue - https://github.com/elastic/ml-team/issues/1578
This change adds a custom exception that is thrown when a user's inference request times out while waiting for adaptive allocations to scale up.
Testing