-
Notifications
You must be signed in to change notification settings - Fork 823
Ensure frontend response status code are correct. #2122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure frontend response status code are correct. #2122
Conversation
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one minor nit
I also recommend adding a note about any changes a user might notice in the changelog.
pkg/querier/queryrange/retry.go
Outdated
@@ -42,7 +46,7 @@ func (r retry) Do(ctx context.Context, req Request) (Response, error) { | |||
var lastErr error | |||
for ; tries < r.maxRetries; tries++ { | |||
if ctx.Err() != nil { | |||
return nil, ctx.Err() | |||
return nil, errCanceled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we aren't going to propagate the error up, can we log it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure it's correct logging it, cause we may end up with double logs for 1 cancelled request.
Why not doing httpgrpc.Errorf(http.StatusInternalServerError, ctx.Err().Error())
? I think it's just because of the assertions in tests, but we can change them to not use strict comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know I actually can drop this line. it will work the same as non grpc errors are ending up as 500 with the string from the error.
But I wanted to talk about if that makes sense to do a 500 here. I think request cancelled should be a 4xx from the server point of view, but the upstream client who decided to cancel could do a 502. (gateway).
The problem I foresee here is that some request could be cancelled by a page refresh very quickly and count toward our error budget. While this is not a biggie for Cortex cause the average latency is 1s, it's pretty hard to cancel this before. But in Loki someone could send a 20s query and decide to cancel, it should not count as an error from the gateway point of view.
WDYT @jtlisi ?
TD;DR: I'm fine to remove this line, it doesn't do much right now, but I'd like to talk about how we should handle cancellation in our API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm going to die on this hill.
upstream context cancellation should not 500.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I briefly discuss offline with @cyriltovena about this topic. My personal suggestion:
- Rollback this change in this PR (to unlock the merge without further longer discussion)
- Open a new PR to switch from 500 to 400 the response status code in case of context
Canceled
(but not onDeadlineExceeded
)
I think this was introduced by many of my refactoring, I didn't realize |
We need to update this behaviour somewhere else. Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cyriltovena for addressing my feedback and sorry for the very long delay! LGTM
Signed-off-by: Cyril Tovena cyril.tovena@gmail.com
There were some places where the frontend and middlewares were not using httpgrpc.Errorf which causes some 500 instead of 400.
I'm also wondering if the frontend should really send back a 500 in case of cancellation. Couldn't not find a status quo on this.