-
Notifications
You must be signed in to change notification settings - Fork 585
Improve rate limit error message for traces size exceeds rate limit #4986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve rate limit error message for traces size exceeds rate limit #4986
Conversation
modules/distributor/distributor.go
Outdated
@@ -374,6 +374,16 @@ func (d *Distributor) checkForRateLimits(tracesSize, spanCount int, userID strin | |||
if d.overrides.IngestionRateStrategy() == overrides.GlobalIngestionRateStrategy { | |||
globalLimit = limit * d.DistributorRing.InstancesCount() | |||
} | |||
|
|||
if tracesSize > limit || tracesSize > globalLimit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this mean that their batch size is too large? should we advise they reduce it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, and yes but decided not to put that in error message because we don't give advice in other error messages as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's fine to give more details here. i don't find the new error message any more useful than the old one. you could mention that a single batch exceeded the rate limit.
also, i don't think this check is correct. the push is limited if it exceeds local limit or global limit depending on how tempo is configured. the local limit is just the limit. the global limit is limit / numDistributors. the logic is bit tough to unwind but you can see some if it in ingestion_rate_strategy.go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if what we need is to include the remaining burst size in the error message. I think that is an area that is unclear, because a push can exceed the local limit as a burst. But not when it's consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to add some more tests and comments to better explain the local and global ingestion rate strategy.
also, i don't think this check is correct. the push is limited if it exceeds local limit or global limit depending on how tempo is configured. the local limit is just the limit. the global limit is limit / numDistributors. the logic is bit tough to unwind but you can see some if it in ingestion_rate_strategy.go.
limit is fetched from ingestionRateLimiter so it will be the local or global limit based on the configured IngestionRateStrategy.
I also updated the check to not include globalLimit because that's not correct, instead we should be checking on limit AND burst to only return the batch size error when the batch is bigger then both limit.
if the batch size is more then limit but less then burst, it would allowed by burst and we should return the ingestion rate limit instead of batch size error.
9b0bdd1
to
171c74d
Compare
171c74d
to
5c64d11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me.
6658e07
to
85424d0
Compare
…rafana#4986) * Improve rate limit error message for traces size exceeds rate limit * CHANGELOG.md * Add more test cases into TestIngestionRateStrategy * better error message for rate limits * add test for TestIngestionRateStrategyAllowN * update the code for batch size too big error * fix unit for burst
What this PR does:
I saw errors that reads as
pusher failed to consume trace data" err="rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (local: 50000 bytes, global: 500000 bytes) exceeded while adding 966953 bytes for user xxxxxx"
This error message had me confused because the ingestion rate for this user was way under the ingestion rate limit.
After some investigation I noticed that this push failed because the batch size of 966953 bytes is more then the global limit of 500000 bytes (my cluster is using
global
ingestion rate limiting strategy)This PR improves the error message when the push failure is due to batch size being more then rate limit and not because user is hitting the ingestion rate limit.
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]