Skip to content

Improve rate limit error message for traces size exceeds rate limit #4986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 23, 2025

Conversation

electron0zero
Copy link
Member

@electron0zero electron0zero commented Apr 11, 2025

What this PR does:

I saw errors that reads as pusher failed to consume trace data" err="rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (local: 50000 bytes, global: 500000 bytes) exceeded while adding 966953 bytes for user xxxxxx"

This error message had me confused because the ingestion rate for this user was way under the ingestion rate limit.

After some investigation I noticed that this push failed because the batch size of 966953 bytes is more then the global limit of 500000 bytes (my cluster is using global ingestion rate limiting strategy)

This PR improves the error message when the push failure is due to batch size being more then rate limit and not because user is hitting the ingestion rate limit.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
@@ -374,6 +374,16 @@ func (d *Distributor) checkForRateLimits(tracesSize, spanCount int, userID strin
if d.overrides.IngestionRateStrategy() == overrides.GlobalIngestionRateStrategy {
globalLimit = limit * d.DistributorRing.InstancesCount()
}

if tracesSize > limit || tracesSize > globalLimit {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean that their batch size is too large? should we advise they reduce it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and yes but decided not to put that in error message because we don't give advice in other error messages as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's fine to give more details here. i don't find the new error message any more useful than the old one. you could mention that a single batch exceeded the rate limit.

also, i don't think this check is correct. the push is limited if it exceeds local limit or global limit depending on how tempo is configured. the local limit is just the limit. the global limit is limit / numDistributors. the logic is bit tough to unwind but you can see some if it in ingestion_rate_strategy.go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if what we need is to include the remaining burst size in the error message. I think that is an area that is unclear, because a push can exceed the local limit as a burst. But not when it's consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to add some more tests and comments to better explain the local and global ingestion rate strategy.

also, i don't think this check is correct. the push is limited if it exceeds local limit or global limit depending on how tempo is configured. the local limit is just the limit. the global limit is limit / numDistributors. the logic is bit tough to unwind but you can see some if it in ingestion_rate_strategy.go.

limit is fetched from ingestionRateLimiter so it will be the local or global limit based on the configured IngestionRateStrategy.

I also updated the check to not include globalLimit because that's not correct, instead we should be checking on limit AND burst to only return the batch size error when the batch is bigger then both limit.

if the batch size is more then limit but less then burst, it would allowed by burst and we should return the ingestion rate limit instead of batch size error.

@electron0zero electron0zero force-pushed the limit_message_update branch from 171c74d to 5c64d11 Compare May 22, 2025 23:19
Copy link
Contributor

@zalegrala zalegrala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.

@electron0zero electron0zero force-pushed the limit_message_update branch from 6658e07 to 85424d0 Compare May 23, 2025 16:35
@electron0zero electron0zero enabled auto-merge (squash) May 23, 2025 16:39
@electron0zero electron0zero merged commit bd38bfa into grafana:main May 23, 2025
19 checks passed
@electron0zero electron0zero deleted the limit_message_update branch May 27, 2025 14:39
knylander-grafana pushed a commit to knylander-grafana/tempo-doc-work that referenced this pull request Jun 2, 2025
…rafana#4986)

* Improve rate limit error message for traces size exceeds rate limit

* CHANGELOG.md

* Add more test cases into TestIngestionRateStrategy

* better error message for rate limits

* add test for TestIngestionRateStrategyAllowN

* update the code for batch size too big error

* fix unit for burst
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
5 participants