Skip to content

Add support for topk and bottomk functions for TraceQL metrics #4646

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
Apr 10, 2025

Conversation

electron0zero
Copy link
Member

@electron0zero electron0zero commented Jan 31, 2025

What this PR does:

Add support for topk and bottomk functions on top of TraceQL Metrics.

you can now write TraceQL Queries like {} | rate() by (span.client_ip) | topk(10) or {} | rate() by (span.client_ip) | bottomk(10), and only get top or bottomk series from the underlaying TraceQL Metrics Queries.

topk and bottomk behaves like Prometheus topk and bottomk functions.

checkout this demo video to see it in action:

topk_bottomk_demo_final.mov

Topk and Bottomk are implemented as first set of second stage functions (functions that operate on series generated by TraceQL Metrics). code in this PR lays the ground for second stage functions and opens the door for other second stage functions like min, max and avg on top of TraceQL Metrics.

Which issue(s) this PR fixes:
Fixes #4217
Fixes https://github.com/grafana/tempo-squad/issues/601

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
@electron0zero electron0zero changed the title try 1 to get the grammer working Jan 31, 2025
@electron0zero electron0zero force-pushed the topk_bottomk branch 5 times, most recently from bff2f87 to ac57f89 Compare February 26, 2025 10:25
@electron0zero electron0zero force-pushed the topk_bottomk branch 2 times, most recently from eead50b to 3c21a76 Compare February 26, 2025 20:25
Copy link
Contributor

@mdisibio mdisibio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far. Can review again after the process functions are implemented. To make sure we have the same understanding: I believe the result of our discussion was that they need to work like Prometheus and operate on each timestamp interval independently. For example topk(10) can return 10 different series at each interval. In addition to the design, was the performance concerns. This is so that we can push down some of the top/bottom filtering to the job/block level. Each job can return it's top 10, and the frontend then chooses the final top 10. If it is designed the other way, the 10 series with the highest values anywhere in the query, then nothing could be pushed down. All data would have to gathered up in the frontend.

@knylander-grafana
Copy link
Contributor

knylander-grafana commented Mar 13, 2025

Should we add doc for this? I added docs for this

@electron0zero electron0zero force-pushed the topk_bottomk branch 3 times, most recently from 2a1dfdf to 7c857c2 Compare March 17, 2025 17:19
@electron0zero electron0zero marked this pull request as ready for review March 17, 2025 20:35
@electron0zero electron0zero requested a review from mdisibio March 17, 2025 20:35
@@ -1484,3 +1505,119 @@ func FloatizeAttribute(s Span, a Attribute) (float64, StaticType) {
}
return f, v.Type
}

// processTopK implements TopKBottomK topk method
func processTopK(input SeriesSet, limit int) SeriesSet {
Copy link
Contributor

@javiermolinar javiermolinar Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly this is breaks avg_over_time.

The way this function works is by producing two series per matching. One with the current average and another one with the count. That way we can calculate the incremental average in the next aggregation layers:

{\"span.foo\"=\"baz\"} 
{__meta_type=\"__count\", \"span.foo\"=\"baz\"}

Therefore this is mixing count series with average ones, which is not correct.

Another concern is if we should apply topk at the first layer. We could drop series that we need in the next aggregation layer. Maybe this should be done just in the last aggregation one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it means that avg_over_time() is not shardable and must be computed at the query-frontend. Whatever follows an unshardable element is also unshardable. So it means that { } | avg_over_time(...) | topk(...), the topk must be done only at the query-frontend.

Some queries are shardable, for example min/max_over_time, the top/bottomk can also be pushed down to the earliest level.

Starting with always making the secondStage unshardable and computed only in the frontend sounds good. That is more straightforward and we can prioritize correctness first. Adding the shardable/unshardable aspect to the AST will be more involved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had a chat with @javiermolinar about this and updated the code to handle this.

I am only handling topk and bottomk in frontend for now, and we can handle pushdown later when first stage is shardable.

@knylander-grafana knylander-grafana added the type/docs Improvements or additions to documentation label Mar 20, 2025
result() SeriesSet
}

type getExemplar func(Span) (float64, uint64)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the code here is moved from ast.go, to cleanup the ast.go file and extract out metrics related AST into it's own file.


var _ firstStageElement = (*MetricsAggregate)(nil)

// secondStageElement represents operations that are performed
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new code for second stage is from here.

Copy link
Contributor

@knylander-grafana knylander-grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating docs. :) I have two minor suggestions to correct a missing word and remove an extra space.

@electron0zero electron0zero merged commit ef2cee7 into grafana:main Apr 10, 2025
15 checks passed
@electron0zero electron0zero deleted the topk_bottomk branch April 10, 2025 17:39
@alexbikfalvi
Copy link
Contributor

@electron0zero @knylander-grafana I think the documentation didn't publish to next:
https://github.com/grafana/tempo/actions/runs/14386797134/job/40344098510

Is there a way to retrigger?

@electron0zero
Copy link
Member Author

@alexbikfalvi next run that triggered on another PR took care of this so it's published now: https://grafana.com/docs/tempo/next/traceql/metrics-queries/functions/#topk-and-bottomk-functions

so we don't need to re-trigger it now.

tho we should look into this failure, cc @knylander-grafana @jdbaldry

electron0zero added a commit that referenced this pull request Apr 11, 2025
noticed this duplicate code while working on #4646, so sending this standalone PR to cleanup this code.
@09jvilla
Copy link
Contributor

09jvilla commented May 5, 2025

@electron0zero nice work on this. I checked out the demo video since you mentioned it in your presentation and enjoyed watching it :) My only question from that was why we were seeing more than 5 series for a 'topk(5)' query but your docs answered that question exactly when I checked them!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/docs Improvements or additions to documentation
6 participants