Add support for topk and bottomk functions for TraceQL metrics #4646
Conversation
Force-pushed from b935091 to 8c6bcf5.
Force-pushed from bff2f87 to ac57f89.
Force-pushed from eead50b to 3c21a76.
Looking good so far. Can review again after the process functions are implemented. To make sure we have the same understanding: I believe the result of our discussion was that they need to work like Prometheus and operate on each timestamp interval independently. For example, topk(10) can return 10 different series at each interval. In addition to the design, there were performance concerns. This is so that we can push down some of the top/bottom filtering to the job/block level. Each job can return its top 10, and the frontend then chooses the final top 10. If it were designed the other way, as the 10 series with the highest values anywhere in the query, then nothing could be pushed down; all data would have to be gathered up in the frontend.
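For illustration, here is a minimal Go sketch (not code from this PR) of why the per-interval, Prometheus-style definition allows pushdown: each job can ship only its local top-k candidates for an interval, and re-ranking the union of those candidates at the frontend still yields the correct global top-k.

```go
// Hypothetical sketch: each job returns only its local top-k for an
// interval; the frontend re-ranks the union of candidates, which still
// contains the true global top-k.
package main

import (
	"fmt"
	"sort"
)

func topK(vals []float64, k int) []float64 {
	out := append([]float64(nil), vals...)
	sort.Sort(sort.Reverse(sort.Float64Slice(out)))
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	job1 := []float64{9, 4, 1} // values seen by job 1 at one interval
	job2 := []float64{7, 6, 2} // values seen by job 2 at the same interval

	// Each job ships only its local top-2 to the frontend...
	candidates := append(topK(job1, 2), topK(job2, 2)...)

	// ...and the frontend's top-2 over the candidates matches the
	// top-2 over all raw values: [9 7].
	fmt.Println(topK(candidates, 2))
}
```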
Force-pushed from 2a1dfdf to 7c857c2.
pkg/traceql/engine_metrics.go
Outdated
@@ -1484,3 +1505,119 @@ func FloatizeAttribute(s Span, a Attribute) (float64, StaticType) {
	}
	return f, v.Type
}

// processTopK implements TopKBottomK topk method
func processTopK(input SeriesSet, limit int) SeriesSet {
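For context, a rough sketch of what a per-interval topk over a series set could look like, assuming a simplified series representation (a label string mapped to one value per interval); the actual Tempo SeriesSet and TimeSeries types differ, so this is only an illustration of the idea, not the PR's implementation.

```go
// A rough sketch of per-interval topk, not the PR's implementation.
// A "series" here is a label string mapped to one value per interval.
package main

import (
	"fmt"
	"sort"
)

func processTopKSketch(input map[string][]float64, limit, intervals int) map[string][]float64 {
	out := make(map[string][]float64)
	for i := 0; i < intervals; i++ {
		type entry struct {
			label string
			val   float64
		}
		ranked := make([]entry, 0, len(input))
		for label, vals := range input {
			ranked = append(ranked, entry{label, vals[i]})
		}
		// Highest values first; keep at most `limit` series for this interval.
		sort.Slice(ranked, func(a, b int) bool { return ranked[a].val > ranked[b].val })
		if len(ranked) > limit {
			ranked = ranked[:limit]
		}
		for _, e := range ranked {
			if _, ok := out[e.label]; !ok {
				// Intervals where the series was not selected stay at zero
				// here; real code would mark them as missing instead.
				out[e.label] = make([]float64, intervals)
			}
			out[e.label][i] = e.val
		}
	}
	return out
}

func main() {
	in := map[string][]float64{
		"a": {1, 9},
		"b": {5, 2},
		"c": {3, 7},
	}
	// topk(2) selects {b, c} at the first interval and {a, c} at the
	// second, so three series appear in the result even though k=2.
	fmt.Println(processTopKSketch(in, 2, 2))
}
```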
Sadly this breaks avg_over_time.
The way this function works is by producing two series per matching group: one with the current average and another one with the count. That way we can calculate the incremental average in the next aggregation layers:
{"span.foo"="baz"}
{__meta_type="__count", "span.foo"="baz"}
Therefore this is mixing count series with average ones, which is not correct.
Another concern is whether we should apply topk at the first layer. We could drop series that we need in the next aggregation layer. Maybe this should be done just in the last aggregation one?
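A minimal sketch (not Tempo code) of the incremental-average trick described above: a partial average is only meaningful together with its count, which is why ranking or dropping the average series separately from their companion __count series would break the next aggregation layer.

```go
package main

import "fmt"

// combineAvg merges two partial (average, count) pairs into the exact
// overall average, which is why avg_over_time emits a companion
// __meta_type="__count" series for each group.
func combineAvg(avg1, c1, avg2, c2 float64) (float64, float64) {
	if c1+c2 == 0 {
		return 0, 0
	}
	return (avg1*c1 + avg2*c2) / (c1 + c2), c1 + c2
}

func main() {
	// avg 4 over 2 points combined with avg 10 over 4 points
	// gives avg 8 over 6 points.
	fmt.Println(combineAvg(4, 2, 10, 4))
}
```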
In this case it means that avg_over_time() is not shardable and must be computed at the query-frontend. Whatever follows an unshardable element is also unshardable. So for { } | avg_over_time(...) | topk(...), the topk must be done only at the query-frontend.
Some queries are shardable, for example min/max_over_time; there the top/bottomk can also be pushed down to the earliest level.
Starting with always making the secondStage unshardable and computed only in the frontend sounds good. That is more straightforward and we can prioritize correctness first. Adding the shardable/unshardable aspect to the AST will be more involved.
Had a chat with @javiermolinar about this and updated the code to handle it.
I am only handling topk and bottomk in the frontend for now, and we can handle pushdown later when the first stage is shardable.
	result() SeriesSet
}

type getExemplar func(Span) (float64, uint64)
Most of the code here is moved from ast.go, to clean up the ast.go file and extract the metrics-related AST into its own file.
var _ firstStageElement = (*MetricsAggregate)(nil)

// secondStageElement represents operations that are performed
The new code for the second stage starts from here.
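For orientation, a hypothetical sketch of the shape a second-stage element could take; the method names here are assumptions for illustration only, and the real interface added in this PR may differ (SeriesSet is Tempo's existing labeled-series type).

```go
// Hypothetical sketch only, not the PR's actual interface.
type secondStageElement interface {
	// validate checks the element's arguments at parse time.
	validate() error
	// process transforms the SeriesSet produced by the first stage,
	// e.g. keeping only the topk/bottomk series per interval.
	process(input SeriesSet) SeriesSet
}
```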
Thank you for updating the docs. :) I have two minor suggestions to correct a missing word and remove an extra space.
Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
Force-pushed from cf63269 to a95deff.
Force-pushed from b4169be to 6303d81.
Force-pushed from 6303d81 to abe64f3.
@electron0zero @knylander-grafana I think the documentation didn't publish to next. Is there a way to retrigger it?
@alexbikfalvi the next run, triggered on another PR, took care of this, so it's published now (https://grafana.com/docs/tempo/next/traceql/metrics-queries/functions/#topk-and-bottomk-functions) and we don't need to re-trigger it. Though we should look into this failure, cc @knylander-grafana @jdbaldry
Noticed this duplicate code while working on #4646, so sending this standalone PR to clean it up.
@electron0zero nice work on this. I checked out the demo video since you mentioned it in your presentation and enjoyed watching it :) My only question from that was why we were seeing more than 5 series for a 'topk(5)' query, but your docs answered that question exactly when I checked them!
What this PR does:
Add support for topk and bottomk functions on top of TraceQL Metrics.
You can now write TraceQL queries like {} | rate() by (span.client_ip) | topk(10) or {} | rate() by (span.client_ip) | bottomk(10), and only get the top or bottom k series from the underlying TraceQL Metrics query. topk and bottomk behave like the Prometheus topk and bottomk functions.
Check out this demo video to see it in action:
topk_bottomk_demo_final.mov
topk and bottomk are implemented as the first set of second-stage functions (functions that operate on the series generated by TraceQL Metrics). The code in this PR lays the groundwork for second-stage functions and opens the door for others like min, max, and avg on top of TraceQL Metrics.
Which issue(s) this PR fixes:
Fixes #4217
Fixes https://github.com/grafana/tempo-squad/issues/601
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]