-
Notifications
You must be signed in to change notification settings - Fork 823
Fix tenant federation performance hotspot #4502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix tenant federation performance hotspot #4502
Conversation
c3e4c2c
to
5bc990c
Compare
Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>
Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>
Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>
5bc990c
to
be84c7b
Compare
cc @simonswine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good catch. LGTM
bump - can someone with merge privileges help merge this? |
Yes. I need a 2nd review. @bboreham or @pstibrany could you review this, please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
* merge queryable: cache curr series Signed-off-by: Eugene Ma <eugene.ma@airbnb.com> * add test to check label set Signed-off-by: Eugene Ma <eugene.ma@airbnb.com> * update CHANGELOG Signed-off-by: Eugene Ma <eugene.ma@airbnb.com> Co-authored-by: Eugene Ma <eugene.ma@airbnb.com> Signed-off-by: Manish Kumar Gupta <manishkg@microsoft.com>
* merge queryable: cache curr series Signed-off-by: Eugene Ma <eugene.ma@airbnb.com> * add test to check label set Signed-off-by: Eugene Ma <eugene.ma@airbnb.com> * update CHANGELOG Signed-off-by: Eugene Ma <eugene.ma@airbnb.com> Co-authored-by: Eugene Ma <eugene.ma@airbnb.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
What this PR does:
Profiling federated queries shows a hotspot in
MergeQueryable
that makes federated queries over many tenants very slow.Performance measurements using synthetic read/write traffic with benchtool:
{__name__=~".+"}
federated over all 100 tenants, last 15 minutesThe test was repeated with even more tenants (1000) and the performance gap was even larger (10s vs. 50s P99). With this patch applied, a federated query over 100 tenants patch performs similarly to a non-federated query (against the same number of queried time series and time range), eliminating almost all latency overhead of federated querying.
Profile graph taken on Querier while executing query without the patch (attached pprof file):

The root cause lies in the implementation of
addLabelsSeries.Labels
which rebuilds the entire label set for a series, followed by a sort. It does this in order to insert an extra__tenant_id__
label and rewrite any existing label with that name that would conflict.This isn't a big deal if called once per time series, but it's actually called many times for merging series in a sorted fashion. The merge algorithm uses a heap which performs a lot of comparisons (which are based on labels) when push/popping nodes. Therefore, we need to make comparisons cheap, which requires
.Labels()
calls to be cheap.This PR attempts to fix this hot spot by memoizing the label set once per call to the iterator's
At()
function, which makes theLabels
call essentially free.Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]