Skip to content

Fix tenant federation performance hotspot #4502

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 14, 2021

Conversation

edma2
Copy link
Contributor

@edma2 edma2 commented Sep 29, 2021

What this PR does:
Profiling federated queries shows a hotspot inMergeQueryable that makes federated queries over many tenants very slow.

Performance measurements using synthetic read/write traffic with benchtool:

  • Test parameters
    • tenants: 100
    • cardinality per tenant: 1000
    • query: {__name__=~".+"} federated over all 100 tenants, last 15 minutes
  • (Baseline) P99 query latency against a single tenant (with 100K cardinality): ~10s
  • P99 federated query latency without the fix: ~25s
  • P99 federated query latency with the fix: ~10s

The test was repeated with even more tenants (1000) and the performance gap was even larger (10s vs. 50s P99). With this patch applied, a federated query over 100 tenants patch performs similarly to a non-federated query (against the same number of queried time series and time range), eliminating almost all latency overhead of federated querying.

Profile graph taken on Querier while executing query without the patch (attached pprof file):
Screen Shot 2021-09-29 at 4 13 37 PM

The root cause lies in the implementation of addLabelsSeries.Labels which rebuilds the entire label set for a series, followed by a sort. It does this in order to insert an extra __tenant_id__ label and rewrite any existing label with that name that would conflict.

This isn't a big deal if called once per time series, but it's actually called many times for merging series in a sorted fashion. The merge algorithm uses a heap which performs a lot of comparisons (which are based on labels) when push/popping nodes. Therefore, we need to make comparisons cheap, which requires .Labels() calls to be cheap.

This PR attempts to fix this hot spot by memoizing the label set once per call to the iterator's At() function, which makes the Labels call essentially free.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
@edma2 edma2 force-pushed the fix-merge-queryable-hotspot branch from c3e4c2c to 5bc990c Compare September 29, 2021 23:47
Eugene Ma added 3 commits September 29, 2021 16:48
Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>
Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>
Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>
@edma2 edma2 force-pushed the fix-merge-queryable-hotspot branch from 5bc990c to be84c7b Compare September 29, 2021 23:48
@edma2
Copy link
Contributor Author

edma2 commented Oct 1, 2021

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@pracucci pracucci requested a review from pstibrany October 6, 2021 15:13
Copy link
Contributor

@simonswine simonswine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good catch. LGTM

@edma2
Copy link
Contributor Author

edma2 commented Oct 13, 2021

bump - can someone with merge privileges help merge this?

@pracucci
Copy link
Contributor

Yes. I need a 2nd review. @bboreham or @pstibrany could you review this, please?

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@pstibrany pstibrany merged commit 85c3781 into cortexproject:master Oct 14, 2021
srijan55 pushed a commit to srijan55/cortex that referenced this pull request Nov 26, 2021
* merge queryable: cache curr series

Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>

* add test to check label set

Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>

* update CHANGELOG

Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>

Co-authored-by: Eugene Ma <eugene.ma@airbnb.com>
Signed-off-by: Manish Kumar Gupta <manishkg@microsoft.com>
alvinlin123 pushed a commit to ac1214/cortex that referenced this pull request Jan 14, 2022
* merge queryable: cache curr series

Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>

* add test to check label set

Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>

* update CHANGELOG

Signed-off-by: Eugene Ma <eugene.ma@airbnb.com>

Co-authored-by: Eugene Ma <eugene.ma@airbnb.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 participants