Skip to content

Add a counter to track chunks which were de-duplicated at store time. #2485

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 22, 2020

Conversation

slim-bean
Copy link
Contributor

@slim-bean slim-bean commented Apr 19, 2020

Signed-off-by: Ed Welch edward.welch@grafana.com

What this PR does:

At the moment it's a bit tricky to understand how effectively chunks are being de-duplicated at the store with the cache based de-dupe code.

This PR adds a counter which explicitly increments when a chunk was not stored because it was already in the cache.

This can be combined with cortex_ingester_flush_reasons to get a ratio of chunks flushed vs deduped to better tell how well the de-duplication is working.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
@pracucci
Copy link
Contributor

pracucci commented Apr 20, 2020

I think we could assume this fixes #2353. /cc @bboreham

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could you mention it as ENHANCEMENT in the CHANGELOG please?

@@ -61,6 +61,11 @@ var (
// For 100k series for 7 week, could be 1.2m - 10*(8^(7-1)) = 2.6m.
Buckets: prometheus.ExponentialBuckets(10, 8, 7),
})
dedupedChunksTotal = promauto.NewCounter(prometheus.CounterOpts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This correlate with cortex_chunk_store_stored_chunks_total which is per user. @gouthamve are we OK having this metric non per-user?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know of a case where I would want to query this metric with a user dimension, that is, I'm not sure it's ever meaningful to know if one user has more de-duped chunks than another.

However if you were already querying cortex_chunk_store_stored_chunks_total with a user dimension and wanted to compare accurately with this metric then it would also need a user dimension.

My general feeling is there isn't a use case for checking de-duplication with a user and you would query both cortex_chunk_store_stored_chunks_total and this metric without a user label.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need it per-user if you're trying to charge users based on how much of the store they used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a count metric useful for this? Is there a good way to convert number of chunks flushed to the size of those chunks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all a little fuzzy since we don't include metadata in cortex_ingester_chunk_stored_bytes_total and don't have any user-specific counters for labels.

I'll modify #2353 to mention the bytes metric too.

@bboreham
Copy link
Contributor

Is this different to the dedupe cache hit rate?

@slim-bean
Copy link
Contributor Author

Is this different to the dedupe cache hit rate?

If I followed all this correctly, the cache used for checking de-dupes is the chunks cache which is also used when FetchChunks is called, so the hit rate on this cache is affected by queries (if you have a lot of queries for really old data they would drive up the miss rate on this cache)

The index writeDedupeCache is only checked after the chunks cache is checked, so if the chunks cache returned a result, the writeDedupeCache is never checked and therefore the stats for it won't reflect successful de-dupes

@bboreham
Copy link
Contributor

OK, is this different to the dedupe cache hit rate exposed by ingesters?
(ingesters never call FetchChunks)

@slim-bean
Copy link
Contributor Author

OK, is this different to the dedupe cache hit rate exposed by ingesters?
(ingesters never call FetchChunks)

Interesting I hadn't though about the runtime separation.

The problem here would be anyone running a singlebinary, which is possible to do with caching and distributed stores like dynamo/s3. I know this probably wouldn't be advised but we see this a lot with Loki Helm users.

Also a recent change in Loki will allow ingesters to query the store directly (part of how the persisting BoltDB to the object store implementation works), which would result in ingesters calling FetchChunks and impacting using cache stats for this purpose.

For these reasons I still see value in a separate metric for this.

Signed-off-by: Ed Welch <edward.welch@grafana.com>
Signed-off-by: Ed Welch <edward.welch@grafana.com>
@slim-bean slim-bean force-pushed the dedupe-chunk-counter branch from 81b0081 to 32a2bfb Compare April 21, 2020 11:57
Signed-off-by: Ed Welch <edward.welch@grafana.com>
Copy link
Contributor

@gouthamve gouthamve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good addition, the post-deduped metrics required in #2353 are already implemented imo.

This is to check if the deduping is effective or not.

@pracucci pracucci merged commit 681e16d into cortexproject:master Apr 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 participants