add per-tenant alertmanager metrics #2124

Merged
merged 7 commits into from
Mar 2, 2020

Contributor

@jtlisi jtlisi commented Feb 12, 2020

Moved from #2116 because integration tests don't work for Grafana org repo PRs due to NOQUAY being set as an environment variable.

What this PR does:

This PR takes advantage of the util.MetricFamiliesPerUser struct to provide per-tenant Alertmanager metrics.
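As a rough illustration of the per-tenant idea (not the actual `util.MetricFamiliesPerUser` API), the sketch below assumes each tenant has its own private registry and that a shared collector re-exports selected metrics with a `user` label. The names `tenantRegistry`, `sample`, and `collectPerTenant`, as well as the `cortex_` prefixing step, are hypothetical and only serve the example.

```go
package main

import (
	"fmt"
	"sort"
)

// tenantRegistry is a hypothetical stand-in for one tenant's private metric
// registry: metric name -> current value.
type tenantRegistry map[string]float64

// sample is one exported series carrying a "user" label.
type sample struct {
	Name  string
	User  string
	Value float64
}

// collectPerTenant re-exports the allowed metrics of every tenant under a
// prefixed name, attaching the tenant ID as the user label. Metrics that are
// not in the allowed list are dropped, which keeps cardinality bounded.
func collectPerTenant(regs map[string]tenantRegistry, prefix string, allowed []string) []sample {
	users := make([]string, 0, len(regs))
	for u := range regs {
		users = append(users, u)
	}
	sort.Strings(users) // deterministic output order

	var out []sample
	for _, u := range users {
		for _, name := range allowed {
			v, ok := regs[u][name]
			if !ok {
				continue // this tenant has not reported the metric
			}
			out = append(out, sample{Name: prefix + name, User: u, Value: v})
		}
	}
	return out
}

func main() {
	regs := map[string]tenantRegistry{
		"tenant-a": {"alertmanager_alerts": 3, "alertmanager_silences": 1},
		"tenant-b": {"alertmanager_alerts": 7},
	}
	allowed := []string{"alertmanager_alerts", "alertmanager_silences"}
	for _, s := range collectPerTenant(regs, "cortex_", allowed) {
		fmt.Printf("%s{user=%q} %g\n", s.Name, s.User, s.Value)
	}
}
```

The allow-list mirrors the intent of exposing only a small set of per-tenant series instead of every metric each Alertmanager instance reports.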

Which issue(s) this PR fixes:
Fixes #1631

Checklist

  • Tests updated
  • CHANGELOG.md updated
@jtlisi jtlisi mentioned this pull request Feb 12, 2020
2 tasks
Contributor Author

jtlisi commented Feb 13, 2020

@pracucci I pared down the number of user metrics considerably. The ones that remain either convey essential information about the number of alerts/silences or help users avoid silent failures.

cortex_alertmanager_alerts -- Gives a basic understanding of Alertmanager usage on a per-user basis
cortex_alertmanager_silences -- Same as above

cortex_alertmanager_notifications_total & cortex_alertmanager_notifications_failed_total -- These metrics are vital for detecting a silent failure in a user's config. A user could have a valid Alertmanager config with a misconfigured integration, which could easily lead to a silent failure where the user is not receiving alerts. Having these metrics on a per-user basis makes it easy to configure redundant alerts, either in a separate Prometheus/Alertmanager system or using something like Prometheus/(Grafana Alerts).
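To make the silent-failure check concrete, here is a minimal sketch of the comparison a redundant alert could perform on the two notification counters. It assumes the per-user values have already been read from the metrics (in practice one would more likely compare rates over a time window); `notifyCounts`, `failingTenants`, and the 50% threshold are hypothetical.

```go
package main

import "fmt"

// notifyCounts holds hypothetical per-tenant values, as they might be read
// from cortex_alertmanager_notifications_total and
// cortex_alertmanager_notifications_failed_total.
type notifyCounts struct {
	Total  float64
	Failed float64
}

// failingTenants returns the failure ratio of every tenant whose ratio of
// failed to attempted notifications is at or above the threshold.
func failingTenants(counts map[string]notifyCounts, threshold float64) map[string]float64 {
	out := make(map[string]float64)
	for user, c := range counts {
		if c.Total == 0 {
			continue // nothing was sent, so there is no ratio to evaluate
		}
		if ratio := c.Failed / c.Total; ratio >= threshold {
			out[user] = ratio
		}
	}
	return out
}

func main() {
	counts := map[string]notifyCounts{
		"tenant-a": {Total: 20, Failed: 19}, // likely a misconfigured integration
		"tenant-b": {Total: 50, Failed: 0},
	}
	for user, ratio := range failingTenants(counts, 0.5) {
		fmt.Printf("user %s: %.0f%% of notifications failed\n", user, ratio*100)
	}
}
```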

I also made sure to remove the unused functions I added to metrics_helper.go, as well as add unit tests for the function I did add.

Contributor

@pstibrany pstibrany left a comment

I have left some comments around metrics code.

Contributor

@gouthamve gouthamve left a comment

LGTM with nit. I haven't taken a close look at the tests as I think Marco and Peter took a close look there.

@jtlisi Could you also resolve conversations that have been already implemented?

Contributor

@pracucci pracucci left a comment

@jtlisi Thanks for addressing my feedback. I'm a bit concerned about the "unpause" logic and wondering if there's any issue there. Please take a look at comments.

Contributor

@pstibrany pstibrany left a comment

LGTM with minor nits.
(Btw, flaky test is now fixed on master)

Contributor

@pracucci pracucci left a comment

Approving because my previous concern about the "unpause" logic turned out not to be an issue. However, I left a few comments which I would be glad if you could address before merging. Thanks @jtlisi !

return nil
}

func (am *Alertmanager) isActive() bool {
Contributor

Why isn't this exposed, while Pause() is? I think it should mirror Pause() and be exposed too.

Contributor Author

Makes sense

userAM.Stop()
delete(am.alertmanagers, user)
// The user alertmanager is only paused in order to retain the prometheus metrics
// it has reported to it's registry. If a new config for this user appears, this structure
Contributor

it's > its
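
The two review threads above (an exported counterpart to Pause(), and pausing instead of deleting so that reported metrics are retained) can be sketched roughly as follows. This is not the PR's code: `tenantAM`, `manager`, `syncConfigs`, and the `alerts` field are hypothetical, and the assumption that a paused instance is reused when a config reappears is inferred from the "unpause" discussion.

```go
package main

import (
	"fmt"
	"sync"
)

// tenantAM is a hypothetical, stripped-down per-tenant Alertmanager.
type tenantAM struct {
	mtx    sync.RWMutex
	active bool
	alerts int // stands in for metrics held in the tenant's registry
}

// Pause marks the instance inactive without discarding its state.
func (am *tenantAM) Pause() {
	am.mtx.Lock()
	defer am.mtx.Unlock()
	am.active = false
}

// Unpause marks the instance active again.
func (am *tenantAM) Unpause() {
	am.mtx.Lock()
	defer am.mtx.Unlock()
	am.active = true
}

// IsActive is exported so that it mirrors Pause and Unpause.
func (am *tenantAM) IsActive() bool {
	am.mtx.RLock()
	defer am.mtx.RUnlock()
	return am.active
}

// manager owns one tenantAM per user.
type manager struct {
	tenants map[string]*tenantAM
}

// syncConfigs takes the users that currently have a config. Users that lost
// their config are paused rather than deleted, so their state survives.
func (m *manager) syncConfigs(configured []string) {
	seen := make(map[string]bool, len(configured))
	for _, user := range configured {
		seen[user] = true
		am, ok := m.tenants[user]
		if !ok {
			am = &tenantAM{}
			m.tenants[user] = am
		}
		am.Unpause() // new and previously paused instances become active
	}
	for user, am := range m.tenants {
		if !seen[user] {
			am.Pause()
		}
	}
}

func main() {
	m := &manager{tenants: map[string]*tenantAM{}}
	m.syncConfigs([]string{"tenant-a"})
	m.tenants["tenant-a"].alerts = 4

	m.syncConfigs(nil) // the config for tenant-a disappears
	am := m.tenants["tenant-a"]
	fmt.Println("active:", am.IsActive(), "alerts retained:", am.alerts)
}
```

Keeping the paused instance in the map is what lets the retained values be reported after the config is gone; the trade-off is that state for removed users stays in memory until the process restarts.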

jtlisi added 7 commits March 2, 2020 09:57
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
@jtlisi jtlisi merged commit 3c6875d into cortexproject:master Mar 2, 2020
@jtlisi jtlisi deleted the 20200211_pertenant_am_metrics branch March 2, 2020 15:09
4 participants