-
Notifications
You must be signed in to change notification settings - Fork 823
runtimeconfig: initialize overridesReloadSuccess gauge in success state #2092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtimeconfig: initialize overridesReloadSuccess gauge in success state #2092
Conversation
d7a4ed9
to
66bd5cd
Compare
|
Would appreciate some initial feedback. Thank you! :) |
Thanks for your contribution.
Does that match what you've observed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution! I accidentally run into this issue today. However, I'm not sure this is the right way to fix it. We should not export the metric at all when not used. My suggestion is to use the pattern we've recently started using:
- No
overridesReloadSuccess
global variable, but move it to theManager
struct - Pick
registerer prometheus.Registerer
in input toNewRuntimeConfigManager()
as second parameter and register theoverridesReloadSuccess
metric on that - In
modules.go
theinitRuntimeConfig()
callsNewRuntimeConfigManager(cfg.RuntimeConfig, prometheus.DefaultRegisterer)
Yes, I think so!
Agreed, that's a cleaner solution. |
c1f5fcf
to
e72efe6
Compare
e72efe6
to
63ebdaa
Compare
There seem to be code paths where the overridesReloadSuccess gauge gets initialized, but `NewRuntimeConfigManager()` never is called (absence of both log lines: "failed to load config", "runtime config disabled: file not specified"). promauto.NewGauge() seems to initialize the gauge with the value 0 (that seems to be common in Prometheus client libs). In the current code, the value 0 is associated with the error state, and the value 1 is associated with the success state. Likewise, https://github.com/grafana/cortex-jsonnet/ defines an alert that fires when this gauge is set to 0 for an extended period of time. That is, when `NewRuntimeConfigManager()` never gets called the gauge indicates a false negative error state. Starting in the success state is a pragmatic way to solve that problem. It might be better to use a counter for individual error or -- if a gauge is preferred -- one could introduce a third state referring to "initialized". Signed-off-by: Dr. Jan-Philip Gehrcke <jp@opstrace.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jp@opstrace.com>
63ebdaa
to
314bd12
Compare
…ically Only register the gauge when NewRuntimeConfigManager() is called (when a runtime config manager is actually being used). Still, initialize it explicitly in state 1 (success) instead of implicitly initializing it in state (error). manager_test.go: subsequently the registry should be used for testing the state of the metric in question. Signed-off-by: Dr. Jan-Philip Gehrcke <jp@opstrace.com>
Adjust variable name to actual behavior: The gauge is set to 0 when loading the config from the file failed. The gauge is set to 1 when loading the config from the file succeeded. Signed-off-by: Dr. Jan-Philip Gehrcke <jp@opstrace.com>
320ff5f
to
ad4721f
Compare
The commit starting with the message I have added more commits on top of that proposing variable renames in code, and after all also a rename of the gauge itself. If you can, please have a look at the commit messages. I am not sure how this affects our thinking w.r.t. backwards compatibility, but I went ahead and worded a corresponding changelog entry. If desired we can drop the last few commits again, or treat that independently in a separate discussion/PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your work on this!
Problem with this rename is that many people (including us at Grafana Labs) have alerts based on existing name and will need to change them. That would be fair if new new name provided improved functionality, but I’m not sure it does.
Personally, I would suggest keeping previous logic and name. With your change (only registering metric when needed), if component doesn’t use runtime config manager, metric will not be exported at all. If it is, it starts with 0 (= no load of the config), and when config is successfully loaded or reloaded, it is set to 1. I think that is a good logic. Using inversion (“load failed”) is more confusing to me.
@@ -204,8 +209,8 @@ func TestOverridesManager_StopClosesListenerChannels(t *testing.T) { | |||
return val, nil | |||
}, | |||
} | |||
|
|||
overridesManager, err := NewRuntimeConfigManager(overridesManagerConfig) | |||
registry := prometheus.NewRegistry() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically in tests we just pass nil (unless testing metrics), and make sure that New function can accept nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, done, commit pushed
Signed-off-by: Dr. Jan-Philip Gehrcke <jp@opstrace.com>
ad4721f
to
9715c4c
Compare
Signed-off-by: Dr. Jan-Philip Gehrcke <jp@opstrace.com>
Thanks for the quick feedback @pstibrany
Done. I have removed the more controversial commits!
Well, it's totally ok and I respect the decision :). My thinking, though, was that About the gauge itself and its meaning: I think in the long run we probably agree in the point of view that an error counter would be the more intuitive, simple, and complete solution. With the gauge we actually don't keep track of when exactly (and, with that, how frequently) the config load operation fails if it fails. Again, if this gauge is at 0 we actually can't even know for sure that a config load was attempted (with the current patch it's just way more likely :)). Thanks again, looking forward to reading your feedback about the current patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
pkg/util/runtimeconfig/manager.go
Outdated
mgr := Manager{ | ||
cfg: cfg, | ||
quit: make(chan struct{}), | ||
configLoadSuccess: prometheus.NewGauge(prometheus.GaugeOpts{ | ||
Name: "cortex_overrides_last_reload_successful", | ||
Help: "Whether the last config reload attempt was successful (0: error, 1: successful).", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(0: error, 1: successful)
can be removed, this is common enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, commit pushed
pkg/util/runtimeconfig/manager.go
Outdated
} | ||
|
||
// NewRuntimeConfigManager creates an instance of Manager and starts reload config loop based on config | ||
func NewRuntimeConfigManager(cfg ManagerConfig) (*Manager, error) { | ||
func NewRuntimeConfigManager(cfg ManagerConfig, metricsRegisterer prometheus.Registerer) (*Manager, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metricsRegisterer
> registerer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, commit pushed
Signed-off-by: Dr. Jan-Philip Gehrcke <jp@opstrace.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
From the commit message:
(work in progress, let's see what CI says)
What this PR does:
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]