-
Notifications
You must be signed in to change notification settings - Fork 658
Description
What is the problem you are trying to solve?
The default HA tracker failover timeout of 30 s can be inadequate for some tenants, leading to continuous failovers.
On the other hand, increasing the timeout would lead to longer periods in which we stop scraping metrics for a tenant in the event of actual unavailability from upstream instances, thus working against the whole purpose of HA tracker.
Which solution do you envision (roughly)?
@pracucci suggests computing the timeout for each tenant from its scrape interval:
Would be very nice if we wouldn't have to change it manually in the config, but we could auto-detect the tenant scrape interval (or highest value if customer has multiple scraping jobs with different intervals), so that we can apply the failover timeout equal to
max(configured timeout, highest scrape interval * delta).
Have you considered any alternatives?
This has been implemented, but specifying a custom timeout per tenant in config isn't a great approach.
Any additional context to share?
No response
How long do you think this would take to be developed?
Small (<= 1 month dev)
What are the documentation dependencies?
No response