Skip to content

Idea: Compute HA failover timeout from tenant's scrape rate #12108

@tcard

Description

@tcard

What is the problem you are trying to solve?

The default HA tracker failover timeout of 30 s can be inadequate for some tenants, leading to continuous failovers.

On the other hand, increasing the timeout would lead to longer periods in which we stop scraping metrics for a tenant in the event of actual unavailability from upstream instances, thus working against the whole purpose of HA tracker.

Which solution do you envision (roughly)?

@pracucci suggests computing the timeout for each tenant from its scrape interval:

Would be very nice if we wouldn't have to change it manually in the config, but we could auto-detect the tenant scrape interval (or highest value if customer has multiple scraping jobs with different intervals), so that we can apply the failover timeout equal to max(configured timeout, highest scrape interval * delta).

Have you considered any alternatives?

This has been implemented, but specifying a custom timeout per tenant in config isn't a great approach.

Any additional context to share?

No response

How long do you think this would take to be developed?

Small (<= 1 month dev)

What are the documentation dependencies?

No response

Proposer?

@pracucci

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions