feat(ruler): add per-tenant configuration to disable WAL replay #16717

Merged
merged 3 commits into from
Mar 19, 2025

Conversation

trevorwhitney
Collaborator

What this PR does / why we need it:

This change adds the ability to disable WAL replay for specific tenants through the ruler_enable_wal_replay tenant override. Disabling WAL replay helps prevent OOM crashes for tenants with large WALs, because it skips loading all series into memory during startup. We currently replay the WAL only so that we can reuse series IDs; we do not replay any metrics into the remote write target. Keeping the WAL is still beneficial for handling periods when the remote write server cannot be reached, but the benefit of replaying it is minimal given how frequently replay causes the rulers to OOM.

  • Added ruler_enable_wal_replay flag to Limits and Overrides
  • Modified WAL storage to accept enableReplay parameter
  • Updated ruler registry to use tenant-specific setting when creating instances
  • Added tests to verify WAL replay can be disabled
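
The flow described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the `Limits` interface, its `RulerEnableWALReplay` method, the `overrides` and `walStorage` types, and both constructor functions are hypothetical stand-ins. Only the `ruler_enable_wal_replay` setting name and the `enableReplay` parameter name come from the description. The sketch also assumes that replay stays enabled by default, so only tenants with an explicit override skip it.

```go
package main

import "fmt"

// Limits is a hypothetical stand-in for the per-tenant limits interface.
// The method name is an assumption made for this sketch.
type Limits interface {
	RulerEnableWALReplay(tenantID string) bool
}

// overrides is a hypothetical in-memory implementation: a default value
// plus optional per-tenant values for ruler_enable_wal_replay.
type overrides struct {
	defaultReplay bool
	perTenant     map[string]bool
}

func (o overrides) RulerEnableWALReplay(tenantID string) bool {
	if v, ok := o.perTenant[tenantID]; ok {
		return v
	}
	return o.defaultReplay
}

// walStorage is a toy model of a tenant's WAL-backed storage.
type walStorage struct {
	tenantID     string
	replayed     bool
	seriesLoaded int // series held in memory after startup
}

// newWALStorage models the enableReplay parameter: when it is false,
// the existing WAL is left on disk and no series are loaded into memory.
func newWALStorage(tenantID string, seriesInWAL int, enableReplay bool) *walStorage {
	s := &walStorage{tenantID: tenantID}
	if !enableReplay {
		return s
	}
	s.replayed = true
	s.seriesLoaded = seriesInWAL
	return s
}

// newTenantInstance models the registry step: look up the tenant-specific
// setting and pass it through when creating the instance.
func newTenantInstance(l Limits, tenantID string, seriesInWAL int) *walStorage {
	return newWALStorage(tenantID, seriesInWAL, l.RulerEnableWALReplay(tenantID))
}

func main() {
	limits := overrides{
		defaultReplay: true,
		perTenant:     map[string]bool{"large-tenant": false},
	}
	for _, id := range []string{"large-tenant", "small-tenant"} {
		inst := newTenantInstance(limits, id, 1000)
		fmt.Printf("%s: replayed=%t seriesLoaded=%d\n", inst.tenantID, inst.replayed, inst.seriesLoaded)
	}
}
```

Resolving the flag once, when the instance is created, keeps the storage layer unaware of tenants: it only sees a boolean.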

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR
@trevorwhitney trevorwhitney requested a review from a team as a code owner March 12, 2025 22:00
@trevorwhitney trevorwhitney changed the title feat: add per-tenant configuration to disable WAL replay Mar 12, 2025
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Mar 12, 2025
Contributor

github-actions bot commented Mar 12, 2025

💻 Deploy preview deleted.

@trevorwhitney trevorwhitney merged commit eda3ba8 into main Mar 19, 2025
62 checks passed
@trevorwhitney trevorwhitney deleted the per-tenant-disabling-of-ruler-wal-replay branch March 19, 2025 22:59
Labels
size/L type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
2 participants