fix: backport wal corruption fix to 2.9.x #18229

Merged

Contributor

@Segflow Segflow commented Jun 25, 2025

Backport 1954f67 from #18175


What this PR does / why we need it:

Currently, WAL corruption leads to endless restart loops, as reported in issue #12583. Users experience crashes with errors like:

corruption in segment /var/loki/tsdb-shipper-active/wal/s3_2024-01-02/1712203235/00000004 at 65536: last record is torn
error recovering from TSDB WAL

and

"error running loki" err="corruption in segment /data/loki/index/wal/filesystem_2023-05-01/1727455534/00000000 at 81944: unexpected checksum 2ffb91ba, expected 5999e2d7
error recovering from TSDB WAL

This causes Loki to crash-loop indefinitely. On every restart it reads the WAL again and updates object storage, which can be costly for large WALs.

This PR adds the ability to repair the WAL on startup when possible. It also adds unit tests that corrupt the WAL and then attempt to recover it.
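For readers unfamiliar with this kind of repair, here is a minimal, standard-library-only sketch of the general idea. It assumes that "repair" means discarding everything from the first corrupt record onward, which is how Prometheus-style WALs commonly handle it; this description does not confirm that detail. The record layout, `corruptionErr`, `appendRecord`, `readAll`, and `repair` are all hypothetical names for a toy format, not Loki's actual WAL format or API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// Toy record layout (an assumption for illustration, NOT Loki's WAL format):
// 4-byte big-endian payload length | payload | 4-byte CRC32 (Castagnoli) of payload.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

const (
	headerSize = 4
	crcSize    = 4
)

// corruptionErr reports the byte offset of the first record that failed validation.
type corruptionErr struct {
	Offset int
	Reason string
}

func (e *corruptionErr) Error() string {
	return fmt.Sprintf("corruption at %d: %s", e.Offset, e.Reason)
}

// appendRecord encodes payload as one record and appends it to log.
func appendRecord(log, payload []byte) []byte {
	log = binary.BigEndian.AppendUint32(log, uint32(len(payload)))
	log = append(log, payload...)
	return binary.BigEndian.AppendUint32(log, crc32.Checksum(payload, castagnoli))
}

// readAll decodes records until the end of log or the first corrupt record.
// The records decoded before the corruption are returned alongside the error.
func readAll(log []byte) ([][]byte, error) {
	var records [][]byte
	off := 0
	for off < len(log) {
		remaining := len(log) - off - headerSize
		if remaining < crcSize {
			return records, &corruptionErr{off, "last record is torn"}
		}
		n := int(binary.BigEndian.Uint32(log[off:]))
		if n > remaining-crcSize {
			return records, &corruptionErr{off, "last record is torn"}
		}
		payload := log[off+headerSize : off+headerSize+n]
		want := binary.BigEndian.Uint32(log[off+headerSize+n:])
		if got := crc32.Checksum(payload, castagnoli); got != want {
			reason := fmt.Sprintf("unexpected checksum %x, expected %x", got, want)
			return records, &corruptionErr{off, reason}
		}
		records = append(records, payload)
		off += headerSize + n + crcSize
	}
	return records, nil
}

// repair truncates log at the first corrupt record, keeping every valid record
// before it. err must be the non-nil error returned by readAll; any error that
// is not a corruptionErr is treated as unrecoverable.
func repair(log []byte, err error) ([]byte, error) {
	var cerr *corruptionErr
	if !errors.As(err, &cerr) {
		return log, fmt.Errorf("cannot repair: %w", err)
	}
	return log[:cerr.Offset], nil
}

func main() {
	var log []byte
	log = appendRecord(log, []byte("series-1"))
	log = appendRecord(log, []byte("series-2"))
	log = appendRecord(log, []byte("series-3"))
	log = log[:len(log)-3] // simulate a crash mid-write: the last record is torn

	_, err := readAll(log)
	fmt.Println("first read:", err)

	log, err = repair(log, err)
	if err != nil {
		panic(err)
	}
	records, err := readAll(log)
	fmt.Println("after repair:", len(records), "records, err =", err)
}
```

The trade-off in this sketch is that data after the corruption point is lost, in exchange for a process that can start again instead of failing on the same segment at every restart.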

Note that we only recover from corruption at the TSDB WAL level. Corruption in the data itself (invalid chunks or invalid series) is unrecoverable: we log the error and crash.

The PR also includes documentation of the WAL format for future reference.

The metric wal_corruptions_repairs_total is added to track whether each attempt to repair a corrupted WAL succeeds or fails.
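As a rough illustration of how such a counter might be driven, the sketch below counts repair attempts by outcome using only the standard library. The `repairCounter` type, the `observeRepair` helper, and the "success"/"failure" keys are assumptions made for this example; the real metric is presumably registered through the Prometheus client, and its actual label names are not stated here.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// repairCounter is a hypothetical stand-in for a labelled counter such as
// wal_corruptions_repairs_total. It is not the type used in Loki.
type repairCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newRepairCounter() *repairCounter {
	return &repairCounter{counts: map[string]int{}}
}

// inc adds one to the count for the given outcome.
func (c *repairCounter) inc(status string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[status]++
}

// get returns the current count for the given outcome.
func (c *repairCounter) get(status string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[status]
}

// errSegmentUnreadable is an example failure used by the demo below.
var errSegmentUnreadable = errors.New("segment unreadable")

// observeRepair runs one repair attempt, records its outcome under the
// assumed keys "success" or "failure", and passes the error through.
func observeRepair(c *repairCounter, repair func() error) error {
	err := repair()
	if err != nil {
		c.inc("failure")
	} else {
		c.inc("success")
	}
	return err
}

func main() {
	c := newRepairCounter()
	_ = observeRepair(c, func() error { return nil })
	_ = observeRepair(c, func() error { return errSegmentUnreadable })
	fmt.Println("success:", c.get("success"), "failure:", c.get("failure"))
}
```

Counting both outcomes makes it possible to alert on failed repairs while still seeing how often corruption occurs at all.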

Which issue(s) this PR fixes:
Fixes #12583

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR
@Segflow Segflow changed the base branch from main to release-2.9.x June 25, 2025 11:49
@Segflow Segflow added backport type/bug Something is not working as expected product-approved labels Jun 25, 2025
@Segflow Segflow marked this pull request as ready for review June 25, 2025 11:52
@Segflow Segflow requested a review from a team as a code owner June 25, 2025 11:52
@Segflow Segflow enabled auto-merge (squash) June 25, 2025 12:55
@Segflow Segflow removed the backport label Jun 25, 2025
@Segflow Segflow disabled auto-merge June 25, 2025 13:20
@Segflow Segflow merged commit 77fc888 into release-2.9.x Jun 25, 2025
47 checks passed
@Segflow Segflow deleted the meher/release-2.9.x/backport_wal_corruption_fix branch June 25, 2025 13:33
Labels
product-approved size/L type/bug Something is not working as expected
2 participants