WAL corruption leads to endless restarts

Describe the bug

The loki-write pod crashes with this log line:

msg="error running loki" err="corruption in segment /var/loki/tsdb-shipper-active/wal/s3_2024-01-02/1712203235/00000004 at 65536: last record is torn\nerror recovering from TSDB WAL"

and then restarts indefinitely (crash loop).

At each restart, however, it reads the WAL and writes to object storage again.
With a large WAL this can get expensive, because the same data is sent to object storage on every restart.

To Reproduce

Steps to reproduce the behavior:

Run Loki 2.9.6.
It happens occasionally on slightly undersized clusters where pods tend to get OOM-killed.
It does not happen consistently.

Expected behavior
I understand that the WAL can get corrupted when the app crashes unexpectedly.
But when the WAL is corrupted, maybe Loki should discard it, so that after one crash it can start properly instead of retrying endlessly?

Environment:

Infrastructure: Kubernetes
Deployment tool: helm 5.47.2

WAL corruption leads to endless restarts #12583
