WAL corruption leads to endless restarts #12583

@hervenicol

Describe the bug

The loki-write pod dies with this log message:

msg="error running loki" err="corruption in segment /var/loki/tsdb-shipper-active/wal/s3_2024-01-02/1712203235/00000004 at 65536: last record is torn\nerror recovering from TSDB WAL"

and then restarts indefinitely (crash loop).

At each restart it reads the WAL and updates object storage again.
With a large WAL this gets expensive, because all the data is sent to object storage over and over.

To Reproduce

Steps to reproduce the behavior:

  1. Run Loki 2.9.6.
  2. The crash happens once in a while on clusters that are slightly undersized and where pods tend to get OOM-killed.
  3. It does not reproduce consistently.

Expected behavior
I understand that the WAL can get corrupted when the app crashes unexpectedly.
But perhaps a corrupted WAL should be discarded, so that after crashing once Loki can start properly instead of retrying endlessly?

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm 5.47.2

Labels: type/bug (Something is not working as expected)
