Labels
type/bug: Something is not working as expected
Description
Describe the bug
The loki-write pod dies with the following log:
msg="error running loki" err="corruption in segment /var/loki/tsdb-shipper-active/wal/s3_2024-01-02/1712203235/00000004 at 65536: last record is torn\nerror recovering from TSDB WAL"
It then restarts indefinitely (crashloop).
On each restart, however, it still reads the WAL and writes to object storage.
With a large WAL this gets expensive, because the same data is sent to object storage again on every restart.
To Reproduce
Steps to reproduce the behavior:
- Running Loki 2.9.6
- It happens occasionally on somewhat undersized clusters where pods tend to be OOM-killed.
- This does not happen consistently.
Expected behavior
I understand that the WAL can get corrupted when the app crashes unexpectedly.
But when the WAL is corrupted, maybe Loki should discard it, so that after one crash it can start properly instead of retrying endlessly?
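To make the request concrete, here is a minimal sketch of "keep the intact records, drop the torn tail". It uses a hypothetical toy log format (a 4-byte length and a 4-byte CRC32 before each payload), not the actual TSDB WAL layout, and the names `encode_record`, `scan_valid_prefix` and `repair_segment` are invented for illustration. It is not how Loki handles this today.

```python
import struct
import zlib

# Toy record header (an assumption for this sketch, NOT the TSDB WAL format):
# big-endian payload length followed by the CRC32 of the payload.
HEADER = struct.Struct(">II")


def encode_record(payload: bytes) -> bytes:
    """Frame one payload as header + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def scan_valid_prefix(data: bytes):
    """Return (records, valid_len): the intact records at the start of
    `data` and the byte offset at which the last intact record ends."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        end = start + length
        if end > len(data):
            break  # torn record: payload was cut off mid-write
        payload = data[start:end]
        if zlib.crc32(payload) != crc:
            break  # corrupt record: stop at the last known-good offset
        records.append(payload)
        offset = end
    return records, offset


def repair_segment(path: str):
    """Truncate the file at `path` to its valid prefix.
    Returns (records, dropped_bytes)."""
    with open(path, "rb") as f:
        data = f.read()
    records, valid_len = scan_valid_prefix(data)
    dropped = len(data) - valid_len
    if dropped:
        with open(path, "r+b") as f:
            f.truncate(valid_len)
    return records, dropped
```

With something along these lines, a segment that ends in a half-written record would cost only that last record, and the next start would see a clean segment instead of failing at the same offset again.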
Environment:
- Infrastructure: Kubernetes
- Deployment tool: helm 5.47.2