Description
We run Loki in simple scalable mode on Kubernetes with the WAL enabled and writing to PVCs. When a PVC becomes full, the corresponding loki-write instance appears to brick itself: the ingester can no longer initialize because there is no space left on the PVC, so it never gets to replay and flush the WAL to long-term storage, which is exactly what would free up space on the PVC again.
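For context, we do not set ingester.wal explicitly, so as far as we understand the instance runs with the WAL defaults below (our reading of the docs, not copied from a running config; the dir matches the path in the error log and the chart's mount under common.path_prefix):

ingester:
  wal:
    enabled: true
    dir: /var/loki/wal            # on the PVC, under common.path_prefix (see config below)
    checkpoint_duration: 5m
    flush_on_shutdown: false
    replay_memory_ceiling: 4GB    # default per the docs; see below for why this mattered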
Startup logs from loki-write:
level=error ts=2024-12-03T13:48:43.80708689Z caller=log.go:216 msg="error running loki" err="open /var/loki/wal/00039919: no space left on device
error initialising module: ingester
github.com/grafana/dskit/modules.(*Manager).initModule
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
    /src/loki/pkg/loki/loki.go:458
main.main
    /src/loki/cmd/loki/main.go:129
runtime.main
    /usr/local/go/src/runtime/proc.go:271
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1695"
level=info ts=2024-12-03T13:48:43.800163536Z caller=head_manager.go:308 index-store=tsdb-2024-09-17 component=tsdb-head-manager msg="loaded wals by period" groups=1
level=info ts=2024-12-03T13:48:43.800103805Z caller=manager.go:86 index-store=tsdb-2024-09-17 component=tsdb-manager msg="loaded leftover local indices" err=null successful=true buckets=49 indices=30 failures=0
level=info ts=2024-12-03T13:48:43.796691475Z caller=head_manager.go:308 index-store=tsdb-2024-09-17 component=tsdb-head-manager msg="loaded wals by period" groups=0
level=info ts=2024-12-03T13:48:43.796371461Z caller=table_manager.go:136 index-store=tsdb-2024-09-17 msg="uploading tables"
level=info ts=2024-12-03T13:48:43.79630754Z caller=shipper.go:160 index-store=tsdb-2024-09-17 msg="starting index shipper in WO mode"
We had multiple problems leading up to this particular issue, but I think the root cause was an incorrectly configured ingester.wal.replay_memory_ceiling. It was left at its default (4GB according to the docs) while the memory resource limit for loki-write was set to 2GB. During a deployment, log ingestion happened to be surging, and one of the loki-write instances was OOM-killed during WAL replay while it was initializing. However, ingestion had continued on the same instance before this happened, letting the WAL grow even further, and eventually the PVC ran out of storage space. At that point we spotted the replay_memory_ceiling misconfiguration, adjusted it to 1.5GB, and redeployed. That is when the log record above started to appear, crashing the container.
We ended up deleting the PVC to get the instance up and running again.
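The adjustment itself was just the ceiling, next to the (pre-existing) memory limit; roughly the following, expressed as Helm values via the chart's loki.structuredConfig (paths and values here are from memory of our values file, not a verbatim copy):

loki:
  structuredConfig:
    ingester:
      wal:
        replay_memory_ceiling: 1.5GB   # was left at the 4GB default before
write:
  resources:
    limits:
      memory: 2Gi                      # unchanged; the ceiling now fits under it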
Config:
auth_enabled: false
chunk_store_config:
  chunk_cache_config:
    background:
      writeback_buffer: 500000
      writeback_goroutines: 1
      writeback_size_limit: 500MB
    default_validity: 0s
    memcached:
      batch_size: 4
      parallelism: 5
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.loki.svc
      consistent_hash: true
      max_idle_conns: 72
      timeout: 2000ms
common:
  compactor_address: 'http://loki-backend:3100'
  path_prefix: /var/loki
  replication_factor: 3
  storage:
    azure:
      account_name: <redacted>
      container_name: chunks
      use_federated_token: true
      use_managed_identity: false
compactor:
  compaction_interval: 10m
  delete_request_store: azure
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  retention_enabled: true
frontend:
  scheduler_address: ""
  tail_proxy_url: ""
frontend_worker:
  scheduler_address: ""
index_gateway:
  mode: simple
ingester:
  chunk_encoding: snappy
limits_config:
  allow_structured_metadata: true
  ingestion_rate_strategy: local
  max_cache_freshness_per_query: 10m
  max_chunks_per_query: 300000
  otlp_config:
    resource_attributes:
      attributes_config:
        - action: index_label
          attributes:
            - namespace
            - container
            - pod
            - cluster
            - k8s.cronjob.name
            - k8s.job.name
            - k8s.daemonset.name
            - k8s.statefulset.name
            - k8s.deployment.name
            - k8s.replicaset.name
      ignore_defaults: true
  query_timeout: 300s
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 30d
  split_queries_by_interval: 15m
  volume_enabled: true
memberlist:
  join_members:
    - loki-memberlist
pattern_ingester:
  enabled: true
querier:
  max_concurrent: 4
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      background:
        writeback_buffer: 500000
        writeback_goroutines: 1
        writeback_size_limit: 500MB
      default_validity: 12h
      memcached_client:
        addresses: dnssrvnoa+_memcached-client._tcp.loki-results-cache.loki.svc
        consistent_hash: true
        timeout: 500ms
        update_interval: 1m
ruler:
  storage:
    azure:
      account_name: <redacted>
      container_name: ruler
      use_federated_token: true
      use_managed_identity: false
    type: azure
runtime_config:
  file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
  configs:
    - from: "2024-09-17"
      index:
        period: 24h
        prefix: loki_index_
      object_store: azure
      schema: v13
      store: tsdb
server:
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608
  http_listen_port: 3100
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
  log_level: info
storage_config:
  boltdb_shipper:
    index_gateway_client:
      server_address: dns+loki-backend-headless.loki.svc.cluster.local:9095
  hedging:
    at: 250ms
    max_per_second: 20
    up_to: 3
  tsdb_shipper:
    index_gateway_client:
      server_address: dns+loki-backend-headless.loki.svc.cluster.local:9095
tracing:
  enabled: true
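For completeness, the kind of PVC-usage alert that would have caught this before the WAL filled the volume, as a rough sketch only using the kubelet volume metrics; the namespace and PVC name pattern below are assumptions based on the chart's default naming, not taken from our environment:

- alert: LokiWritePVCAlmostFull
  expr: |
    kubelet_volume_stats_available_bytes{namespace="loki", persistentvolumeclaim=~"data-loki-write-.*"}
      /
    kubelet_volume_stats_capacity_bytes{namespace="loki", persistentvolumeclaim=~"data-loki-write-.*"}
    < 0.10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: 'loki-write PVC {{ $labels.persistentvolumeclaim }} has less than 10% space left'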
Environment:
- Infrastructure: AKS / Kubernetes v1.29.4
- Deployment tool: Helm, ArgoCD
- loki-write: docker.io/grafana/loki:3.1.1
- Loki helm chart: 6.12.0 (https://github.com/grafana/loki/tree/main/production/helm/loki)