Description
We run Loki in simple scalable mode on Kubernetes with the WAL enabled and writing to PVCs. When a PVC becomes full, the corresponding loki-write instance appears to brick itself: the ingester can no longer initialize because there is no space left on the PVC, so it never gets to replay and flush the WAL to long-term storage, which is exactly what would free up space on the PVC again.
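For context, we do not set ingester.wal explicitly, so as far as we understand the instance runs with the WAL defaults below (our reading of the docs, not copied from a running config; the dir matches the path in the error log and the chart's mount under common.path_prefix):

ingester:
  wal:
    enabled: true
    dir: /var/loki/wal            # on the PVC, under common.path_prefix (see config below)
    checkpoint_duration: 5m
    flush_on_shutdown: false
    replay_memory_ceiling: 4GB    # default per the docs; see below for why this mattered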
Startup logs from loki-write:
level=error ts=2024-12-03T13:48:43.80708689Z caller=log.go:216 msg="error running loki" err="open /var/loki/wal/00039919: no space left on device
error initialising module: ingester
github.com/grafana/dskit/modules.(*Manager).initModule
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
    /src/loki/pkg/loki/loki.go:458
main.main
    /src/loki/cmd/loki/main.go:129
runtime.main
    /usr/local/go/src/runtime/proc.go:271
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1695"
level=info ts=2024-12-03T13:48:43.800163536Z caller=head_manager.go:308 index-store=tsdb-2024-09-17 component=tsdb-head-manager msg="loaded wals by period" groups=1
level=info ts=2024-12-03T13:48:43.800103805Z caller=manager.go:86 index-store=tsdb-2024-09-17 component=tsdb-manager msg="loaded leftover local indices" err=null successful=true buckets=49 indices=30 failures=0
level=info ts=2024-12-03T13:48:43.796691475Z caller=head_manager.go:308 index-store=tsdb-2024-09-17 component=tsdb-head-manager msg="loaded wals by period" groups=0
level=info ts=2024-12-03T13:48:43.796371461Z caller=table_manager.go:136 index-store=tsdb-2024-09-17 msg="uploading tables"
level=info ts=2024-12-03T13:48:43.79630754Z caller=shipper.go:160 index-store=tsdb-2024-09-17 msg="starting index shipper in WO mode"
We had multiple problems leading up to this particular issue, but I think the root cause was an incorrectly configured ingester.wal.replay_memory_ceiling. It was left at its default (4GB according to the docs) while the memory resource limit for loki-write was set to 2GB. During a deployment, log ingestion happened to be surging, and one of the loki-write instances was OOM-killed during WAL replay while it was initializing. However, ingestion had continued on the same instance before this happened, letting the WAL grow even further, and eventually the PVC ran out of storage space. At that point we spotted the replay_memory_ceiling misconfiguration, adjusted it to 1.5GB, and redeployed. That is when the log record above started to appear, crashing the container.
We ended up deleting the PVC to get the instance up and running again.
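The adjustment itself was just the ceiling, next to the (pre-existing) memory limit; roughly the following, expressed as Helm values via the chart's loki.structuredConfig (paths and values here are from memory of our values file, not a verbatim copy):

loki:
  structuredConfig:
    ingester:
      wal:
        replay_memory_ceiling: 1.5GB   # was left at the 4GB default before
write:
  resources:
    limits:
      memory: 2Gi                      # unchanged; the ceiling now fits under it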
Config:
auth_enabled: false
chunk_store_config:
  chunk_cache_config:
    background:
      writeback_buffer: 500000
      writeback_goroutines: 1
      writeback_size_limit: 500MB
    default_validity: 0s
    memcached:
      batch_size: 4
      parallelism: 5
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.loki.svc
      consistent_hash: true
      max_idle_conns: 72
      timeout: 2000ms
common:
  compactor_address: 'http://loki-backend:3100'
  path_prefix: /var/loki
  replication_factor: 3
  storage:
    azure:
      account_name: <redacted>
      container_name: chunks
      use_federated_token: true
      use_managed_identity: false
compactor:
  compaction_interval: 10m
  delete_request_store: azure
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  retention_enabled: true
frontend:
  scheduler_address: ""
  tail_proxy_url: ""
frontend_worker:
  scheduler_address: ""
index_gateway:
  mode: simple
ingester:
  chunk_encoding: snappy
limits_config:
  allow_structured_metadata: true
  ingestion_rate_strategy: local
  max_cache_freshness_per_query: 10m
  max_chunks_per_query: 300000
  otlp_config:
    resource_attributes:
      attributes_config:
        - action: index_label
          attributes:
            - namespace
            - container
            - pod
            - cluster
            - k8s.cronjob.name
            - k8s.job.name
            - k8s.daemonset.name
            - k8s.statefulset.name
            - k8s.deployment.name
            - k8s.replicaset.name
      ignore_defaults: true
  query_timeout: 300s
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 30d
  split_queries_by_interval: 15m
  volume_enabled: true
memberlist:
  join_members:
    - loki-memberlist
pattern_ingester:
  enabled: true
querier:
  max_concurrent: 4
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      background:
        writeback_buffer: 500000
        writeback_goroutines: 1
        writeback_size_limit: 500MB
      default_validity: 12h
      memcached_client:
        addresses: dnssrvnoa+_memcached-client._tcp.loki-results-cache.loki.svc
        consistent_hash: true
        timeout: 500ms
        update_interval: 1m
ruler:
  storage:
    azure:
      account_name: <redacted>
      container_name: ruler
      use_federated_token: true
      use_managed_identity: false
    type: azure
runtime_config:
  file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
  configs:
    - from: "2024-09-17"
      index:
        period: 24h
        prefix: loki_index_
      object_store: azure
      schema: v13
      store: tsdb
server:
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608
  http_listen_port: 3100
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
  log_level: info
storage_config:
  boltdb_shipper:
    index_gateway_client:
      server_address: dns+loki-backend-headless.loki.svc.cluster.local:9095
  hedging:
    at: 250ms
    max_per_second: 20
    up_to: 3
  tsdb_shipper:
    index_gateway_client:
      server_address: dns+loki-backend-headless.loki.svc.cluster.local:9095
tracing:
  enabled: true
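For completeness, the kind of PVC-usage alert that would have caught this before the WAL filled the volume, as a rough sketch only using the kubelet volume metrics; the namespace and PVC name pattern below are assumptions based on the chart's default naming, not taken from our environment:

- alert: LokiWritePVCAlmostFull
  expr: |
    kubelet_volume_stats_available_bytes{namespace="loki", persistentvolumeclaim=~"data-loki-write-.*"}
      /
    kubelet_volume_stats_capacity_bytes{namespace="loki", persistentvolumeclaim=~"data-loki-write-.*"}
    < 0.10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: 'loki-write PVC {{ $labels.persistentvolumeclaim }} has less than 10% space left'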
Environment:
- Infrastructure: AKS / Kubernetes v1.29.4
- Deployment tool: Helm, ArgoCD
- loki-write: docker.io/grafana/loki:3.1.1
- Loki helm chart: 6.12.0 (https://github.com/grafana/loki/tree/main/production/helm/loki)