Problem
When Filebeat runs as a managed component under Elastic Agent, the V2 manager's Start() method — which initiates the gRPC connection and check-in loop with the agent — is called at the very end of Filebeat.Run() initialization, after all components have been initialized. This means the beat cannot send check-ins while the state store is loading.
If the memlog data file is large and the process is CPU-constrained, the state store loading can take longer than the agent's check-in timeout (3 missed check-ins at 30s intervals = 90s). When this happens, the agent kills the unresponsive component and restarts it, but the new instance faces the same problem — creating a permanent crash loop. Each killed process also leaves behind a zombie since the parent never calls wait() on it.
Root cause
In filebeat/beater/filebeat.go, b.Manager.Start() is called at line 441, after all initialization including:
- Line 287: openStateStore(), which creates the memlog Registry (no disk I/O yet)
- Line 301: registrar.New(stateStore, ...), which calls stateStore.Access(), which in turn triggers memlog.Registry.Access() → openStore(). This is where loadDataFile() and loadLogFile() parse the entire state from disk into memory. For components tracking many files, this data file can be hundreds of megabytes of JSON.
- Lines 308–438: remaining initialization (pipeline setup, crawler, autodiscover, etc.)
- Line 441: b.Manager.Start(), where check-ins finally begin, far too late.
The call chain that blocks is:
Filebeat.Run() # filebeat/beater/filebeat.go:259
→ registrar.New(stateStore, ...) # filebeat/beater/filebeat.go:301
→ stateStore.Access() # filebeat/beater/store.go:57-59
→ registry.Get(storeName) # libbeat/statestore/registry.go:58-64
→ memlog.Registry.Access(name) # libbeat/statestore/backend/memlog/memlog.go:103-116
→ openStore(...) # libbeat/statestore/backend/memlog/store.go:110-134
→ loadDataFile(active.path, tbl) # BLOCKS HERE - CPU-bound JSON parsing
Since Manager.Start() at line 441 has not been reached, the V2 client never starts, no check-ins are sent, and the agent kills the process.
Proposed fix
Move b.Manager.Start() to execute before the state store loading begins — ideally right after openStateStore() at line 287 and before registrar.New() at line 301. At this point, the V2 client would start its gRPC connection and check-in loop, signaling to the agent that the component is alive (in a STARTING state) while the potentially slow store loading proceeds.
The manager does not depend on any of the components initialized after it — unitListen() simply queues incoming unit configurations until the beat is ready to process them. Starting it early is safe and only changes the timing of when the beat begins responding to check-ins and diagnostic requests.
Note that:
- The existing deferred Manager.Stop() ordering must be preserved
- The beat reports a STARTING (not HEALTHY) status during the loading phase
- Inputs are not processed until the rest of initialization completes
This change would also benefit other beats with potentially slow initialization phases (e.g., large Metricbeat module sets), not just Filebeat.