Filebeat: Move Manager.Start() before state store loading to prevent check-in timeout livelock #49512

Description

@rdner

Problem

When Filebeat runs as a managed component under Elastic Agent, the V2 manager's Start() method — which initiates the gRPC connection and check-in loop with the agent — is called at the very end of Filebeat.Run() initialization, after all components have been initialized. This means the beat cannot send check-ins while the state store is loading.

If the memlog data file is large and the process is CPU-constrained, the state store loading can take longer than the agent's check-in timeout (3 missed check-ins at 30s intervals = 90s). When this happens, the agent kills the unresponsive component and restarts it, but the new instance faces the same problem — creating a permanent crash loop. Each killed process also leaves behind a zombie since the parent never calls wait() on it.

Root cause

In filebeat/beater/filebeat.go, b.Manager.Start() is called at line 441, after all other initialization, including:

  1. Line 287: openStateStore() — creates the memlog Registry (no disk I/O yet)
  2. Line 301: registrar.New(stateStore, ...) — calls stateStore.Access(), which triggers memlog.Registry.Access() → openStore(). This is where loadDataFile() and loadLogFile() parse the entire state from disk into memory. For components tracking many files, this data file can be hundreds of megabytes of JSON.
  3. Lines 308–438: remaining initialization (pipeline setup, crawler, autodiscover, etc.)
  4. Line 441: b.Manager.Start() — check-ins begin here, far too late.

The call chain that blocks is:

Filebeat.Run()                                    # filebeat/beater/filebeat.go:259
  → registrar.New(stateStore, ...)                # filebeat/beater/filebeat.go:301
    → stateStore.Access()                         # filebeat/beater/store.go:57-59
      → registry.Get(storeName)                   # libbeat/statestore/registry.go:58-64
        → memlog.Registry.Access(name)            # libbeat/statestore/backend/memlog/memlog.go:103-116
          → openStore(...)                        # libbeat/statestore/backend/memlog/store.go:110-134
            → loadDataFile(active.path, tbl)      # BLOCKS HERE - CPU-bound JSON parsing

Since Manager.Start() at line 441 has not been reached, the V2 client never starts, no check-ins are sent, and the agent kills the process.
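The lazy-loading behaviour described above (creating the registry is cheap, the first Access() pays the full parsing cost) can be illustrated with a toy model. This is a sketch only: the registry type, its fields, and the flat JSON object used as a data file are all assumptions for illustration and do not reflect the real memlog types or on-disk format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// registry is a hypothetical stand-in for a store registry that defers
// all parsing until the store is first accessed.
type registry struct {
	raw   []byte // unparsed data file contents
	loads int    // number of times the data was parsed
}

// newRegistry only records the raw bytes; nothing is parsed here.
func newRegistry(raw []byte) *registry {
	return &registry{raw: raw}
}

// Access parses the whole data file into memory. In this toy model, this
// is the step that would block on a very large file.
func (r *registry) Access() (map[string]string, error) {
	r.loads++
	table := make(map[string]string)
	if err := json.Unmarshal(r.raw, &table); err != nil {
		return nil, fmt.Errorf("loading data file: %w", err)
	}
	return table, nil
}

func main() {
	reg := newRegistry([]byte(`{"file-a":"offset=10","file-b":"offset=42"}`))
	fmt.Println("loads after create:", reg.loads)

	table, err := reg.Access()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("loads after Access:", reg.loads, "entries:", len(table))
}
```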

Proposed fix

Move b.Manager.Start() to execute before the state store loading begins — ideally right after openStateStore() at line 287 and before registrar.New() at line 301. At this point, the V2 client would start its gRPC connection and check-in loop, signaling to the agent that the component is alive (in a STARTING state) while the potentially slow store loading proceeds.

The manager does not depend on any of the components initialized after it — unitListen() simply queues incoming unit configurations until the beat is ready to process them. Starting it early is safe and only changes the timing of when the beat begins responding to check-ins and diagnostic requests.
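The queuing behaviour this relies on can be pictured with a buffered channel: units that arrive early are held until the beat is ready to consume them. This is only an illustration of the idea; the manager and unitConfig types and their methods are hypothetical and are not the V2 manager's actual unitListen() implementation.

```go
package main

import "fmt"

// unitConfig is a hypothetical placeholder for a unit configuration
// received from the agent.
type unitConfig struct {
	id string
}

// manager holds early-arriving units in a buffered channel.
type manager struct {
	pending chan unitConfig
}

func newManager(capacity int) *manager {
	return &manager{pending: make(chan unitConfig, capacity)}
}

// receive enqueues a unit without blocking; it reports false if the
// queue is full.
func (m *manager) receive(u unitConfig) bool {
	select {
	case m.pending <- u:
		return true
	default:
		return false
	}
}

// drain returns all queued units in arrival order. The beat would call
// this once the rest of initialization has completed.
func (m *manager) drain() []unitConfig {
	var out []unitConfig
	for {
		select {
		case u := <-m.pending:
			out = append(out, u)
		default:
			return out
		}
	}
}

func main() {
	m := newManager(8)
	// Units arrive while the state store is still loading.
	m.receive(unitConfig{id: "input-1"})
	m.receive(unitConfig{id: "output-1"})

	// Initialization finished: process what was queued.
	for _, u := range m.drain() {
		fmt.Println("processing", u.id)
	}
}
```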

Note that:

  • The existing deferred Manager.Stop() ordering must be preserved
  • The beat reports a STARTING (not HEALTHY) status during the loading phase
  • Inputs are not processed until the rest of initialization completes
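On the first point: Go runs deferred calls in last-in, first-out order, so if a defer for Manager.Stop() were moved up together with Start(), Stop would run later relative to other deferred cleanups. The sketch below shows that effect with placeholder cleanup names; it assumes nothing about which defers actually exist in filebeat.go, and "store.Close" is purely illustrative.

```go
package main

import "fmt"

// shutdownOrder records the order in which two deferred cleanups run.
// If deferStopEarly is true, the manager's Stop is deferred before the
// other cleanup; otherwise it is deferred after it.
func shutdownOrder(deferStopEarly bool) []string {
	var order []string
	func() {
		if deferStopEarly {
			defer func() { order = append(order, "manager.Stop") }()
		}
		// Placeholder for any other deferred cleanup in the function.
		defer func() { order = append(order, "store.Close") }()
		if !deferStopEarly {
			defer func() { order = append(order, "manager.Stop") }()
		}
	}()
	return order
}

func main() {
	fmt.Println("Stop deferred late: ", shutdownOrder(false))
	fmt.Println("Stop deferred early:", shutdownOrder(true))
}
```

One way to keep the shutdown order unchanged, under these assumptions, is to move only the Start() call and leave the deferred Stop() where it is registered today.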

This change would also benefit other beats with potentially slow initialization phases (e.g., large Metricbeat module sets), not just Filebeat.
