Add beatreceiver OTEL telemetry#49300
Conversation
🤖 GitHub commentsJust comment with:
|
|
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
|
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a changelog fragment and integrates OpenTelemetry into Beats by introducing a new ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@x-pack/otel/telemetry/bridge.go`:
- Around line 328-333: The spawned goroutine in
callback/collectStats()/collectInputs() calls createAndReRegister which uses
ensureStatsInt, ensureStatsFloat, ensureInputInt and ensureInputFloat to mutate
instrument maps without synchronization; add locking inside those ensure*
functions so they acquire and defer b.mu.Unlock() (e.g., b.mu.Lock()/defer
b.mu.Unlock()) around any map reads/writes to prevent concurrent map access,
ensuring the functions (ensureStatsInt, ensureStatsFloat, ensureInputInt,
ensureInputFloat) are the single place that guard map access and avoiding
double-locks if callers already hold b.mu.
- Around line 157-164: Shutdown currently waits on b.reRegWg before removing the
registration which allows the re-register callback to call b.reRegWg.Add(1) and
panic; move the unregister step before waiting (i.e., acquire lock,
unregister/clear b.registration, release lock, then call b.reRegWg.Wait()) or
otherwise ensure the callback cannot Add after Shutdown begins; also eliminate
unsynchronized map access by protecting all reads of b.intGauges, b.floatGauges,
b.intCounters, b.floatCounters, etc. in collectStats() and collectInputs() with
b.mu (either lock around map iteration or copy the maps while holding b.mu and
iterate the copy without the lock) so that
createAndReRegister()/ensureStatsInt()/ensureStatsFloat() writes cannot race
with readers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cb20f2ec-c423-49f1-a8d6-3f3845769bca
📒 Files selected for processing (1)
x-pack/otel/telemetry/bridge.go
|
Haven't looked at the code yet, I assume there will be some work in https://github.com/elastic/elastic-agent/tree/main/internal/pkg/otel/receivers/elasticmonitoring to handle these coming in OTLP format now so we can avoid breaking the monitoring dashboards? |
|
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
This PR at the moment is just the first step. It provides a bridge between beats metrics (scoped to a specific receiver) to the standard OTEL metrics. It does require setting up the telemetry readers in the collector configuration and figuring out how to get the metrics to elasticsearch in the same resulting format that was used by the dashboards. This PR specifically scoped out the libbeat level metrics, but I could see those be registered only once (no matter the number of receivers) which seems possible with the otel collector. Which might then remove the need for We need to discuss how to then go from telemetry readers to elasticsearch (ensuring that the resulting document format is the same). |
|
It does seem possible with |
|
I think we'll want to expose the internal telemetry format directly in OTLP eventually, but as a choice. We need to keep the existing elastic agent integration dashboards and indices working in the ECS data model as the default for now to avoid forcing a breaking change on users here. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
x-pack/otel/telemetry/bridge.go (1)
343-349:⚠️ Potential issue | 🔴 CriticalWaitGroup race:
Add(1)called outside lock can race withShutdown()'sWait().After releasing the lock at line 341,
Add(1)at line 344 can race withShutdown()'sWait()at line 173. IfWait()returns beforeAdd(1)executes, this violates WaitGroup semantics and causes undefined behavior.Move the
Add(1)inside the lock scope and guard it with aclosedcheck:🔒 Proposed fix
b.mu.Lock() + if b.closed { + b.mu.Unlock() + return nil + } hasPending := len(b.pendingStatsInts) > 0 || len(b.pendingStatsFloats) > 0 || len(b.pendingInputInts) > 0 || len(b.pendingInputFloats) > 0 var statsInts, statsFloats, inputInts, inputFloats []string if hasPending { statsInts = b.pendingStatsInts statsFloats = b.pendingStatsFloats inputInts = b.pendingInputInts inputFloats = b.pendingInputFloats b.pendingStatsInts = nil b.pendingStatsFloats = nil b.pendingInputInts = nil b.pendingInputFloats = nil + b.reRegWg.Add(1) } b.mu.Unlock() if hasPending { - b.reRegWg.Add(1) go func() { defer b.reRegWg.Done() b.createAndReRegister(statsInts, statsFloats, inputInts, inputFloats) }() }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/otel/telemetry/bridge.go` around lines 343 - 349, The WaitGroup Add(1) for b.reRegWg is done outside the bridge lock and can race with Shutdown()'s Wait(), causing undefined behavior; move b.reRegWg.Add(1) inside the same locked critical section where hasPending is checked and, before spawning the goroutine that calls b.createAndReRegister(...), verify the bridge isn't closed (e.g., check b.closed or the existing closed flag) so you only Add/Start the goroutine when still open, and ensure Done() is still deferred inside the goroutine to keep Wait/Done pairing correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@x-pack/otel/telemetry/bridge.go`:
- Around line 343-349: The WaitGroup Add(1) for b.reRegWg is done outside the
bridge lock and can race with Shutdown()'s Wait(), causing undefined behavior;
move b.reRegWg.Add(1) inside the same locked critical section where hasPending
is checked and, before spawning the goroutine that calls
b.createAndReRegister(...), verify the bridge isn't closed (e.g., check b.closed
or the existing closed flag) so you only Add/Start the goroutine when still
open, and ensure Done() is still deferred inside the goroutine to keep Wait/Done
pairing correct.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 96bb6907-e8f8-4816-8990-380157ba771c
📒 Files selected for processing (2)
x-pack/otel/telemetry/bridge.gox-pack/otel/telemetry/bridge_test.go
|
I reviewed workflow run
What is this? | From workflow: PR Actions Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
|
I investigated workflow run
Root cause analysis is not applicable because the run completed successfully. No remediation is required. Tests run in this investigation: none locally; reviewed GitHub Actions run/job status and failed-job logs ( What is this? | From workflow: PR Actions Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
|
Workflow run
Recommended action: no remediation needed for this workflow run. If you expected a failure, share the failing run ID/check URL and I can analyze that run’s failing step output. Tests run in this investigation: none locally; analysis used GitHub Actions run and job metadata only. What is this? | From workflow: PR Actions Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@x-pack/otel/telemetry/bridge.go`:
- Around line 323-350: The callback path can spawn concurrent goroutines that
call createAndReRegister, causing duplicate OTel registrations; protect
re-registration with a single-flight mechanism (e.g., a dedicated mutex or
sync/singleflight.Group) so only one goroutine performs unregister+register at a
time: add a re-registration lock (name it reRegMu or use a singleflight.Group)
on RegistryBridge and acquire it inside the goroutine before calling
createAndReRegister, release it after, and keep using reRegWg.Done() in defer to
preserve shutdown tracking; update createAndReRegister invocation sites to
assume serialized re-registration and ensure any error handling still runs under
that protection.
- Around line 272-286: The current registerCallback stops after warning when
allInstruments() returns zero, which prevents dynamic discovery because
meter.RegisterCallback won't invoke the callback for an empty instrument list;
change registerCallback in RegistryBridge so that when len(instruments) == 0 it
actively triggers discovery instead of just warning — e.g., immediately invoke
the bridge's callback logic (so collectStats/collectInputs run at least once) or
start a short-lived background retry (ticker) that calls the callback and
re-checks allInstruments() until instruments are found, then proceeds to call
meter.RegisterCallback(b.callback, instruments...) and set b.registration under
b.mu; reference registerCallback, allInstruments, meter.RegisterCallback,
b.callback, collectStats, collectInputs, registration and mu when making the
change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2441ac4c-b746-474b-90ca-d7e243b4400b
📒 Files selected for processing (1)
x-pack/otel/telemetry/bridge.go
|
Workflow run
Tests/checks run for this investigation:
What is this? | From workflow: PR Actions Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
|
Checked workflow run
Root cause: none (no failure present in this run). Recommended action: no fix needed for this run; if a different failed run is the target, share that run ID and I can analyze the failing step in detail. Tests run in this investigation:
What is this? | From workflow: PR Actions Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
|
I reviewed workflow run
Root cause summary: no failing check was found in this run, so no remediation is required for Actions on this execution. Recommended follow-up: if you want failure analysis, share a run ID with Tests run in this investigation:
What is this? | From workflow: PR Actions Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
|
Checked workflow run Root cause of failure: none observed in the run logs; the key check ( Recommended action: no fix needed for this run; if you want extra assurance, re-run the same workflow after the next commit and only investigate if a job turns non-success. Tests run (from this workflow):
What is this? | From workflow: PR Actions Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
|
Workflow run What I checked
Root cause / remediation
Tests run
What is this? | From workflow: PR Actions Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
|
Workflow run From the job log (
Root causeTransient/network failure while fetching Recommended remediation
Tests run
What is this? | From workflow: PR Actions Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
Adds a new RegistryBridge and SystemRegistryBridge that bridges the beats internal metrics to the OTEL internal telemetry. This is done in a way where all of the metrics are dynamically mapped (removing the need to manually create mappings). The metrics are specifically kept in beats format using . as separators. That does provide a slight difference between OTEL telemetry which normally uses _. The bridge is done using the callback registration that ensures that the metric values provided to OTEL telemetry are always the value at that point in time. This ensures there is no drift between reporting the metrics and the actual value at that time.
Adds a new RegistryBridge and SystemRegistryBridge that bridges the beats internal metrics to the OTEL internal telemetry. This is done in a way where all of the metrics are dynamically mapped (removing the need to manually create mappings). The metrics are specifically kept in beats format using . as separators. That does provide a slight difference between OTEL telemetry which normally uses _. The bridge is done using the callback registration that ensures that the metric values provided to OTEL telemetry are always the value at that point in time. This ensures there is no drift between reporting the metrics and the actual value at that time.
Adds a new RegistryBridge and SystemRegistryBridge that bridges the beats internal metrics to the OTEL internal telemetry. This is done in a way where all of the metrics are dynamically mapped (removing the need to manually create mappings). The metrics are specifically kept in beats format using . as separators. That does provide a slight difference between OTEL telemetry which normally uses _. The bridge is done using the callback registration that ensures that the metric values provided to OTEL telemetry are always the value at that point in time. This ensures there is no drift between reporting the metrics and the actual value at that time.
Proposed commit message
Adds a new
RegistryBridgeandSystemRegistryBridgethat bridges the beats internal metrics to the OTEL internal telemetry. This is done in a way where all of the metrics are dynamically mapped (removing the need to manually create mappings).The metrics are specifically kept in beats format using
.as separators. That does provide a slight difference between OTEL telemetry which normally uses_.The bridge is done using the callback registration that ensures that the metric values provided to OTEL telemetry are always the value at that point in time. This ensures there is no drift between reporting the metrics and the actual value at that time.
Checklist
[ ] I have made corresponding change to the default configuration filesstresstest.shscript to run them under stress conditions and race detector to verify their stability../changelog/fragmentsusing the changelog tool.Disruptive User Impact
None.
Author's Checklist
How to test this PR locally
Pull this commit into the elastic-agent and then build a package that works for your host. Run a similar config as below:
Related issues
Logs
Trimmed version of
/tmp/otel-telemetry.json.