
CEL input: Add OTel tracing#48440

Open
chrisberkhout wants to merge 50 commits into elastic:main from chrisberkhout:cel-otel-tracing

Conversation

@chrisberkhout
Contributor

@chrisberkhout chrisberkhout commented Jan 16, 2026

Proposed commit message

CEL input: Add OTel tracing (#)

Instruments the CEL input with OpenTelemetry tracing. Sampling is 100%,
so all operations are covered. By default no exporter is set up and
traces are not exported. Export can be configured to go to the console
or to an OTLP endpoint using the `grpc` (default) or `http/protobuf`
protocols.

Typically, OTel tracing treats the whole process as the "resource".
In this case, however, the resource is the input instance. For that
reason a tracer provider is created specifically for the input instance,
and it is not set as the global tracer provider.

There is an extra environment variable to override any other
configuration and disable export for a specific input:
`BEATS_OTEL_TRACES_DISABLE=cel`.

Spans covering HTTP requests are enriched with attributes for request
and response headers, with values automatically (but configurably)
redacted to protect sensitive data.
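The redaction decision can be sketched roughly as follows. The function name, parameters, and the default word list are illustrative, not the exact implementation; the real defaults and the `otel.trace.redacted` / `otel.trace.unredacted` overrides live in the input's configuration:

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveWords is an illustrative default list; the input ships its own
// defaults, extensible via otel.trace.redacted / otel.trace.unredacted.
var sensitiveWords = []string{"auth", "token", "secret", "passwd", "password", "key"}

// shouldRedact reports whether a header (or query parameter) name looks
// sensitive. An explicit unredacted entry wins over everything else.
func shouldRedact(name string, redacted, unredacted []string) bool {
	for _, n := range unredacted {
		if strings.EqualFold(n, name) {
			return false
		}
	}
	for _, n := range redacted {
		if strings.EqualFold(n, name) {
			return true
		}
	}
	lower := strings.ToLower(name)
	for _, w := range sensitiveWords {
		if strings.Contains(lower, w) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldRedact("Authorization", nil, nil))                       // true: contains "auth"
	fmt.Println(shouldRedact("Authorization", nil, []string{"Authorization"})) // false: explicitly unredacted
	fmt.Println(shouldRedact("Content-Type", nil, nil))                        // false
}
```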

Normal request logging and Filebeat logs will include span and trace IDs
that allow correlation with the OTel data. This is done wherever we can
pass a logger from the trace creation site; other Filebeat logging will
lack the IDs. Because logger attributes are append-only, we pass around
a logger with the added attributes rather than modifying attributes on a
global logger.

Normal request logging had unused functionality for including a
`trace.id` field. That has been removed in favor of an OTel-specific
implementation that adds `trace.id` and `span.id` if there is a current,
valid span.

Requests initiated by CEL will have spans added by `otelhttp` and will
identify the correct parent span using trace data from the request
context. Since the relevant eval-time context is not propagated to those
requests by mito, cel-go[1] or oauth2[2], `ContextInjector` is used to
rewrite each request to include the current context as it is processed.

[1]: https://github.com/google/cel-go/issues/557
[2]: https://github.com/golang/oauth2/issues/262

There were a couple of points where the initial approach changed:

  • Use of https://pkg.go.dev/go.opentelemetry.io/contrib/exporters/autoexport to interpret OTel environment variables and set up the exporter was removed in favor of manual handling, which seems to be standard when using the Go SDK (unlike implementations in some other languages).
  • The context with OTel tracing data needs to be propagated to the HTTP client used by CEL so that HTTP spans are attached to the correct parent span. That was initially done with a change in Mito: Add HTTPWithContextFnOpts so requests can have eval-time context mito#118. That has been closed to avoid changing Mito. Now it is done in the CEL Input by having ContextInjector rewrite requests in the client used by CEL, which also solves the problem for OAuth2 requests.

There are some differences from the attribute and other names given in the planning document:

  • cel.periodic.program_count
    → Changed to cel.periodic.execution_count to match cel.program.execution.
  • cel.program.batch_count
    → Removed. It would only indicate whether an execution returned any events or not. Any other batching is internal to the CEL evaluation.
  • cel.{periodic,program}.success
    → Removed, in favor of span status.
  • cel.program.error_message
    → Not set. Uses SetStatus and RecordError instead.
  • BEATS_OTEL_TRACING_DISABLE
    → Changed to BEATS_OTEL_TRACES_DISABLE to match OTEL_TRACES_EXPORTER and OTEL_EXPORTER_OTLP_TRACES_*.

Handling of span-specific context and loggers is somewhat cumbersome. Refactoring to extract separate functions from `run` for the separate stages of processing will help to tidy this up and is planned as follow-up work: #48464.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

How to test this PR locally

You can use otel-desktop-viewer as a simple receiver and viewer of OTel traces:

# Install it
go install github.com/CtrlSpice/otel-desktop-viewer@latest

# Run it. It will open its web UI
otel-desktop-viewer

# In another terminal, set it as the destination for OTel traces
export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317

In the terminal with those environment variables set, you can run the input with an example that includes OAuth2 and multiple requests per period, like this:

(cd x-pack/filebeat && go build) && ./x-pack/filebeat/filebeat run -c <(echo '
filebeat.inputs:
- type: cel
  enabled: true
  id: cel-1
  interval: 5s
  resource.url: https://api.ipify.org/?format=json&passwd=mysecretword
  program: |
    get(state.url).Body.as(body, state.with({
        "events": [body.decode_json()],
        "want_more": int(state.?runcount.orValue(1)) % 3 != 0,
        "runcount": int(state.?runcount.orValue(1)) + 1,
    }))
  resource.tracer.enable: true
  resource.tracer.filename: "x-pack/filebeat/logs/cel/http-request-trace-cel-*.ndjson"
  auth.oauth2.enabled: true
  auth.oauth2.client.id: someclientid
  auth.oauth2.client.secret: someclientsecret
  auth.oauth2.scopes: scope.me
  auth.oauth2.token_url: https://oauth-mock.mock.beeceptor.com/oauth/token/github
  auth.oauth2.endpoint_params:
    grant_type: client_credentials
  otel.trace.redacted:
    - User-Agent
  otel.trace.unredacted:
    - Authorization
output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  username: "elastic"
  password: "changeme"
  protocol: "https"
  ssl.verification_mode: "none"
  preset: balanced
logging.level: debug
logging.to_stderr: true
')

You can also use Elastic Observability to receive and view OTel traces, but it involves a bit more setup.

Bring up the Elastic Stack:

elastic-package stack up -v

In Kibana, go to "Management > Integrations" and open the "APM" integration page. Then:

  • Click "Manage APM integration in Fleet", then "Add Elastic APM".
  • Under "Configure integration > Integration settings > General > Server configuration", change the Host and URL settings to use '0.0.0.0' instead of 'localhost'.
  • Under "Where to add this integration?", choose "Existing hosts > Elastic Agent (elastic-package)".
  • Click "Save and continue".

Now, back in the terminal, find the IP address of the agent container.

docker ps # confirm the agent container name is elastic-package-stack-elastic-agent-1
AGENT="elastic-package-stack-elastic-agent-1"
AGENT_IP=$(docker inspect "$AGENT" \
  --format '{{ (index .NetworkSettings.Networks "elastic-package-stack_default").IPAddress }}')
echo "$AGENT_IP" # confirm the IP was found

Use that as the destination for OTel traces:

export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://$AGENT_IP:8200"

Then, from the terminal with those settings, you can run the input using the example Filebeat configuration above.

To view the exported traces in Kibana, go to "Observability > Applications > Traces".

Related

Use cases

This tracing is to be used for troubleshooting, particularly for Agentless.

Screenshots

OTel traces for the CEL Input in Elastic Observability:
[Screenshot: "cel.periodic.run" transactions in the APM Traces view in Elastic Observability]

@chrisberkhout chrisberkhout self-assigned this Jan 16, 2026
@chrisberkhout chrisberkhout added enhancement Filebeat Filebeat Team:Security-Service Integrations Security Service Integrations Team labels Jan 16, 2026
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jan 16, 2026
@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)
@mergify
Contributor

mergify bot commented Jan 16, 2026

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b cel-otel-tracing upstream/cel-otel-tracing
git merge upstream/main
git push upstream cel-otel-tracing
@mergify
Contributor

mergify bot commented Jan 16, 2026

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR, @chrisberkhout? 🙏
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch (\d is a digit)
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.
@mergify
Contributor

mergify bot commented Jan 23, 2026

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b cel-otel-tracing upstream/cel-otel-tracing
git merge upstream/main
git push upstream cel-otel-tracing
@github-actions
Contributor

github-actions bot commented Jan 30, 2026

🔍 Preview links for changed docs

@elasticmachine
Contributor

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Feb 3, 2026
chrisberkhout and others added 23 commits February 9, 2026 11:41
Co-authored-by: Janeen Mikell Roberts <57149392+jmikell821@users.noreply.github.com>
Co-authored-by: Janeen Mikell Roberts <57149392+jmikell821@users.noreply.github.com>
Co-authored-by: Dan Kortschak <dan.kortschak@elastic.co>
Contributor

@orestisfl orestisfl left a comment


Use of https://pkg.go.dev/go.opentelemetry.io/contrib/exporters/autoexport to interpret OTel environment variables and set up the exporter was removed in favor of manual handling, which seems to be standard when using the Go SDK (unlike implementations in some other languages).

I was not aware that this is the accepted standard. What is that based on?

Typically OTel tracing considers the whole process to be the "resource".
However, in this case the resource is the input instance. For that
reason a trace provider is created specifically for the input instance
and it is not explicitly set as the global tracer provider.

I would perhaps expect "filebeat" to be the resource; otherwise I would be concerned that we would spam the service inventory with every single input name.

OTel traces for the CEL Input in Elastic Observability

Any idea why it's listed as "unknown" on the path above?

metrics, reg := newInputMetrics(env.MetricsRegistry, env.Logger)

ctx := ctxtool.FromCanceller(env.Cancelation)
otelTracerProvider, err := otel.NewTracerProvider(ctx, getResourceAttributes(env, cfg), i.Name())
Contributor


The Shutdown method is never called on the provider. Could that lead to unexpected data loss?

metrics, reg := newInputMetrics(env.MetricsRegistry, env.Logger)

ctx := ctxtool.FromCanceller(env.Cancelation)
otelTracerProvider, err := otel.NewTracerProvider(ctx, getResourceAttributes(env, cfg), i.Name())
Contributor


Q: Could this lead to significant overhead for multiple inputs? Could we instead make this call once and set any run-specific attributes in the span level?

@efd6
Contributor

efd6 commented Feb 9, 2026

I do see one span per trace without a parent ID, and other parts of the UI identify the root as a root span.

Yes, that's what I'm seeing.

Member

@andrewkroh andrewkroh left a comment


I ran the PR locally. Works as I expected, minus the few things I commented on. 👍

  • URL query parameter redaction works
  • Header redaction works
  • Default sensitive-word detection works
  • File tracer (resource.tracer) includes trace.id and span.id
  • Filebeat debug logs include trace.id and span.id fields
  • Resource attributes are populated
  • BEATS_OTEL_TRACES_DISABLE=cel disables trace export as expected

trace.json

Comment on lines +329 to +331
case <-waitCtx.Done():
runSpan.SetStatus(codes.Unset, waitCtx.Err().Error())
return waitCtx.Err()
Member


When <-waitCtx.Done() fires, the function returns without calling waitSpan.End(). Please add waitSpan.End() before the return.

Comment on lines 478 to 480
if !ok {
metricsRecorder.AddProgramRunDuration(ctx, time.Since(start))
metricsRecorder.AddProgramRunDuration(execCtx, time.Since(start))
continue
Member


The loop continues without ending execSpan. It looks like we are leaking the span?

Comment on lines +492 to 493
errorSpans(err, end{execSpan}, runSpan)
return errors.New("unexpected missing events array from evaluation")
Member


IIUC, at this point, err may be nil here. A fresh error should be used instead:

err := errors.New("unexpected missing events array from evaluation")
errorSpans(err, end{execSpan}, runSpan)
return err
err := fmt.Errorf("unexpected type returned for evaluation cursor element: %T", cursors[0])
metricsRecorder.AddProgramRunDuration(pubCtx, time.Since(start))
errorSpans(err, end{pubSpan}, end{execSpan}, runSpan)
return fmt.Errorf("unexpected type returned for evaluation cursor element: %T", cursors[0])
Member


Duplicated error.

Suggested change
return fmt.Errorf("unexpected type returned for evaluation cursor element: %T", cursors[0])
return err
}

func (rt *ExtraSpanAttribsRoundTripper) RoundTrip(r *http.Request) (*http.Response, error) {

Member


I think this file wants to be gofumpt -w -extraed. 😄

Comment on lines +392 to +394
span.SetAttributes(attribute.StringSlice(
"url.full",
[]string{sanitizedURLString(r.URL, rt.shouldRedact)},
Member


Per OTel semantic conventions, url.full is a string type, not an array. This should be

attribute.String("url.full", sanitizedURLString(r.URL, rt.shouldRedact)).

https://opentelemetry.io/docs/specs/semconv/registry/attributes/url/#url-full

func (rt *ExtraSpanAttribsRoundTripper) RoundTrip(r *http.Request) (*http.Response, error) {

span := trace.SpanFromContext(r.Context())
if span != nil && span.SpanContext().IsValid() {
Member


trace.SpanFromContext never returns nil. It may return a noop, but never nil.

Suggested change
if span != nil && span.SpanContext().IsValid() {
if span.SpanContext().IsValid() {
return resp, err
}

if span != nil && span.SpanContext().IsValid() {
Member


Suggested change
if span != nil && span.SpanContext().IsValid() {
if span.SpanContext().IsValid() {
Comment on lines 33 to 35
// TraceIDKey is key used to add a trace.id value to the context of HTTP
// requests. The value will be logged by LoggingRoundTripper.
const TraceIDKey = contextKey("trace.id")
Member


Dead code?

return false
}

var sensitiveWords = map[string]struct{}{
Member


The word "credentials" in Access-Control-Allow-Credentials is a CORS concept, not a secret. This is a common header, and we should avoid redacting it all the time. Users / developers would need to add it to otel.trace.unredacted to see the value, which is unlikely to occur to them.

Maybe we need a set of known safe headers...?

var knownSafeNames = map[string]struct{}{
      "access-control-allow-credentials": {},
      // etc.
}

We will have to be vigilant on our code reviews for packages to make sure that we are setting the unredacted for things like sort_key, country_code, etc. We can probably put something about this into our code review wiki page for developing packages, and hopefully AI tools can help keep us straight.

@leehinman leehinman removed their request for review February 13, 2026 19:02