
pusher failed to consume trace data due to "context canceled" #3957

Closed
@dastrobu

Description

Is your feature request related to a problem? Please describe.

We encountered the log message msg="pusher failed to consume trace data" err="context canceled" and had significant difficulty identifying the cause. The root cause was not related to the reasons documented in the troubleshooting guide.

The metric tempo_discarded_spans_total indicated that spans were being discarded, but the logs did not clarify the reason.

We ended up patching Grafana Tempo and even the Go SDK to gather more information about the discarded spans. I would like to propose improvements to observability in this area.

Here is what we did:

  1. Additional Logging in Distributor

    We added extra logging to distributor.go to identify which spans were causing issues:

    if err != nil {
        logSpans(batches, &LogReceivedSpansConfig{ // TODO should be LogDiscardedSpansConfig
            Enabled:              true,
            IncludeAllAttributes: true,
            FilterByStatusError:  false,
        }, d.logger)
        return nil, err
    }

    I propose adding a LogDiscardedSpansConfig in addition to LogReceivedSpansConfig to log discarded spans that are causing issues. I can open a PR for this if you think it is a good idea.

  2. Context Cancellation Debugging

    It was extremely challenging to determine why the context was canceled. This is a weak spot in Go, in my opinion. We had to patch the Go SDK to get more information about context cancellation (a note on the standard library's context.WithCancelCause follows this list).

    We added stack traces to context.go:

    func (c *cancelCtx) cancel(removeFromParent bool, err, cause error) {
    	if err == nil {
    		panic("context: internal error: missing cancel error")
    	}
    	// added: wrap the cancellation error with the stack of the goroutine that cancels it
    	// (also requires importing fmt and runtime/debug in context.go)
    	if err == Canceled {
    		stack := debug.Stack()
    		s := string(stack)
    		err = fmt.Errorf("%w stack: %v", err, s)
    	}
    	// ... remainder of cancel unchanged

    This led to the following logs:

    level=error ts=2024-06-20T06:05:40.085835663Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="context canceled stack:
    goroutine 869 [running]:
    runtime/debug.Stack()
        /opt/hostedtoolcache/go/1.22.2/x64/src/runtime/debug/stack.go:24 +0x6b
    context.(*cancelCtx).cancel(0xc047f05a90, 0x1, {0x42ee3a0, 0xc000130c20}, {0x0, 0x0})
        /opt/hostedtoolcache/go/1.22.2/x64/src/context/context.go:546 +0x8a
    context.WithCancel.func1()
        /opt/hostedtoolcache/go/1.22.2/x64/src/context/context.go:240 +0x34
    google.golang.org/grpc/internal/transport.(*http2Server).closeStream(0xc0438ad380, 0xc047e65b00, 0x0, 0x0, 0x0)
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/internal/transport/http2_server.go:1313 +0x39
    google.golang.org/grpc/internal/transport.(*http2Server).handleRSTStream(0xc0438ad380, 0xc047da5d60)
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/internal/transport/http2_server.go:833 +0x94
    google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc0438ad380, {0x43139d0, 0xc04410f830}, 0xc04410f8c0)
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/internal/transport/http2_server.go:685 +0xaa9
    google.golang.org/grpc.(*Server).serveStreams(0xc000670a00, {0x43139d0, 0xc04410f830}, {0x432e560, 0xc0438ad380}, {0x432a618, 0xc0410f96d0})
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/server.go:1013 +0x4f3
    google.golang.org/grpc.(*Server).handleRawConn.func1()
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/server.go:949 +0x92
    created by google.golang.org/grpc.(*Server).handleRawConn in goroutine 866
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/server.go:948 +0x389

    As you can see, the context was canceled because the client sent an RST_STREAM frame; see handleRSTStream and closeStream in the stack trace.

    Therefore, Grafana Tempo is behaving correctly, but it was extremely difficult to determine why the context was canceled.

  3. Confusing Metric Labels

    The discarded spans are added to the tempo_discarded_spans_total metric with the reason reasonInternalError: distributor.go#L440, which was also confusing, since in this case the error originated with the client rather than inside Tempo.
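
For completeness: newer Go versions (1.20 and later, if I recall correctly) let the canceling side attach a cause via context.WithCancelCause, which context.Cause can later recover. That only helps when the code doing the canceling records a cause; the stack trace above shows the gRPC transport canceling through plain context.WithCancel, so no cause is available there, which is why we resorted to patching context.go. A minimal sketch of the mechanism:

    package main

    import (
    	"context"
    	"errors"
    	"fmt"
    )

    func main() {
    	// The canceling side records why it is canceling.
    	ctx, cancel := context.WithCancelCause(context.Background())
    	cancel(errors.New("client closed the stream"))

    	// The receiving side still sees the generic error ...
    	fmt.Println(ctx.Err()) // context canceled

    	// ... but can recover the recorded cause.
    	fmt.Println(context.Cause(ctx)) // client closed the stream
    }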

Describe the solution you'd like

I propose two improvements:

  1. Add a LogDiscardedSpansConfig to the distributor to log discarded spans causing issues.
    distributor:
      config:
        log_discarded_spans:
          enabled: true
        log_received_spans:
          # ...
    This should be easily doable; a rough Go sketch follows this list.
  2. Add a new reason label to tempo_discarded_spans_total to indicate when traces are discarded because of a client-side error (such as a canceled request) rather than an internal one. I am unsure how to implement this, though; the sketch after this list includes one possible reason constant.
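
To make both proposals a little more concrete, here is a rough Go sketch. The type name comes from the proposal above, the field names mirror the existing LogReceivedSpansConfig, and the yaml tags, constant names, and string values are my assumptions rather than Tempo's actual code:

    // Sketch only; names, yaml tags, and string values are assumptions.
    package distributor

    // LogDiscardedSpansConfig mirrors the existing LogReceivedSpansConfig.
    type LogDiscardedSpansConfig struct {
    	Enabled              bool `yaml:"enabled"`
    	IncludeAllAttributes bool `yaml:"include_all_attributes"`
    	FilterByStatusError  bool `yaml:"filter_by_status_error"`
    }

    const (
    	// existing reason applied today when the push fails
    	reasonInternalError = "internal_error"
    	// hypothetical new reason for traces discarded because the client canceled the request
    	reasonContextCanceled = "context_canceled"
    )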

Describe alternatives you've considered

Switching to Jaeger 😉.

Additional context

Tempo was deployed using the tempo-distributed Helm chart.
