Without a trace (query): Troubleshooting simplified with Tempo and Traces Drilldown in Grafana


About this session

Sifting through the deluge of data from your distributed systems can be overwhelming – but it doesn’t have to be. Join Senior Engineering Manager Andrew Stucky to see how Traces Drilldown in Grafana, powered by Tempo, can simplify complex troubleshooting, helping you quickly detect latency issues, identify error trends, and spot anomalies – all without having to learn yet another query language. Andrew will highlight how you can leverage automatic rate, error, duration (RED) metrics created from traces to pinpoint root causes faster. And for those of you who are new to distributed tracing: We know from other users that getting started with metrics derived from traces can be overwhelming at first. That’s why the Traces Drilldown homepage consolidates trace data into a user-friendly view for quick orientation. Plus: Alex Simion and Deepika Muthudeenathayalan, engineers working on the Glovo delivery app, will talk about how traces help their team solve incidents and keep orders moving, especially during peak hours, when every second counts. They’ll walk through learnings from a recent incident when a critical service was failing, new pods weren’t spinning up, and metrics and logs didn’t tell the full story. Tracing was the missing piece that finally revealed what was really happening.

Speakers

Andrew Stucky (00:03): Hello. Thank you for joining us. I’m going to try this again. Hello everyone… Woohoo. I’m excited to talk to you today about traces, Tempo and Traces Drilldown. We’ll be doing Q&A at the end, and I know you’ve probably seen this QR code a million times, so I’m going to skip past it, but scan it if you haven’t. Actually, I think I had our intro (notes) on that slide. I’m Andrew Stucky. I’m from Grafana. I lead the Grafana Drilldown group. I’m honored to be joined by Deepika and Alex from Glovo, a subsidiary of Delivery Hero. So, I want to talk a bit about how metrics and logs alone leave this story half-told. Pretend you wake up to an alert. You might not have to pretend, this might have happened to you recently: users are unable to check out. Your metrics tell you that something is happening.

(01:02): After all, they probably triggered your alert, but metrics alone probably won’t tell you what is happening or where. Logs are pretty good about telling you what is happening: an exception is thrown, et cetera, et cetera. You can find the needle in the haystack, but finding that (needle) can be challenging. To complicate matters further, as teams continue to adopt microservices, a single backend service is often just a piece of the puzzle. We might have 2 microservices or 10, and the puzzle has a way of getting larger. So what is tracing and how does it help with observability of distributed systems? Tracing is basically logging improved with context that is shared across HTTP boundaries, which allows you to see the flow through the various components that your system contains.
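For reference, that shared context typically travels between services as a W3C Trace Context traceparent header, the default propagation format in OpenTelemetry; the values below are purely illustrative and not from the talk.

```
# Format: <version>-<trace-id>-<parent-span-id>-<trace-flags>
# The trace ID is shared by every span in the request; each service records its
# spans against it and passes the header along on outgoing calls.
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```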

(02:07): But first I want to cover some tracing basics. A trace represents an entire request across multiple services and visualizes the flow from start to finish to identify bottlenecks and dependencies. Traces are often high volume and have high cardinality. They can also contain logs or events. A span represents a particular operation within the request flow, capturing where it occurred. Spans usually include timing, success or failure data, and relevant tags or metadata, which we call attributes. Attributes are basically key-value pairs: that metadata I spoke about. Spans can also contain events, which record significant occurrences such as an exception. And finally, spans can contain links to other spans or other traces, which can be helpful when you have really large traces.
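As a rough illustration of how those pieces fit together, here is a simplified sketch of a single span; this is not the exact OTLP schema, and the names, times, and attribute values are made up for illustration.

```yaml
# One span out of a larger trace (simplified, illustrative fields only)
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736   # shared by every span in the request
span_id: 00f067aa0ba902b7
parent_span_id: b7ad6b7169203331             # places this operation in the flow
name: "POST /checkout"
start_time: "2025-03-10T12:50:01.120Z"
end_time: "2025-03-10T12:50:04.830Z"         # timing lives on the span
status: ERROR                                # success or failure
attributes:                                  # key-value metadata
  service.name: checkout-api
  http.status_code: 500
events:                                      # significant occurrences
  - name: exception
    attributes:
      exception.message: "timed out waiting for a connection"
links: []                                    # optional pointers to other spans or traces
```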

(03:11): Coming back to the beginning, traces really provide us a structure to visualize what would otherwise be a pretty chaotic environment. This structure can be really, really useful in distributed systems. A common use case of traces is to pinpoint errors in the context of such a larger system. This doesn’t necessarily mean that one signal is better than the other. Traces don’t rule them all; they have their own strengths and weaknesses. So I think the signals really are better together. Metrics tell us that something might be happening, traces tell us where something is happening, and logs tell us what is happening.

(03:58): Now that we’ve covered the basics, let’s look at what tracing looked like in Grafana before Traces Drilldown. You wake up to your alert: users are unable to check out. You’re in dashboards, you’re looking at metrics, you’re looking at logs, and they might not be sufficient for this particular incident. So you think, all right, we’re going to navigate to traces in Grafana. This will save the day. Andrew told me so. This will find the where. So you go to Explore, you select your Tempo data source and … “no data”. Okay, so what now? Well, we write a query, and even simple queries can get complex. For example, if we dissect the query on the screen, we want to see a list of services, and that list will be the descendant services that are experiencing errors, originating from traces that go through the frontend service. These dependencies can be kind of complex when you try to query for them. So easy, right? This is why we built Traces Drilldown. No more writing queries: deliver value from tracing faster, using point-and-click experiences. But instead of me doing the demo, I’ve already talked enough about traces and Drilldown. I’m going to give Glovo the opportunity to share one of their use cases and sell you on Traces Drilldown.
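For reference, a query of the kind Andrew describes might look roughly like this in TraceQL, using the `>>` descendant operator; it is a sketch, not the exact query from the slide.

```traceql
{ resource.service.name = "frontend" } >> { status = error }
```

Reading it requires knowing the structural operators and attribute scoping, which is exactly the query-language overhead Traces Drilldown is meant to remove.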

Alex Simion (05:33): Thank you. Hello everyone. My name is Alex and I’m here with Deepika. We’re part of the SRE team at Glovo, and today we want to touch upon two topics: how we used Traces Drilldown during one of our recent incidents, and how that shaped the way we collect telemetry. So first of all, who is Glovo? Glovo is an on-demand delivery platform that lets users order anything from food and groceries to any other item a local store might sell. We operate in 23 countries across Europe, Africa, and Asia, and we are part of a bigger group called Delivery Hero, which is one of the biggest delivery groups in the world. To give you more insight into our business, this is one of the most important metrics we have at Glovo. It’s called the “orders created” metric, and as you can see, the volume per day has different peaks at different times of the day.

(06:26): For example, at lunch or dinner we have many more orders, and as you can imagine, an incident at lunch is much more impactful than one at breakfast. So with that in mind, imagine the following. It’s Monday, 12:50, there are 10 minutes to lunch and you’re getting hungry. You go to the Glovo app, you select your favorite restaurant in the city, find that perfect dish, and go to checkout. And this happens: the app is not responding, the order doesn’t go through, you don’t know what to do, and you’re getting more and more hungry. On our side, we get paged because order creation goes below the normal rate. We quickly check the orders created dashboard and see that orders are going down and we are losing money as we speak. So in the next couple of minutes, Deepika will explain how we mitigated this incident.

Deepika Muthudeenathayalan (07:22): As SREs, we are kind of in the dark when such incidents happen, because any part of our distributed system could fail, causing the order loss. So we built many custom dashboards and metrics in Grafana, which help us during such incidents. This is an example dashboard: it clearly highlights that the checkout part of the flow failed. We go further into the checkout RED dashboards and find that one of its downstream dependencies, called catalog, failed the requests because it was rate limiting checkout. For context, catalog is the service that provides the product information for the order. During the daytime, whenever an incident happens, we always suspect it could be a code change, so we have built a changelog dashboard that pulls events from our deployment tool, Argo. We can see here that there was a deployment in catalog when this incident happened, and there were also some feature toggle changes.

(08:19): We reached out to the engineers on the team, and they confirmed that this could have impacted the checkout failure. So we quickly rolled back the deployment, and in five minutes our order creation was back to normal. In fact, we were over the previous week’s values, so great mitigation. But as SREs, again, our job doesn’t end here, because we need to find the root cause of the incident and make sure it doesn’t repeat. So we go to the application monitoring page of checkout to look for some clues. As expected, we see a huge spike in the error panel because all the traces were being rate limited. What was interesting was that we also saw the latency of checkout was very high. This was kind of surprising, because we were just being rate limited, so the requests should have taken a few milliseconds and failed. So we dig deeper, and then we go to the logs.

(09:18): We see that most of the errors were pointing to the Hikari connection pool. Again, weird, because we didn’t have anything related to the database going on, but it clearly showed that the application was failing to get connections from the connection pool. We check the Hikari metrics further and see that a lot of threads are waiting for connections. So they seemed like two different issues. As per our mitigation, it seemed like there was a faulty deployment in a downstream service, so we just rolled back and the issue was solved. But then, looking at logs and metrics, the issue was pointing towards the database. We checked the performance insights and so on, and our database was perfectly fine. So we realized that the database issue was the effect of the original problem, not the cause. But we still couldn’t find exactly how the catalog failure caused the database failure.

(10:13): We thought of using traces, because traces usually give us a sample, I mean an example. Traces can give us hints, but in this case, when we went there and queried for “500” traces, most of the examples we got again showed the database problems, which didn’t help us much. We are users of Grafana Cloud at Glovo, so we get to pilot many of the features they build. When this incident happened a couple of months back, we had been using Drilldown for, I think, one or two weeks, and it had already helped us during one of our earlier incidents. So we thought, why not? Let’s see if Drilldown can help us find those interesting traces. In the next slides, Alex will walk you through how we actually used Drilldown to find the root cause of this problem.

Alex Simion (11:04): Alright, so in the next couple of slides I’m going to show how we use Drilldown. Usually we go to the (Traces) Drilldown main page. Here we can see the error rate already selected, and there’s a very easy thing to filter by: the service name. Here we filter by the checkout API, the service that had the problem. Then another easy way to understand if there’s any problem in the service is to try different span attributes or resource attributes. Sometimes we go for pod name or service version to understand if the problem is localized or global. But in this specific case, unfortunately, we couldn’t find more than what the metrics and logs had already shown us. So we thought, why not go backwards, from the end to the start? Why not check the product-catalog API traces? And that’s what we did.

(11:56): It’s as easy as setting the service name here to “product-ctg-api”. Then what we wanted to do was focus our search on the rejections that we get from product catalog, so we also filter by the HTTP status code. Drilldown shows you very nicely that you can select one of the status codes that appear in the spans, and in this case we selected the “429”s. The next step was very useful, because we didn’t know exactly what attributes to look for. The comparison tab gives you insight into which attributes have values that differ from the baseline. Scrolling down the page, we saw that one of the attributes, called “flow_name”, which in Glovo basically means the client, had much bigger values than the others. In this case, checkout accounted for 88% of the rejections.

(12:50): So we said, why not filter by this as well? At this moment we have three filters: the first one is “product-ctg-api”, the second one is the “429”s on the HTTP status code, and the third one is the client. Now, here we knew that this was a cascading failure. We knew that the database issue was caused by something else, so we wanted to check some exemplars from the beginning of the incident to see what caused this. That’s exactly what we did here. We check one of the first seconds of the incident, open a trace, and look inside the trace. As you can see, here we are calling the database. There’s a very interesting fact here, because the database call takes less than one millisecond, but the whole trace takes 3.7 seconds and then times out. So something is not right.
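For reference, those three Drilldown filters roughly correspond to a TraceQL filter like the one below; this is a sketch, and the exact attribute names (for example `flow_name` and the HTTP status code attribute) are assumptions based on what the slides show.

```traceql
{ resource.service.name = "product-ctg-api" && span.http.status_code = 429 && span.flow_name = "checkout" }
```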

(13:41): We go down the trace and we see that one of the methods, called “batch product validator”, is taking the most time. If we look inside it, in the nested components, we can see that the product catalog API is called two times. We still didn’t understand why there’s a gap in between the calls, but before going further, we wanted to check whether this was in fact the longest-running span in our incident. That’s why we went to Drilldown again and checked the duration tab, which showed us, in decreasing order of span duration, all the spans that took the most time in the incident. And as you can see, “batch product validator” is indeed one of the longest-running spans. Coming back to the trace, we deep dive into the product catalog API trace, and we see that the exception message tells us we are rate-limited.

(14:38): So we already knew that we were getting rate-limited, but there was a key piece of information missing. If you look at this screenshot, we have a field called “remaining ban seconds”, and this value was actually sent back from product catalog to checkout in a Retry-After header. So basically what happened is that checkout was being told, you should wait this many seconds before calling back, and now it all made sense. Product catalog was being called once by checkout, the call was getting rejected very fast, in about one millisecond, and then three seconds of wait time were spent here. Then it was called again, and by the second time we called product catalog, the request was already timing out. So we understood what the problem with the long-running requests was, but we still didn’t understand how this was connected to the database, so we went further and checked another trace.

(15:38): Here in this trace I want to highlight that the database also takes a lot of time: it takes one second. We also see the same behavior as in the first trace, where product catalog is called multiple times. In this case it’s even called more than two times, it’s called five times, and there’s some time in between the calls. So what we thought at this moment was: what if they happen in the same thread, and the connection is in fact being held for the whole duration of this trace? That’s why we also used another useful feature from the Trace Explorer: we were able to filter by the thread ID specifically, and this granularity helped us understand that they were in fact happening in the same thread. But there was still something missing, because at Glovo we had this problem in the past, a few years ago.

(16:32): For all the services that have a lot of downstream dependencies, we use a property called “open-in-view” in Spring to disable the data source connection being held while other things are happening. So this is the property that we have, and as you can see in the first line, this is a screenshot from the checkout API codebase. Setting it to “false” will release the primary data source connection back to the pool. So this should not have been a problem, but in this specific case, checkout API had overridden this with the “@Transactional” annotation on the method, so it didn’t really help with anything. So this was the incident resolution: checkout API was calling product catalog and getting rate limited, and there were retries with long waiting times in between the retries. While all of this was happening, we were holding DB connections from the pool, and new requests to checkout API could not complete their processing because these connections were being held. In the following slides, Deepika will explain how this shaped the way we collect telemetry at Glovo.
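For context, the Spring property Alex mentions is usually set in application.yml; the snippet below is a minimal sketch rather than the checkout API’s actual configuration.

```yaml
spring:
  jpa:
    # Disable Open Session in View, so a JDBC connection is not pinned to the
    # request for its whole lifetime while remote calls are in flight.
    open-in-view: false
```

The catch is that a method annotated with @Transactional still keeps its transaction, and therefore a pooled connection, open for the full duration of the method, including any remote calls and retry waits made inside it, which is how the benefit of the setting was lost here.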

Deepika Muthudeenathayalan (17:56): Before we go there, I just want to take a moment to explain our OpenTelemetry collector setup. We have apps running on Kubernetes. We’ve instrumented them with OTel agents, and for some other, non-Java-based apps we use SDKs. These create the spans, and then we have something called the “load balancing collector”. Among other uses, we use it to make sure that all the spans of a given trace go to the same instance of our traces collector. And then we do something called tail sampling. This helps us save a lot on costs, because we select which traces we send to Grafana.
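As a rough sketch of that first hop (the names and resolver below are assumptions, not Glovo’s actual configuration), the load-balancing layer can use the OpenTelemetry Collector’s loadbalancing exporter keyed on trace ID, so that every span of a trace lands on the same downstream tail-sampling collector.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID          # route by trace ID so a whole trace stays together
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: traces-collector.observability.svc.cluster.local  # assumed service name
```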

(18:35): So, how Drilldown made us rethink our sampling policy. This is our current tail sampling policy: if there are server errors, we send 100% of them, and for client errors we send 5%. Sometimes, when teams are building new services, they ask us to keep all of their traces because they want to see some patterns, so we made a custom OTTL condition to support that. In general, our probabilistic sampling is 1%, because we thought, why do we have to send all the “200” traces? During this incident with checkout, as you saw in the application page, there were so many slow traces, but when we went to the duration histogram, we only saw a few examples of the slow traces. This is because of our probabilistic sampling: it just samples 1%, and it doesn’t guarantee that the slowest spans get sent to Grafana.
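In collector terms, a policy along those lines might look roughly like the sketch below, using the tail_sampling processor. The percentages come from the talk, but the attribute key used to tell server errors from client errors and the opt-in service name are assumptions, not Glovo’s actual configuration.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Keep every trace that contains a server error (5xx).
      - name: server-errors
        type: numeric_attribute
        numeric_attribute:
          key: http.status_code        # assumed attribute for the 5xx/4xx split
          min_value: 500
          max_value: 599
      # Keep 5% of traces that contain client errors (4xx).
      - name: client-errors-5pct
        type: and
        and:
          and_sub_policy:
            - name: is-client-error
              type: numeric_attribute
              numeric_attribute:
                key: http.status_code
                min_value: 400
                max_value: 499
            - name: sample-5pct
              type: probabilistic
              probabilistic:
                sampling_percentage: 5
      # Keep everything for services that opted in (the custom OTTL condition).
      - name: opt-in-services
        type: ottl_condition
        ottl_condition:
          error_mode: ignore
          span:
            - 'resource.attributes["service.name"] == "new-service"'  # illustrative
      # Baseline: 1% of everything else, e.g. the successful "200" traffic.
      - name: baseline-1pct
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```

The policies are evaluated independently, so a trace is kept if any one of them matches.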

(19:29): We went and explored the OpenTelemetry library to see if there were any tail sampling policies based on latency. There was one, but at that moment it only supported static latency thresholds, and at Glovo, since we have a complex distributed system, we had different thresholds for different services and endpoints, and we thought these could get outdated quickly. But the teams needed something like this soon, because they felt the tracing data was not useful in the case of high-latency issues. So we built something as a stopgap. We always send the SLOs of the spans in their context, so we already had this information available in the root spans, and we thought, why not make an OTTL condition? We intercept the root span, check its duration, and compare it with the SLA in the span context. If the span has breached the SLA, we keep it, and since we use tail sampling, we end up keeping the whole trace with all its spans.
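A minimal sketch of that kind of policy is below; the attribute name, the IsRootSpan() check, and the duration math are assumptions for illustration, since the exact OTTL condition Glovo uses isn’t shown in the talk.

```yaml
processors:
  tail_sampling:
    policies:
      # Keep the whole trace when its root span ran longer than the SLA it carries.
      - name: sla-breach
        type: ottl_condition
        ottl_condition:
          error_mode: ignore
          span:
            # sla_ms is an assumed attribute holding the endpoint's SLA in milliseconds.
            - 'IsRootSpan() and attributes["sla_ms"] != nil and (end_time_unix_nano - start_time_unix_nano) / 1000000 > attributes["sla_ms"]'
```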

(20:36): This was another example where, after we built this, we had a latency issue, and you can see that now we have many more examples of slow traces. This also makes the usage of (Traces) Drilldown better, as we get a lot of insights from the tracing data. I just want to summarize how we are using it in our day-to-day activities at Glovo. I personally mainly use the error panel, because it’s very helpful for picking exemplars. For example, if you have a deployment, you can pick traces from a particular version of the service, or if one of your pods is crashing because of memory or something, you can pick traces from those pods. And there are various resource and span attributes you can use. We also use the duration tab for picking the slow traces during latency issues. That’s pretty much it. Thank you, everyone, for giving us this opportunity, and I hope you found it useful.