Bringing situational awareness to observability with Grafana and Prometheus


About this session

When it comes to monitoring a large network of touchscreen multimedia kiosks across the U.S., it often turns out that the most useful leading indicators are more qualitative than technical: Does the screen content look stuck or blank? 

Is the touchscreen too noisy?

Did we forget to install something?

In this lightning talk, Orange Barrel Media Engineering Manager Christoph Roedig will explore how to make observability tools more helpful in real-world scenarios where technical system metrics such as CPU, RAM, and disk on their own don’t accurately show what’s important.

Sometimes it’s about finding just the right query to turn some technical data into an indicator. In other cases, you have to invent your own measurement entirely. Chris will share examples of creative uses of Prometheus Node Exporter, PromQL, and Grafana query transformations to bridge observability and situational awareness.

Speakers

Speaker 1 (00:05): Hi. So I’m Chris. I work at a company called Orange Barrel Media, more about that in a minute, and I manage production systems. So let’s talk about that. The title says situational awareness, so real quick, what do I mean? It’s good to measure things, I think we all agree; what’s better is actually knowing what’s going on, because that way we can make better operational decisions. And it turns out that this is pretty hard where I work, because we have this nationwide network of information kiosks. These are in about 20 cities throughout the US. They’re about this tall. They have touchscreens, they provide free Wi-Fi, emergency call buttons, wayfinding, things like that, information for residents. We put them in the cities for free because we can run ads on them and share the revenue with the cities. So these are really useful, but they’re also very complicated, and it’s also very important that we keep them running.

(01:12): So just by the numbers, we have 680 units, actually now up to 700. That means we have nearly 1,400 touchscreens running apps that need to be running. And in total, we have about 3,000 devices that we monitor, because it’s not just the touchscreens, it’s also the cellular routers, the security cameras, the computers themselves, and other electronics within the kiosk that we need to keep an eye on. And that means we have about 3 million time series in our Grafana Mimir cluster, and we just love measuring things. It’s super fun when you have this platform in cities across the country. So here’s an example of what we can measure. I can’t see very well, but raise your hand if you think you know what’s happening here. Okay, good. No hands. I’ll give you a hint. This is ambient light levels in various US cities.

(02:14): And the cities are marked in red, and the peaks kind of follow those yellow arrows. If you haven’t guessed by now, this is April 8th, 2024, and these are the ambient light sensors on our LCDs that normally balance the backlight, and here they’re seeing an eclipse. So we got to watch an eclipse march across the country with one line of PromQL. That was really fun, but that’s not what we need to measure. So what do we need to measure? We need to measure all the way from machine to experience, and that gets kind of tricky; I’ll explain. In the middle you have your operating system and your apps, and we all know there’s plenty of help for that. I’ve actually learned a lot about that today and yesterday. You can generally find plugins that help you with that: profiling, Node Exporter, all that kind of stuff.

(03:07): When you go down towards the machine, it gets harder, because there might not be plugins; you get some weird APIs. How do you instrument a cellular router that someone made? If you go in the other direction, it just gets weird, because we’re talking about content and usability. These things aren’t even well defined, and they’re sometimes not even measurable. So I’m just going to go through a couple of examples today of how we approach this problem in our kiosks and how we measure all the way from machine to experience, to get a better idea of what’s going on and what we should do about it in the field. So how do we do it? We started out with something like this: every computer had a dashboard like this. This is the kind of dashboard you usually have for a computer. It tells you something about the CPU, some time series.

(03:59): And usually an engineer looks at this and says, look, you need to go out and replace something, or this thing’s crashed, get a new one. And Grafana can tell us really actionable things like the system CPU IRQ time is 1.27, but it won’t tell us whether ads are playing or whether a user’s having a hard time pulling up the transit app. Today, it looks more like this. So we have this nice dashboard, and every little indicator on it earned its place. At some point someone had a problem and said, I wish I could see when the touchscreen is overactive, or how many times the thing had to reboot itself. So everything here earned its place and is looked at by not just engineers or SREs like myself, but folks who are in the field opening these things up, replacing parts, or even folks who have the job of taking pictures of the kiosks to show them off.

(04:59): And so a dashboard like this spans that entire spectrum, all the way from machine to usability. And it’s the thing where Grafana can say, “Hey, the kiosk is doing its job.” Plain and simple, no interpretation necessary. Some examples of this: for example, we check content on nearly 1,400 screens. This means we know whether it’s stuck, whether it’s blank. So if a bad software update rolls out and everyone sees a loading spinner on the sidewalk, we see it here. And we’re actually just looking at the pixels. We have our display hosts and we take screenshots constantly, and then we run them through ImageMagick and we extract things like the RMS change in pixels over time and the density of edges in the image. Then that goes into Node Exporter and shows up on these dashboards. And if it doesn’t look good, a little alert shows up in Slack and we know that, “Hey, this kiosk has a stuck screen, or a lot of kiosks have a stuck screen, you probably deployed something bad.”
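The two screenshot signals described here, RMS pixel change between frames and edge density, can be sketched without ImageMagick to show the idea. This is a minimal pure-Python version, assuming grayscale frames as 2D lists of 0–255 values; the metric names are hypothetical, not the ones Orange Barrel Media actually uses.

```python
# Sketch of the two screenshot health signals: a stuck screen has an RMS
# pixel change near zero between consecutive frames, and a blank screen
# has almost no edges. Metric names below are made up for illustration.
import math

def rms_change(prev, curr):
    """Root-mean-square per-pixel difference between two frames."""
    diffs = [(p - c) ** 2 for pr, cr in zip(prev, curr) for p, c in zip(pr, cr)]
    return math.sqrt(sum(diffs) / len(diffs))

def edge_density(frame, threshold=32):
    """Fraction of horizontally adjacent pixel pairs whose difference
    exceeds a threshold -- a crude stand-in for an edge filter."""
    edges = total = 0
    for row in frame:
        for a, b in zip(row, row[1:]):
            total += 1
            if abs(a - b) > threshold:
                edges += 1
    return edges / total if total else 0.0

def textfile_metrics(prev, curr):
    """Render both signals in Prometheus textfile-collector format."""
    return (
        f"kiosk_screen_rms_change {rms_change(prev, curr):.3f}\n"
        f"kiosk_screen_edge_density {edge_density(curr):.3f}\n"
    )

blank = [[0] * 8 for _ in range(8)]                       # all-black frame
busy = [[(x * 37) % 256 for x in range(8)] for _ in range(8)]

print(textfile_metrics(blank, blank))  # stuck and blank: both near zero
print(textfile_metrics(blank, busy))   # content changed: both nonzero
```

In the setup the talk describes, output like this would be written to a `.prom` file that Node Exporter’s textfile collector scrapes, which is how the numbers end up on the dashboards.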

(06:06): Another example here. Oh, forgot that cue. Yeah, Grafana can now tell us you have content problems, and it can tell us that at scale, which is neat, right? You have content problems on 20% of your fleet. We also monitor not just the displayed content, but the touchscreen itself. So we have touchscreens, and the idea is the kiosk is sitting there doing nothing, it’s playing some ads, and then you walk up and you touch it and the apps come up. And we need that to work well. If the touchscreen is broken, it can be overactive, and the thing is just stuck on the apps, and it just doesn’t look good. It looks kind of broken that way. It should go back to sleep when no one’s using it. So we actually monitor touch activity at the hardware interface level.

(06:51): So we pull the raw stream of events. If you’re a Linux user, you can cat this out; there’s a certain path that will give you this fire hose of touchscreen data. So we take that fire hose and we compute things like how often is it being touched, how long are the gestures, where are they, and how densely populated are they? How clustered are they? And then we can get metrics from that. So this is a normal kiosk: occasionally someone walks up and touches it, there are locations, we’ve plotted them X/Y here, and the gesture durations are less than one second. That looks healthy; I think that kiosk is okay. This one, however, not so good. This one’s at 32 touches per minute for several hours, and all the touches are on one line. So we pretty much know that one of the lines in the touch layer is broken, or something like that.
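The raw stream in question is the Linux evdev interface: the kernel exposes each touch device under `/dev/input/event*` as a sequence of fixed-size `struct input_event` records. Here is a minimal sketch of mining that stream for the metrics mentioned above. The kernel constants are the real values from `linux/input-event-codes.h`; the stuck-line heuristic and its thresholds are made up for illustration, not Orange Barrel Media’s actual rule.

```python
# Parse a raw evdev byte stream and derive touch-health metrics:
# touches per minute, gesture durations, and how many distinct Y
# positions the touches land on (a broken touch layer tends to report
# everything on one line).
import struct

EVENT_FMT = "llHHi"            # struct input_event on 64-bit Linux
EVENT_SIZE = struct.calcsize(EVENT_FMT)
EV_KEY, EV_ABS = 0x01, 0x03    # real kernel event-type constants
BTN_TOUCH = 0x14A              # touch down/up
ABS_X, ABS_Y = 0x00, 0x01      # absolute touch coordinates

def summarize(raw):
    """Return (touch count, gesture durations in seconds, distinct Y values)."""
    touches, durations, ys = 0, [], set()
    down_at = None
    for off in range(0, len(raw), EVENT_SIZE):
        sec, usec, etype, code, value = struct.unpack_from(EVENT_FMT, raw, off)
        t = sec + usec / 1e6
        if etype == EV_KEY and code == BTN_TOUCH:
            if value == 1:                     # finger down
                touches += 1
                down_at = t
            elif value == 0 and down_at is not None:
                durations.append(t - down_at)  # finger up: gesture length
        elif etype == EV_ABS and code == ABS_Y:
            ys.add(value)
    return touches, durations, ys

def looks_overactive(touches, minutes, ys):
    """Hypothetical alert rule: sustained high touch rate, all on one line."""
    return touches / minutes >= 30 and len(ys) <= 2

def ev(t, etype, code, value):
    """Pack one synthetic input_event record at time t (seconds)."""
    return struct.pack(EVENT_FMT, int(t), int((t % 1) * 1e6),
                       etype, code, value)

# Synthesize a broken panel: a 0.1 s touch every 2 s, always at Y=512.
raw = b"".join(
    ev(t, EV_KEY, BTN_TOUCH, 1) + ev(t, EV_ABS, ABS_Y, 512) +
    ev(t + 0.1, EV_KEY, BTN_TOUCH, 0)
    for t in range(0, 120, 2)
)
touches, durations, ys = summarize(raw)
print(touches, len(ys), looks_overactive(touches, minutes=2, ys=ys))
```

On a real kiosk the same summarizer would read the device node directly instead of a synthesized buffer, and the resulting counts would be exported as metrics like any others.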

(07:46): And so that one will alert, and it will look like this on our little dashboard: it’ll say the touchscreen is overactive, and the content is not rotating because the apps are up all the time. And Grafana can now tell us, “You have content problems because your touchscreen is buggy.” So now we have a better clue: we can go out there with a new touchscreen controller. Finally, there are some things you just can’t measure remotely, but we do that anyway. I’ll give examples. We have this kiosk. Grafana says it’s all good, everything’s green. Then you show up and it looks like this. This one literally has a bullet hole in it, and this one’s heat damaged, so that’s not good. We don’t want that. And then the other kiosk, come on, click. Oh, there we go. Grafana says, “Hey, the router is offline.”

(08:37): And we say, “Okay, that happens.” We’ll get out there later with our new router, maybe an Ethernet cable, or something like that. No worries, it happens, we’ll fix it. And we show up and it looks like that: a car hit it. The router is offline, you’re right about that, but we kind of had to show up with something more than a router. And so our last defense, when all the cool metrics and all the analyzing of data fails, is our field technician team. We have about 30 people throughout the country, and they drive around and walk around all day. They walk up to the kiosk, they clean off any dirt, they operate the kiosk like we would expect a user to, and they make reports via a mobile app. Those reports, I mean, this is really low tech, but we just have a Postgres data source that hooks right into the application database.
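The low-tech approach described here is just a SQL query over the app’s own reports table, surfaced through Grafana’s Postgres data source. A hypothetical sketch of the shape of that query, using Python’s built-in sqlite3 as a stand-in for Postgres; the table and column names are invented for illustration.

```python
# Simulate the "field report" query a map panel could run: one row per
# kiosk with its latest check and flag state. sqlite3 stands in for the
# Postgres application database mentioned in the talk.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE kiosk_reports (
    kiosk_id   TEXT,
    lat        REAL,
    lon        REAL,
    flagged    INTEGER,   -- 1 = technician flagged a problem
    checked_at TEXT
);
INSERT INTO kiosk_reports VALUES
    ('ATX-014', 30.2672, -97.7431, 0, '2024-06-01'),
    ('ATX-022', 30.2702, -97.7420, 1, '2024-06-01');
""")

# Latest report per kiosk -- one row per map marker.
rows = db.execute("""
    SELECT kiosk_id, lat, lon, flagged, MAX(checked_at) AS checked_at
    FROM kiosk_reports
    GROUP BY kiosk_id
""").fetchall()
print(rows)
```

In Grafana the equivalent query would back a geomap-style panel, with the `flagged` column driving the marker color.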

(09:28): And it lets us make a nice map. Right here we can see which kiosks were checked and which ones are flagged with problems. When a kiosk is flagged with a problem, it goes into a channel that the tech supervisor sees, and they can dispatch some help to take care of it. And what Grafana does here is it puts the human observation on the same pane of glass as the technical observation. So you can go all the way from some low-level CPU thing up to the actual user experience. And that way Grafana can tell us something like, “Hey, you have content problems, your touchscreen is buggy, and also, Matt showed up and it was cracked and he couldn’t fix it.” And that’s pretty much it. So my takeaways for this kind of thing: really talk to your stakeholders, work backwards from what they need to know, and be really creative with how you measure things. Thank you.