
AI Reliability Engineering: Welcome to the Third Age of SRE

SREs must build AI we can trust, leveraging the emerging ecosystem of tools and standards.
Jun 4th, 2025 1:00pm
Photo by Nasik Lababan on Unsplash.

When Clayton Coleman’s “inference is the new web app” line was dropped at KubeCon NA, it resonated. Just five years back, ask any Site Reliability Engineer (SRE) about their job, and you’d hear about keeping web apps fast, scalable, and resilient. Today? The landscape is shifting beneath our feet. AI inference workloads — the process where a trained model uses its knowledge to make predictions on new data — are becoming as mission-critical as web applications ever were.

Inference refers to the process by which a trained model applies its learned patterns to new, unseen data to generate predictions or decisions. During inference, the model utilizes its knowledge to respond to real-world inputs.

This evolution demands a new discipline: AI Reliability Engineering (AIRe). We’re no longer just battling latency spikes in HTTP requests; we’re grappling with token generation delays in LLMs. Optimizing database queries feels almost quaint compared to optimizing model checkpoints and tensors. AI models, like the web apps before them, demand intense scalability, reliability, and observability — but on a level we’re still architecting.

The New Stack of AI

I’ve spent almost two years deep in AI Reliability Engineering — researching, prototyping, and building real-world inference systems. From DevOps conferences to SRE Day and community meetups in Nuremberg and London, I’ve shared hard-earned lessons with peers in the field. Now, I’m bringing those insights here.

Unreliable AI is worse than no AI at all. 

  • Inference isn’t just model execution — it’s an operational discipline with its own set of architectural trade-offs and engineering patterns. Unlike training, where time and cost can be amortized, inference is on the hot path. Every millisecond matters.
  • Real-time vs. Batch: Inference operates in two distinct modes. Real-time (or online) inference powers experiences like chatbots, fraud detection, and self-driving cars, where low latency is non-negotiable. Batch (offline) inference, on the other hand, chews through large datasets at scheduled intervals to classify images, mine logs, or forecast trends.
  • Resource Profiles: Though typically lighter than training, inference still demands precision engineering. Real-time applications require not only fast computation but also highly available infrastructure. CPUs still have a role, but modern inference stacks increasingly rely on GPUs, TPUs, or custom silicon such as AWS Inferentia, paired with runtimes like NVIDIA TensorRT, for low-latency performance.
  • Deployment Footprints: Inference runs anywhere, from edge devices to hyperscale clouds. You’ll find it in serverless endpoints, Kubernetes clusters, and even tiny IoT modules. Cloud platforms like SageMaker, Vertex AI, Hugging Face, and Together.ai have streamlined deployment, but the decision often comes down to cost, control, and latency.
  • Optimization Playbook: The battle for speed and efficiency is ongoing. Teams use quantization (FP32 → INT8), model distillation, and Neural Architecture Search (NAS) to tune performance without compromising output; a minimal quantization sketch follows this list. The goal? Smaller, faster, leaner inference engines.
  • Observability and Monitoring: Traditional telemetry stacks fall short. Inference workloads need more — tracking prediction latency, token throughput, drift, and even hallucination rates. Tools like OpenTelemetry, Prometheus, and AI-native traces are no longer optional.
  • Scalability: Predictable isn’t part of the vocabulary. Inference traffic can spike with usage patterns, requiring aggressive autoscaling (Kubernetes HPA, Cloud Run) and intelligent load balancing (Envoy, Istio, or KServe) to stay ahead of demand.
  • Security Frontlines: AI inference brings new attack surfaces — from adversarial inputs to data leakage risks. Engineers must defend model endpoints like APIs: with authentication, rate-limiting, encryption, and runtime integrity checks.
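
To make the quantization step above concrete, here is a minimal sketch using PyTorch’s post-training dynamic quantization. The model, layer sizes, and input are placeholders, and a real pipeline would re-run its evaluation suite on the quantized copy before promoting the INT8 artifact to serving.

```python
# Minimal sketch: post-training dynamic quantization (FP32 -> INT8) in PyTorch.
# The model below is a stand-in; in practice you would load a trained checkpoint.
import torch
import torch.nn as nn

model_fp32 = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model_fp32.eval()

# Quantize the Linear layers' weights to INT8; activations are quantized dynamically.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

# Sanity-check that both models produce outputs of the same shape on a sample input.
x = torch.randn(1, 512)
print(model_fp32(x).shape, model_int8(x).shape)
```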

Inference is no longer just a sub-process of machine learning. It’s the application. It’s production. And it’s redefining the operational stack beneath it. 

Traditional SRE principles offer a foundation, but they don’t quite fit AI workloads. 

  • Probabilistic Nature: AI models aren’t deterministic like most web apps. The same input might yield different outputs. A model can boast 100% uptime yet spew incorrect, biased, or nonsensical results. This fundamentally changes how we define “reliable.”
  • Shifting Metrics: Uptime SLAs? Necessary, but not sufficient. Welcome to the world of accuracy SLAs. We need to define and measure performance based on precision, recall, fairness, and model drift; a minimal accuracy-SLO check follows this list.
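
One way to make an accuracy SLA operational is to score a labeled sample of production traffic against explicit thresholds. The sketch below does exactly that; the metric targets, label source, and check cadence are illustrative assumptions, not prescriptions.

```python
# Minimal sketch: check precision and recall from a labeled production sample
# against hypothetical accuracy-SLO targets.
from sklearn.metrics import precision_score, recall_score

SLO = {"precision": 0.92, "recall": 0.90}  # illustrative targets

def check_accuracy_slo(y_true, y_pred):
    """Return each metric's observed value and whether it meets the SLO."""
    observed = {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
    return {name: (value, value >= SLO[name]) for name, value in observed.items()}

# Example: predictions later verified by a downstream system or human review.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
for metric, (value, ok) in check_accuracy_slo(y_true, y_pred).items():
    print(f"{metric}: {value:.2f} {'OK' if ok else 'SLO BREACH'}")
```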

Emerging AI Challenges – SRE Day – AIRe 2025

  • Infrastructure Evolution: Concepts like ingress and horizontal pod autoscaling evolve. We now need tools and techniques like model mesh, LoRA adapter load balancing, AI Gateways, and dynamic resource allocation, especially for GPU-heavy workloads. Kubernetes itself is adapting through efforts like WG-Serving, DRA (Dynamic Resource Allocation), and the Gateway API to better handle these specialized needs.
  • Observability Gaps: Standard tools track CPU, memory, and latency well, but often miss AI-specific issues like drift, confidence scores, or hallucination rates. We need AI-specific observability.
  • New Failure Modes: Forget simple crashes. We now face “silent model degradation,” or “model decay”: a gradual, often invisible decline in performance, accuracy, or fairness. Treating this like the critical production incident it is requires a new mindset and tooling; a minimal drift-detection sketch follows this list.
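
The sketch below shows one way to treat silent degradation as an alertable condition: compare a live window of model scores against a baseline distribution and open an incident when they diverge. The two-sample KS test, window sizes, and alert threshold are assumptions; teams also track per-feature drift, PSI, or embedding distances.

```python
# Minimal sketch: flag distribution drift between a baseline set of model scores
# (captured at deployment time) and a live window sampled from production.
import numpy as np
from scipy.stats import ks_2samp

ALERT_P_VALUE = 0.01  # hypothetical sensitivity; tune to your traffic volume

def detect_drift(baseline_scores, live_scores):
    """Two-sample KS test; returns (drifted, statistic, p_value)."""
    result = ks_2samp(baseline_scores, live_scores)
    return result.pvalue < ALERT_P_VALUE, result.statistic, result.pvalue

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.70, scale=0.10, size=5_000)  # scores at deploy time
live = rng.normal(loc=0.62, scale=0.14, size=1_000)      # silently degrading model

drifted, stat, p = detect_drift(baseline, live)
if drifted:
    print(f"DRIFT ALERT: KS={stat:.3f}, p={p:.2e}; open an incident")
```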

Model decay, or silent model degradation: unlike traditional software issues that trigger immediate crashes or errors, AI models can degrade silently, continuing to function but with increasingly inaccurate, biased, or inconsistent outputs.

Why We Treat Silent Model Degradation Like a Production Incident 

Because it is one. Unlike crashing pods or failing endpoints, silent model degradation slips under the radar — the model keeps responding, but its answers grow weaker, biased, or just wrong. Users don’t see 500 errors; they get hallucinations, toxic outputs, or faulty decisions. That’s not just a bug — it’s a breach of trust. In the world of AI, correctness is uptime. When reliability means quality, degradation is downtime.

Gateway API Inference Extension, OpenInference and AI Gateways

Perhaps we won’t just extend Kubernetes for AI — we might eventually need to fork it. 

Large Language Models (LLMs) require specialized traffic routing, rate limiting, and security enforcement capabilities that standard Kubernetes Ingress mechanisms weren’t built to handle. Kubernetes, architected around stateless web apps, wasn’t designed with inference in mind. While it’s adapting, key gaps remain.

Inference workloads demand tightly integrated solutions for hardware acceleration, resource orchestration, and high-throughput traffic control. The Kubernetes ecosystem is catching up with initiatives like WG-Serving (targeting optimized AI/ML serving), Device Management (focused on integrating GPUs/TPUs via DRA), and the evolving Gateway API Inference Extension, which lays the groundwork for scalable and secure LLM endpoint routing. Meanwhile, emerging AI Gateways step in to fill the void — providing routing logic, observability, and access control tailored to inference.

Still, we’re layering AI on top of an orchestration system that wasn’t originally meant for it. Google’s announcement of supporting 65K-node Kubernetes clusters by swapping etcd with Spanner-backed storage hints at a future where foundational changes might be required. Perhaps we won’t just extend Kubernetes for AI — we might eventually need to fork it.

So, how do we apply SRE practices to this new AI reality? 

  • Define AI-Centric SLOs/SLAs: Move beyond uptime to include accuracy, fairness, latency, and drift targets. Establish clear commitments (SLAs) for metrics like TTFT (Time To First Token), TPOT (Time Per Output Token), accuracy, and bounds on bias; a latency-instrumentation sketch follows this list.
  • Build AI Observability: Implement robust monitoring using tools like OpenTelemetry and Grafana, but augment them with AI-specific tracing and evaluation platforms (OpenInference) to track metrics like model response distribution, confidence scores, and error types (e.g., hallucinations).
  • Develop AI Incident Response: Create playbooks specifically for AI failures like sudden drift or bias spikes. Implement automated rollbacks to stable model versions or AI circuit breakers.
  • Engineer for Scale and Security: Leverage techniques like load balancing across model replicas, caching, optimized GPU scheduling (an area still evolving in Kubernetes), and AI Gateways for managing traffic, security (like token-based rate limiting, semantic caching), and authorization. Protect model integrity through provenance tracking, secure distribution, and runtime monitoring.
  • Continuous Evaluation: Model evaluation isn’t a one-off task. It spans pre-deployment (offline tests), pre-release (shadow and A/B tests), and continuous post-deployment monitoring for drift and degradation.
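
To ground the TTFT and TPOT targets above, here is a minimal instrumentation sketch. The `stream_tokens` client is hypothetical and the Prometheus metric names and buckets are assumptions; what matters is where the measurements happen: request start, first token, and last token.

```python
# Minimal sketch: record TTFT (Time To First Token) and TPOT (Time Per Output
# Token) around a streaming LLM call. `stream_tokens` is a hypothetical client
# that yields tokens as they arrive; metric names and buckets are illustrative.
import time
from prometheus_client import Histogram

TTFT = Histogram("llm_time_to_first_token_seconds", "Latency to first token",
                 buckets=(0.1, 0.25, 0.5, 1, 2, 5))
TPOT = Histogram("llm_time_per_output_token_seconds", "Average per-token latency",
                 buckets=(0.005, 0.01, 0.025, 0.05, 0.1))

def generate_with_metrics(prompt, stream_tokens):
    start = time.monotonic()
    first_token_at = None
    tokens = []
    for token in stream_tokens(prompt):  # hypothetical streaming client
        if first_token_at is None:
            first_token_at = time.monotonic()
            TTFT.observe(first_token_at - start)
        tokens.append(token)
    if first_token_at is not None and len(tokens) > 1:
        TPOT.observe((time.monotonic() - first_token_at) / (len(tokens) - 1))
    return "".join(tokens)
```

These histograms can then back latency SLOs and alerting the same way request-latency histograms do for web apps.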

Example Model Evaluation SLA in Production

AI Gateways: The SRE Tool for the AI Era 

In the early days of SRE, we relied on load balancers, service meshes, and API gateways to manage traffic, enforce policies, and maintain observability. Today, inference workloads demand the same — but with more complexity, more scale, and far less tolerance for latency or failure. That’s where AI Gateways come in.

Think of them as the modern SRE’s all-in-one box for AI: routing requests to the right model, balancing load across replicas, enforcing rate limits and security policies, and exposing deep observability hooks — all at once. Projects like Gloo AI Gateway are pushing this forward. They’re tackling enterprise-grade challenges, such as model cost control, token-based security, and real-time tracing of LLM responses — challenges that traditional service meshes weren’t built for.
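
As one concrete example of a control an AI Gateway can enforce, the sketch below implements token-based rate limiting with a token bucket keyed by API key, counting LLM tokens rather than requests. The budget, refill rate, and throttling behavior are assumptions; a production gateway would express this as configuration or policy rather than inline application code.

```python
# Minimal sketch: token-based rate limiting keyed by API key. Budgets are counted
# in LLM tokens rather than requests; refill rate and budget are assumptions.
import time
from collections import defaultdict

RATE_TOKENS_PER_SEC = 200     # hypothetical refill rate per API key
BUDGET_TOKENS = 10_000        # hypothetical burst budget per API key

class TokenBucketLimiter:
    def __init__(self):
        self.level = defaultdict(lambda: BUDGET_TOKENS)
        self.updated = defaultdict(time.monotonic)

    def allow(self, api_key: str, requested_tokens: int) -> bool:
        """Debit the caller's token budget; deny when the bucket runs dry."""
        now = time.monotonic()
        elapsed = max(0.0, now - self.updated[api_key])
        self.level[api_key] = min(BUDGET_TOKENS,
                                  self.level[api_key] + elapsed * RATE_TOKENS_PER_SEC)
        self.updated[api_key] = now
        if self.level[api_key] >= requested_tokens:
            self.level[api_key] -= requested_tokens
            return True
        return False  # the gateway would return HTTP 429 instead of forwarding

limiter = TokenBucketLimiter()
print(limiter.allow("team-a", 9_000))   # True: within budget
print(limiter.allow("team-a", 5_000))   # False: budget exhausted, request throttled
```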

This is where SRE belongs today: not just tuning autoscalers, but operating the control plane for intelligent systems.

The AI Gateway is the new tool on our belt — and maybe the most important one. 

The Third Age of SRE Is AI Reliability Engineering 

Our role as SREs is evolving. We need the curiosity described in “97 Things Every SRE Should Know” more than ever — the drive to understand the entire system, from silicon to the nuances of model behavior. We must build AI we can trust, leveraging the emerging ecosystem of tools and standards.

Björn Rabenstein spoke of a “third age” of SRE, where its principles become universally embedded. While this is true, the new era is being shaped by AI. AI Reliability Engineering isn’t just an extension of SRE; it’s a fundamental reshaping, shifting focus from infrastructure reliability to the reliability of intelligent systems themselves.

Because if Inference truly is the new web app, then ensuring its Reliability is the new Age of SRE. And an unreliable AI? That’s worse than no AI at all.
