LLM-D, Supercharged HPA and GKE AI Labs
The News
GKE
- Introducing llm-d: llm-d is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve large language models at scale. It's a collaboration between Google, NVIDIA, IBM, and Red Hat that aims to simplify LLM serving on Kubernetes. The project has multiple workstreams you can read about in the blog, or check out the GitHub repo.
- Performance HPA profile GA and default in Autopilot: We introduced the new HPA Performance profile back in November 2024, and we have made many improvements since. The new stack delivers 3x faster autoscaling and improved reliability at scale, supporting 1,000 HPA objects within SLO (up from 300). It also opens the door to expanded HPA capabilities such as native custom metrics integration, parallel processing, and tolerance handling, as well as future multidimensional autoscaling. Because the profile is enabled at the cluster level, existing HPA objects benefit without any spec changes (see the sketch after this list).
- GKE AI Labs: GKE AI Labs is a new one-stop shop for everything AI on GKE. We have moved all tutorials and guides to this website, where you can find code and step-by-step instructions for deploying LLMs as well as OSS solutions on GKE.
- Confidential Nodes for GPU workloads: GKE now supports running GPU workloads on Confidential Nodes. Supported VM families vary by GKE version; check the release notes for details.
- Container-Optimized Compute is the default: From GKE 1.32.3 onward, Container-Optimized Compute (CoC) is the default autoscaler stack. CoC is our revamped cluster autoscaling stack with improved Pod scheduling latency.
- GKE threat detection in SCC: Container Threat Detection triggers findings based on signals extracted from containers running on GKE. There are multiple types of signals, such as CLI execution and malicious code execution (more details). These findings now surface in Security Command Center.
- vLLM TPU support is GA: vLLM now supports TPU chips. This guide shows how to deploy Llama 3.1 70B on TPU v6e (Trillium) on GKE Autopilot; a rough Deployment sketch follows this list.
- [Live] Five Key Google Kubernetes Engine Features You Must Know: Tune in on June 5th to hear Gari Singh cover the five key GKE features you should know.
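To make the HPA item above concrete, here is a minimal sketch of a standard autoscaling/v2 HorizontalPodAutoscaler. The Performance profile is a cluster-level capability (GA, and default in Autopilot), so ordinary HPA objects like this one pick up the faster stack automatically; the Deployment name, replica bounds, and CPU target below are placeholder assumptions.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                    # placeholder name
spec:
  scaleTargetRef:                  # the workload being autoscaled
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # placeholder Deployment
  minReplicas: 2                   # assumed bounds for illustration
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale to keep average CPU near 60%
```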
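And for the vLLM-on-TPU item, a rough sketch of what a single-host vLLM Deployment on GKE TPU v6e nodes can look like. The image, model ID, topology, and chip count are assumptions for illustration; the linked guide has the exact values for Llama 3.1 70B on Autopilot.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-tpu
  template:
    metadata:
      labels:
        app: vllm-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice  # Trillium
        cloud.google.com/gke-tpu-topology: 2x4               # assumed 8-chip topology
      containers:
        - name: vllm
          image: vllm/vllm-tpu:latest                        # placeholder image; see the guide
          args:
            - --model=meta-llama/Llama-3.1-70B-Instruct      # assumed model ID
            - --tensor-parallel-size=8                       # shard across the 8 TPU chips
          ports:
            - containerPort: 8000                            # vLLM's OpenAI-compatible API
          resources:
            limits:
              google.com/tpu: "8"                            # request all chips on the slice
```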
The recordings from Google Cloud Next 2025 are available on demand: https://cloud.withgoogle.com/next/25/session-library?filters=vod-recorded-session#all
AI/ML
- Deploy to Cloud Run from AI Studio: You can start building an app in AI Studio and deploy it directly to Cloud Run. Even if your app needs a local LLM, Cloud Run supports running LLMs with a GPU attached (a rough YAML sketch follows this list). Check it out, it's cool.
- Deploy to Cloud Run from Vertex AI: Like AI Studio, Vertex AI also supports deploying GenAI apps to Cloud Run straight from the console.
- Gemini Cloud Assist launched new cool stuff: Among other things, you can now ask Cloud Monitoring about incidents, ask Artifact Analysis about detected vulnerabilities, and test org policies. Cloud Assist is launching a lot of cool stuff; check out the release notes.
- Transforming Kubernetes and GKE into the leading platform for AI/ML: This is not news per se but rather a summary of all the work we are doing in Kubernetes and GKE to make them the prime platform for running your AI/ML workloads.
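Related to the Cloud Run items above: if your app needs a local LLM, a Cloud Run service can request a GPU declaratively. The sketch below follows Cloud Run's Knative-style YAML; the service name, image, and resource sizes are assumptions, and the L4 accelerator selector reflects Cloud Run's documented GPU support at the time of writing.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: local-llm                                # placeholder service name
spec:
  template:
    spec:
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4  # request an NVIDIA L4 GPU
      containers:
        - image: ollama/ollama                   # placeholder image serving a local LLM
          resources:
            limits:
              cpu: "8"                           # assumed sizing; GPU services need generous CPU/memory
              memory: 32Gi
              nvidia.com/gpu: "1"
```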
The Community
- Zero-Downtime Pod Migration in Kubernetes: Learn how to achieve near-zero-downtime migrations in Kubernetes using readinessProbe and preStop lifecycle hooks (a minimal example follows this list).
- GKE Cost Analysis with BigQuery and Kubecost: Learn how to combine BigQuery and Kubecost to analyze and manage GKE costs.
- Cloud Service Mesh global control, zero-pain upgrades: We are working to make service mesh easy on Google Cloud, and this article highlights how Cloud Service Mesh makes that possible.
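For the zero-downtime migration article above, the core pattern looks roughly like this: a readinessProbe keeps a Pod out of Service endpoints until it can actually serve, and a short preStop sleep lets endpoint removal propagate through load balancers before the container receives SIGTERM. The image, health path, and timings are placeholder assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 45  # must exceed the preStop sleep
      containers:
        - name: web
          image: nginx:1.27              # placeholder image
          readinessProbe:                # traffic flows only after this passes
            httpGet:
              path: /                    # assumed health endpoint
              port: 80
            periodSeconds: 5
          lifecycle:
            preStop:                     # delay SIGTERM so endpoints drain first
              exec:
                command: ["sleep", "10"]
```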