Excited to announce that Ray on Anyscale is now available on CoreWeave, further expanding our multi-cloud support. This gives users access to CoreWeave’s purpose-built AI infrastructure, optimized for distributed AI at scale with low latency, fast startup times, and high-performance GPUs. It's a powerful combination for teams building and scaling production AI workloads. Read more from CoreWeave: https://lnkd.in/dN9AjA3s
I learned this the hard way while scaling an AI system: when users batch 1000+ AI tasks, your infra either scales… or burns. And trust me, it usually BURNS FIRST.

My first thought was: scaling AI = more GPUs, faster inference, better prompts. But that’s not the real bottleneck. The real problem? Task orchestration.

When multiple users trigger concurrent generations, you need to handle:
– Async execution
– Retries + failures
– Credit tracking
– Worker stability

That’s where Celery + Redis save your system:
- Redis Queue: Manages async workloads — no blocked threads, no timeouts
- Celery Workers: Scale horizontally when demand spikes
- Atomic Updates: Concurrency-safe credit + DB ops
- Caching: Stops duplicate LLM calls, saves $$ and compute

Once this pipeline clicks:
- Backend breathes
- Costs drop
- Users stay happy

Scaling AI isn’t about bigger models - it’s about smarter pipelines. You don’t need more compute. You need better architecture.
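A minimal sketch of the orchestration pattern described in the post above, assuming Celery with a Redis broker, a Redis-backed cache, and a simple Redis counter for per-user credits. The function names (run_generation, call_llm) and the credit scheme are illustrative assumptions, not a prescription for any specific production system.

```python
# tasks.py — illustrative sketch: Celery + Redis for async AI task orchestration
import hashlib

import redis
from celery import Celery

# Redis is both the Celery broker and a lightweight cache / credit store here.
app = Celery("ai_tasks", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")
cache = redis.Redis(host="localhost", port=6379, db=2)


def call_llm(prompt: str) -> str:
    """Placeholder for the actual model call (hosted API or local inference)."""
    raise NotImplementedError


@app.task(bind=True, max_retries=3, default_retry_delay=5)
def run_generation(self, user_id: str, prompt: str) -> str:
    # Cache key derived from the prompt, so duplicate requests skip the expensive LLM call.
    key = "gen:" + hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache.get(key)
    if cached:
        return cached.decode()

    # Atomic credit decrement: DECRBY is a single Redis operation, so two concurrent
    # workers cannot both spend the same credit. Assumes credits:{user_id} was
    # initialized elsewhere when the user purchased credits.
    remaining = cache.decrby(f"credits:{user_id}", 1)
    if remaining < 0:
        cache.incrby(f"credits:{user_id}", 1)  # refund and bail out
        raise ValueError("insufficient credits")

    try:
        result = call_llm(prompt)
    except Exception as exc:
        cache.incrby(f"credits:{user_id}", 1)  # refund before retrying
        raise self.retry(exc=exc)

    cache.set(key, result, ex=3600)  # cache the result for an hour
    return result
```

Workers are then started with something like `celery -A tasks worker --concurrency=8` and scaled horizontally by pointing more machines at the same Redis broker.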
𝐒𝐩𝐞𝐧𝐝𝐢𝐧𝐠 $15𝐊+/𝐦𝐨𝐧𝐭𝐡 𝐨𝐧 𝐬𝐩𝐞𝐞𝐜𝐡-𝐭𝐨-𝐭𝐞𝐱𝐭? 𝐓𝐡𝐞𝐫𝐞'𝐬 𝐚𝐧𝐨𝐭𝐡𝐞𝐫 𝐰𝐚𝐲.

As a solutions architect, I've watched customers hit cost ceilings with proprietary speech-to-text (STT) services. A contact center processing 10K hours of calls monthly can easily spend $15K+ on transcription alone.

The open-source STT landscape has matured: OpenAI Whisper, Mistral AI Voxtral, NVIDIA Parakeet-V2, Microsoft Phi-4, and NVIDIA Canary-Qwen-2.5B now rival proprietary solutions in accuracy and latency. But evaluating them for your use case is messy: dependency conflicts, inconsistent APIs, complex setup.

So I built 𝑽𝒐𝒙𝑺𝒄𝒓𝒊𝒃𝒆, a lightweight platform to test multiple open-source STT models through a single interface. #VoxScribe supports 11 STT models, including Canary-Qwen, which tops the open-source STT leaderboard on Hugging Face. Models are cached for reuse and can be tried individually or in Compare mode alongside other VoxScribe models.

𝗩𝗼𝘅𝘀𝗰𝗿𝗶𝗯𝗲:
✅ Handles dependency conflicts (transformers version hell = solved)
✅ Compares models side by side (run sequentially)
✅ FastAPI backend with clean REST endpoints
✅ Runs on an AWS g6.xlarge ($0.805/hr vs. $5K/month) and offers a fixed-cost model for large-scale transcription use cases

𝐅𝐚𝐢𝐫 𝐰𝐚𝐫𝐧𝐢𝐧𝐠: This is an MVP. There are some bugs I'm still actively fixing. But the core works, and I think it solves a real problem for teams evaluating STT solutions. If you break it, please tell me how; if you fix it, even better: PRs welcome.

Blog link in the comments. #OpenSource #AWS #SpeechToText
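For a sense of what "one interface over many STT models" can look like, here is a hedged FastAPI sketch in the same spirit. The endpoint path, the model registry, and the use of Hugging Face transformers pipelines are my own illustrative assumptions, not VoxScribe's actual code.

```python
# Illustrative FastAPI sketch of a "compare STT models behind one interface" service.
# Requires ffmpeg on the host for audio decoding by the transformers ASR pipeline.
from functools import lru_cache

from fastapi import FastAPI, File, UploadFile
from transformers import pipeline

app = FastAPI()

# Friendly name -> Hugging Face model id; extend with other open-source STT models.
MODELS = {
    "whisper-small": "openai/whisper-small",
}


@lru_cache(maxsize=4)
def load_model(name: str):
    # Cache loaded pipelines so repeated requests reuse the same weights in memory.
    return pipeline("automatic-speech-recognition", model=MODELS[name])


@app.post("/transcribe")
async def transcribe(audio: UploadFile = File(...), models: str = "whisper-small"):
    data = await audio.read()
    results = {}
    # Run the requested models one after another (side-by-side comparison, sequentially).
    for name in models.split(","):
        asr = load_model(name.strip())
        results[name.strip()] = asr(data)["text"]
    return results
```

Run it with `uvicorn app:app` and POST an audio file; on a GPU instance like the g6.xlarge mentioned above you would pass `device=0` to `pipeline()` to keep inference on the GPU.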
🆕 𝗪𝗵𝗮𝘁’𝘀 𝗻𝗲𝘄 𝘄𝗶𝘁𝗵 𝘁𝗵𝗲 𝗔𝗜 𝗛𝘆𝗽𝗲𝗿𝗰𝗼𝗺𝗽𝘂𝘁𝗲𝗿? 𝘃𝗟𝗟𝗠 𝗼𝗻 𝗧𝗣𝗨, 𝗮𝗻𝗱 𝗺𝗼𝗿𝗲 🆕

➡️ The latest quarterly update on Google Cloud’s AI Hypercomputer introduces major enhancements, including a new TPU-optimized backend for vLLM, broader model support, improved profiling tools, and tighter GPU/TPU integration.

🚀 What’s new:
• vLLM TPU: A new hardware plugin called tpu-inference brings high-performance TPU support to vLLM, unifying PyTorch and JAX workflows on TPUs with minimal code changes.
• Improved hardware/software stack: The update adds broader model coverage (Gemma, Llama, Qwen), better performance than earlier TPU backends, and an upgraded profiling library (XProf Profiler, Cloud Diagnostics XProf) for JAX and PyTorch/XLA.
• Expanded tooling and architecture flexibility: New reference recipes for disaggregated inference (e.g., NVIDIA Dynamo on Google Cloud), RL scaling workflows with NVIDIA NeMo RL, and more instrumentation for time-to-first-token (TTFT) and time-per-output-token (TPOT) metrics.

💡 Why it matters
For developers, ML engineers, and platform teams building large-scale AI systems, these updates mean you can run more open-source, multi-framework models on TPUs with less friction, get deeper insight into performance bottlenecks, and choose architectures (GPU, TPU, disaggregated inference) that better match your model and cost profile.

🔗 Read the full blog here: https://lnkd.in/dZ7vVadC

Which model-serving bottleneck is your team most focused on right now: latency, cost, hardware choice, or framework compatibility? 🤔

#GoogleCloud #AIHypercomputer #TPU #vLLM #Inference #MLInfrastructure #LargeLanguageModels
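For anyone who hasn't used vLLM: the offline Python API is just a few lines, and the point of a backend plugin like the tpu-inference package mentioned above is that this code is meant to stay the same when running on TPU hosts. Package name and install steps are per the linked blog, and the model id below is only an example.

```python
# Minimal vLLM offline-inference sketch. With a TPU backend plugin installed
# (per the AI Hypercomputer blog), the same Python API is intended to run on TPUs.
from vllm import LLM, SamplingParams

# Any supported open model works here; Gemma/Llama/Qwen are called out in the update.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what time-to-first-token (TTFT) measures."], params)

for out in outputs:
    print(out.outputs[0].text)
```

For online serving, the same model is exposed behind an OpenAI-compatible endpoint with `vllm serve <model>`, which is where TTFT/TPOT instrumentation becomes most useful.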
Kubernetes is evolving fast for AI/ML workloads, and Google's latest GKE developments are worth noting:
• Dynamic Resource Allocation (DRA): proper GPU/TPU management in core K8s
• 65K-node clusters supporting 50K-TPU-chip training jobs
• Inference Gateway achieving 30% cost reduction and 60% lower latency
• Secondary boot disks: 29x faster container mounting for large ML images

What impresses me most: these capabilities are being driven upstream into core Kubernetes, not kept proprietary. DRA and JobSet benefit the entire ecosystem.

The infrastructure layer for AI/ML is maturing. Time to rethink how we architect these platforms.

https://lnkd.in/gCYv_u3N

#Kubernetes #MLOps #CloudArchitecture #AIInfrastructure
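The DRA item is the interesting one for context: today, GPU access in Kubernetes is usually requested as an opaque extended resource, which only supports whole-device granularity, while DRA moves toward structured, claimable device requests with sharing. Because the DRA API is still stabilizing across Kubernetes versions, the sketch below (using the official kubernetes Python client) shows only the classic extended-resource request that DRA is designed to improve on; the pod name and image are illustrative.

```python
# For contrast with Dynamic Resource Allocation (DRA): the traditional way to ask
# Kubernetes for a GPU is an opaque extended-resource limit on the container.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:12.4.1-base-ubuntu22.04",
                command=["nvidia-smi"],
                # Whole-GPU granularity only: no sharing, no structured device selection —
                # exactly the limitations DRA is meant to address upstream.
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```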
Big news in AI infrastructure: Tensormesh just secured $4.5M to supercharge server efficiency with advanced KV Caching—promising up to 10x inference improvements. This could be a game-changer for enterprise AI scalability. Would these optimizations impact your AI stack? [via @TechCrunch]
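For context on the technique: KV caching stores the attention key/value tensors computed for earlier tokens so each new decoding step only computes attention for the newest query instead of re-running the whole prefix; cross-request and multi-tier reuse of that cache is where systems like the one above look for their gains. A toy PyTorch sketch of the core idea, purely illustrative and not Tensormesh's implementation:

```python
# Toy illustration of KV caching in autoregressive decoding (single attention head).
import torch

d = 64                       # head dimension
W_q = torch.randn(d, d); W_k = torch.randn(d, d); W_v = torch.randn(d, d)

k_cache = torch.empty(0, d)  # grows by one row per generated token
v_cache = torch.empty(0, d)

def attend_next(x_new: torch.Tensor) -> torch.Tensor:
    """x_new: (d,) embedding of the newest token; returns its attention output."""
    global k_cache, v_cache
    q = x_new @ W_q
    # Append the new K/V row instead of recomputing K/V for the whole prefix.
    k_cache = torch.cat([k_cache, (x_new @ W_k).unsqueeze(0)])
    v_cache = torch.cat([v_cache, (x_new @ W_v).unsqueeze(0)])
    scores = (k_cache @ q) / d ** 0.5        # (seq_len,)
    weights = torch.softmax(scores, dim=0)
    return weights @ v_cache                 # (d,)

# Each decoding step costs O(seq_len * d) with the cache, versus recomputing
# attention for every prefix position without it.
for step in range(5):
    out = attend_next(torch.randn(d))
```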
Organisations that want to run large language models (LLMs) on their own infrastructure, whether in private data centres or in the cloud, often face significant challenges related to GPU availability, capacity, and cost. Learn how to address these challenges with #RedHat #OpenShift AI. #RHAI #AI
ICYMI: Platform teams running AI and machine learning workloads will see immediate benefits from GPU sharing and dynamic allocation capabilities. By Janakiram MSV
⚙️ 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀 𝗷𝘂𝘀𝘁 𝘁𝗼𝗼𝗸 𝗮 𝗵𝘂𝗴𝗲 𝗹𝗲𝗮𝗽 𝗶𝗻 𝘀𝗶𝗺𝗽𝗹𝗶𝗳𝘆𝗶𝗻𝗴 𝗔𝗜 𝘄𝗼𝗿𝗸𝗹𝗼𝗮𝗱𝘀 — 𝗶𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗶𝗻𝗴 𝗦𝗲𝗿𝘃𝗲𝗿𝗹𝗲𝘀𝘀 𝗚𝗣𝗨 𝗖𝗼𝗺𝗽𝘂𝘁𝗲!

One of the biggest headaches for data and AI teams has always been managing GPU infrastructure — spinning up clusters, tuning configurations, and optimizing costs. Now, Databricks Serverless GPU Compute changes the game completely.

𝗛𝗲𝗿𝗲’𝘀 𝘄𝗵𝗮𝘁 𝗶𝘁 𝗺𝗲𝗮𝗻𝘀 𝗳𝗼𝗿 𝗱𝗮𝘁𝗮 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝘀 𝗮𝗻𝗱 𝗠𝗟 𝘁𝗲𝗮𝗺𝘀 👇

💻 𝗡𝗼 𝗜𝗻𝗳𝗿𝗮𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗛𝗮𝘀𝘀𝗹𝗲𝘀
You don’t need to set up or manage GPU clusters manually. Databricks automatically provisions the right GPU resources based on your workload.

⚡ 𝗜𝗻𝘀𝘁𝗮𝗻𝘁 𝗦𝗰𝗮𝗹𝗶𝗻𝗴
Workloads scale up or down automatically — whether you’re fine-tuning an LLM or running large-scale inference.

💰 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗲𝗱 𝗖𝗼𝘀𝘁𝘀
You pay only for what you use. Serverless GPUs are automatically paused when idle, eliminating wasted spend.

🧠 𝗜𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗲𝗱 𝘄𝗶𝘁𝗵 𝗠𝗼𝘀𝗮𝗶𝗰 𝗔𝗜 & 𝗠𝗟𝗳𝗹𝗼𝘄
Train, deploy, and monitor models directly within the Databricks ecosystem — with full observability and governance through Unity Catalog.

💡 𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀:
This makes AI development accessible even for teams without deep infrastructure expertise — freeing up time to focus on innovation instead of configuration.

🚀 Serverless + GPU = the best of both worlds for modern data and AI workloads.

#Databricks #Serverless #GPUCompute #AI #MLOps #DataEngineering #Lakehouse #MosaicAI
2 weeks ago, we participated in The AI Conference in SF. What a vibe!! Some takeaways:

1. AI engineers are becoming AI managers. Software engineers are becoming managers of AI agents, not just code, and are asking for stronger agent orchestration, observability, and debugging tools, not just GPUs.

2. Infra is a big budget line. Infra costs are ballooning, with more data being generated than originally budgeted for and more GPU demand than can be met. Unpredictable fees (especially egress) are getting flagged by finance teams, who want more consistent cost profiles.

3. Vendor lock-in is becoming unacceptable. More teams want to keep optionality open across compute and orchestration platforms. Platforms that offer free data movement are key to providing the flexibility needed in this growth cycle of tooling.

4. Security is top of mind. As adoption grows, so do attack vectors. Advanced teams are asking how to guarantee the integrity of LLM weights over time: how do you prove an LLM was not tampered with through data poisoning? Immutable, verifiable logs are key, and onchain attestations of LLM snapshots and weights can provide strong integrity validation.

Check out the detailed version: https://lnkd.in/eX5m-2m9
Planning your 2026 tech roadmap? WebGPU should be on your radar.📍 Browser-based GPU compute is no longer experimental. It's shipping in production browsers today. And it fundamentally changes how AI inference gets deployed. If you're planning AI features or evaluating compute infrastructure for 2026, now's the time to understand where WebGPU fits in your stack. Our latest post explores the technical architecture, real-world implementations, and when to make the transition: https://lnkd.in/gzZiSsDX #WebGPU #AI #WebGL #EnterpriseArchitecture #ClientSideAI