🇸🇬 vLLM Singapore Meetup — Highlights
Thanks to everyone who joined! Check out the slides by vLLM’s DarkLight1337 with tjtanaa / Embedded LLM.
* V1 is here: faster startup, stronger CI & perf checks.
* Scaling MoE: clear Expert Parallelism (EP) setup for single/multi-node + elastic EP to match traffic (see the sketch after this post).
* Disaggregated serving: split prefill vs. decode to tune TTFT (time-to-first-token) vs. throughput.
* MLLM speedups: reuse embeddings with a processor cache, optional GPU-side processors, and encoder DP-across-TP (replicate small encoders per TP rank; shard the decoder) to cut comms overhead.
Also: WEKA — vLLM + LMCache Lab + SSD for high-perf KV cache. @ASTARsg MERaLiON — deploying AudioLLM with vLLM + Ray for autoscaling & load balancing.
Slides Folder: https://lnkd.in/gwVdv6-k
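On the Expert Parallelism point above, a minimal single-node sketch, assuming a recent vLLM build that exposes the `enable_expert_parallel` engine argument (the model name and GPU counts are illustrative, not from the talk):

```python
# Minimal sketch: single-node Expert Parallelism (EP) for a MoE model.
# Assumes a recent vLLM that exposes `enable_expert_parallel`; the model
# name and GPU counts below are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # any MoE checkpoint
    tensor_parallel_size=4,        # shard attention/dense layers across 4 GPUs
    enable_expert_parallel=True,   # distribute experts across ranks instead of sharding each expert
)

outputs = llm.generate(
    ["Explain expert parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```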
vLLM
Software Development
An open-source, high-throughput, and memory-efficient inference and serving engine for LLMs.
About us
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs
- Website: https://github.com/vllm-project/vllm
- Industry: Software Development
- Company size: 51-200 employees
- Type: Nonprofit
Updates
vLLM reposted this
Hi folks - If you're in the Austin area on Wednesday, September 17th, we (PyTorch ATX) are hosting a joint meetup with the vLLM community at the Capitol Factory and we'd love to have you join us. The sessions are listed below. You'll get a solid grounding in vLLM and also learn about two really cool, groundbreaking new projects: the semantic router and llm-d. We have 200 people already signed up, but still have a few spots open, so please help us share the event. It's going to be awesome! https://lnkd.in/gPwt-ZQn
- Getting started with inference using vLLM - Steve Watt, PyTorch ambassador (see the sketch after this post)
- An intermediate guide to inference using vLLM - PagedAttention, Quantization, Speculative Decoding, Continuous Batching and more - Luka Govedič, vLLM core committer
- vLLM Semantic Router - Intelligent Auto Reasoning Router for Efficient LLM Inference on Mixture-of-Models - Huamin Chen, vLLM Semantic Router project creator
- Combining Kubernetes and vLLM to deliver scalable, distributed inference with llm-d - Greg Pereira, llm-d maintainer
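To complement the "Getting started with inference using vLLM" session, here is a minimal offline-inference sketch; the model name and sampling settings are illustrative, not taken from the talk:

```python
# Minimal "getting started" sketch: offline batched inference with vLLM.
# The model name and sampling settings are illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "What is PagedAttention?",
    "Why does continuous batching improve throughput?",
]
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any HF causal LM works
for output in llm.generate(prompts, sampling):
    print(output.prompt, "->", output.outputs[0].text)
```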
🚀Join us for the Boston vLLM Meetup on September 18! Our first Boston meetup back in March was fully packed, so register early! Hosted by Red Hat and Venture Guides, this event brings together vLLM users, developers, maintainers, and engineers to explore the latest in vLLM and optimized inference. Expect deep technical talks, live demos, and plenty of time to connect with the community.
📍Location: Venture Guides office by TD Garden/North Station
🕔Time: 5:00 PM – 8:30 PM
Agenda highlights:
* Intro to vLLM & project update
* Model optimization with LLM Compressor and Speculators (see the sketch after this post)
* Demo: vLLM + LLM Compressor in action
* Distributed inference with llm-d
* Q&A, discussion, and networking (with pizza 🍕 & refreshments)
👉 Register here: https://luma.com/vjfelimw
Come meet the vLLM team, learn from experts, and connect with others building the future of inference.
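On the LLM Compressor agenda item, a rough sketch of serving a quantized model with vLLM. The model name is illustrative; `quantization="fp8"` asks vLLM to quantize weights on the fly, whereas checkpoints produced with LLM Compressor (compressed-tensors format) are normally picked up from the model config without extra flags:

```python
# Rough sketch: serving a quantized model with vLLM.
# The model name is illustrative; `quantization="fp8"` quantizes weights on the
# fly, while LLM Compressor checkpoints (compressed-tensors) are usually
# detected from the model config without any extra flag.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model name
    quantization="fp8",                        # on-the-fly FP8 weight quantization
)
out = llm.generate(["Summarize what LLM Compressor does."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```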
LinkedIn not only uses vLLM at massive scale but also actively contributes to the community. Check out their wonderful blog: https://lnkd.in/gFV6zA5J
This blog post was completed back in May, and looking at it now, it still feels like a diary of the journey we’ve been on together in AI Infra Model Serving. As I shared in my earlier post, the LLM Serving team was founded by a group of incredibly talented and passionate engineers. I first met some of them during a vLLM meetup with AWS, and it’s been amazing to see how far we’ve come since then. In just 1.5 years, the team has grown at a remarkable pace. We started by learning how to use vLLM, then mastered it, and eventually customized it to meet LinkedIn’s unique needs. Along the way, our work has been adopted broadly across the LinkedIn ecosystem. Early examples include Hiring Agent and Job Search, and today many LinkedIn products and services are powered by vLLM. At the end of that blog, we expressed gratitude to our partners and friends who have supported us—because none of these achievements would have been possible without you. Red Hat: Michael Goin, Robert Shaw, Nick Hill NVIDIA: Rachel O., Ed Nieda, Harry Kim UCB SkyComputing: Simon Mo, Woosuk Kwon, Zhuohan Li, Lily (Xiaoxuan) Liu LMCache: Yihua Cheng, Kuntai Du, Junchen Jiang https://lnkd.in/dJAAAXFH
vLLM reposted this
I just ran batch inference on a 30B parameter LLM across 4 GPUs with a single Python command! The secret? Modern AI infrastructure where everyone handles their specialty:
📦 UV (by Astral) handles dependencies via uv scripts
🖥️ Hugging Face Jobs handles GPU orchestration
🧠 Qwen AI team handles the model (Qwen3-30B-A3B-Instruct-2507)
⚡ vLLM handles efficient batched inference
I'm very excited about using uv scripts as a nice way of packaging fairly simple but useful ML tasks in a somewhat reproducible way. This, combined with Jobs, opens up some nice opportunities for making pipelines that require different types of compute. Technical deep dive and code examples: https://lnkd.in/e5BEBU95
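The linked deep dive has the actual code; as a rough sketch, a single-file uv script for this kind of batch inference could look like the following (the inline PEP 723 metadata declares the vLLM dependency; the model id and `tensor_parallel_size=4` mirror the post's description, the rest is assumed):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["vllm"]
# ///
# Rough sketch of a single-file uv script for batch inference; run with `uv run infer.py`.
# The model id and tensor_parallel_size mirror the post; everything else is assumed.
from vllm import LLM, SamplingParams

prompts = [f"Write a one-line summary of topic #{i}." for i in range(32)]

llm = LLM(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    tensor_parallel_size=4,  # shard the 30B model across 4 GPUs
)
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```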
vLLM reposted this
🚨 Attention vLLM users – last call! 🚨 The Call for Proposals for our vLLM Featured Track at Ray Summit closes this Wednesday, July 30. If you're building with vLLM in production, optimizing inference, or exploring advanced use cases — we want to see it. This track is all about showcasing real-world implementations and hard-won lessons from the vLLM community. Need inspiration? Check out last year's top vLLM talks: https://lnkd.in/gmRhSbHk Submit your proposal here: https://lnkd.in/gjvKdvFF
vLLM reposted this
🚀 Big big news for multimodal devs! The transformers ↔️ vLLM integration just leveled up: Vision-Language Models are now supported out of the box. If the model is integrated into Transformers, you can now run it directly with vLLM — no need to rewrite or duplicate code. Just plug it in and go. Zero extra effort. Performance might differ model to model (we’re working on that!), but functional support is guaranteed. Curious how to serve Transformers models with vLLM? Full docs here 👉 https://lnkd.in/d-KjqbmU #multimodal #transformers #vLLM #VLM #opensource
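The linked docs are the authoritative reference; as a rough sketch, assuming the Transformers backend is selected via the `model_impl` engine argument (the flag value and model name are assumptions, so check the docs for the exact invocation):

```python
# Rough sketch: forcing vLLM's Transformers backend for a model without a
# native vLLM implementation. The `model_impl` argument and model name are
# assumptions; see the linked docs for the exact flag and supported models.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # illustrative VLM checkpoint
    model_impl="transformers",            # use the Transformers modeling code path
)
out = llm.generate(["Describe what a vision-language model does."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```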
vLLM reposted this
🎉Congratulations to Microsoft for the new Phi-4-mini-flash-reasoning model trained on NVIDIA H100 and A100 GPUs. This latest addition to the Phi family provides developers with a new model optimized for high-throughput and low-latency reasoning in resource-constrained environments. Bring your data and try out demos on the multimodal playground for Phi on the NVIDIA API Catalog ➡️ https://lnkd.in/geuGhZsS 📷 The first plot shows average inference latency as a function of generation length, while the second plot illustrates how inference latency varies with throughput. Both experiments were conducted using the vLLM inference framework on a single A100-80GB GPU over varying concurrency levels of user requests. 🤗 https://lnkd.in/gswYMYt9
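For reference, a minimal sketch of trying the model locally with vLLM on a single GPU; the Hugging Face model id is assumed from the model name in the post, and `trust_remote_code` may or may not be required for the hybrid architecture:

```python
# Minimal sketch: running Phi-4-mini-flash-reasoning on a single GPU with vLLM.
# The Hugging Face model id is assumed from the model name in the post.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-4-mini-flash-reasoning",
    max_model_len=8192,       # keep the KV cache modest on a single GPU
    trust_remote_code=True,   # may be needed for the hybrid architecture
)
prompt = "Solve step by step: if 3x + 5 = 20, what is x?"
out = llm.generate([prompt], SamplingParams(temperature=0.6, max_tokens=512))
print(out[0].outputs[0].text)
```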