Paper list for broad topics in machine learning systems
NOTE: Survey papers are annotated with the [Survey 🔍] prefix.
- Paper List for Machine Learning Systems
- Table of Contents
- Data Processing
- Training System
- Inference System
- Attention Optimization
- Mixture of Experts (MoE)
- Communication Optimization & Network Infrastructure for Distributed ML
- Fault Tolerance & Straggler Mitigation
- GPU Memory Management & Optimization
- GPU Sharing
- Compiler
- GPU Kernel Optimization
- LLM Long Context
- Model Compression
- Federated Learning
- Privacy-Preserving ML
- ML APIs & Application-Side Optimization
- ML for Systems
- Energy Efficiency
- Retrieval-Augmented Generation (RAG)
- Simulation
- Systems for Agentic AI
- RL Post-Training
- Multimodal
- Hybrid LLMs
- Others
- References
General
- [arxiv'25] Scalable and Performant Data Loading
- [arxiv'25] OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training
- [arxiv'25] The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution
- [arxiv'25] In-Network Preprocessing of Recommender Systems on Multi-Tenant SmartNICs
- [VLDB'25] cedar: Composable and Optimized Machine Learning Input Data Pipelines
- [HotInfra'24] Lotus: Characterize Architecture Level CPU-based Preprocessing in Machine Learning Pipelines
- [arxiv'24] TensorSocket: Shared Data Loading for Deep Learning Training
- [arxiv'24] Efficient Tabular Data Preprocessing of ML Pipelines
- [MLSys'22] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
- [ISCA'22] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
- [SIGMOD'22] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
- [VLDB'21] Analyzing and Mitigating Data Stalls in DNN Training
- [VLDB'21] tf.data: A Machine Learning Data Processing Framework
Preprocessing stalls
- [arxiv'24] PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
- [ATC'24] Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement
- [HotStorage'24] A Selective Preprocessing Offloading Framework for Reducing Data Traffic in DL Training
- [VLDB'24] FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
- [arxiv'23] Rinas: Training with Dataset Shuffling Can Be General and Fast
- [CVPR'23] FFCV: Accelerating Training by Removing Data Bottlenecks
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [SIGMOD'23] GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning
- [VLDB'23] FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
- [SoCC'23] tf.data service: A Case for Disaggregating ML Input Data Processing
- [ATC'22] Cachew: Machine Learning Input Data Processing as a Service
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters
- [ICPP'19] DLBooster: Boosting End-to-End Deep Learning Workflows with Offloading Data Preprocessing Pipelines
Fetch stalls (I/O)
- [TACO'23] Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
- [ICPP'22] Lobster: Load Balance-Aware I/O for Distributed DNN Training
- [SC'21] Clairvoyant Prefetching for Distributed Machine Learning I/O
Specific workloads (GNN, DLRM)
- [VLDB'25] Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression
- [ISCA'24] PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models
- [arxiv'23] Towards Data-centric Graph Machine Learning: Review and Outlook
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [MLSys'23] RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
- [ASPLOS'22] RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [arxiv'23] MTrainS: Improving DLRM training efficiency using heterogeneous memories
- [SOSP'23] Bagpipe: Accelerating Deep Recommendation Model Training
- [SOSP'23] gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning
- [NSDI'23] BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
- [DAC'22] A Joint Management Middleware to Improve Training Performance of Deep Recommendation Systems with SSDs
- [VLDB'22] Accelerating Recommendation System Training by Leveraging Popular Choices
- [ATC'25] HyCache: Hybrid Caching for Accelerating DNN Input Preprocessing Pipelines
- [ICDE'25] MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage
- [TPDS'23] High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms
- [SOSP'23] UGACHE: A Unified GPU Cache for Embedding-based Deep Learning
- [ATC'23] Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 2.1]
- [FAST'23] SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
- [HPCA'23] iCACHE: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training
- [NeurIPS'22] A Deep Learning Dataloader with Shared Data Preparation
- [CLUSTER'22] Hvac: Removing I/O Bottleneck for Large-Scale Deep Learning Applications
- [ICDE'22] Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
- [ATC'21] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training
- [FAST'20] Quiver: An Informed Storage Cache for Deep Learning
- [ICPP'20] DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
- [arXiv'19] Faster Neural Network Training with Data Echoing
- [HotCloud'19] The Case for Unifying Data Loading in Machine Learning Clusters
- [SIGMOD'26] Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
- [EMNLP'25] Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
- [ICDE'25] Training Data Distribution Estimation for Optimized Pre-Training Data Management
- [arxiv'25] Mixtera: A Data Plane for Foundation Model Training
Data formats
- [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
- [VLDB'21] Progressive compressed records: Taking a byte out of deep learning data
Data pipeline fairness and correctness
- [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines
Data labeling automation
- [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision
- [ICSE'24] An Empirical Study on Low GPU Utilization of Deep Learning Jobs
- [NSDI'24] Characterization of Large Language Model Development in the Datacenter
- [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
- [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
- [arxiv'25] Semantic-Aware Scheduling for GPU Clusters with Large Language Models
- [arxiv'25] Tesserae: Scalable Placement Policies for Deep Learning Workloads
- [arxiv'25] LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems
- [EuroSys'25] Eva: Cost-Efficient Cloud-Based Cluster Scheduling
- [arxiv'25] TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
- [arxiv'24] Zeal: Rethinking Large-Scale Resource Allocation with "Decouple and Decompose"
- [TACO'24] Taming Flexible Job Packing in Deep Learning Training Clusters
- [SoCC'24] Kale: Elastic GPU Scheduling for Online DL Model Training
- [arxiv'24] Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
- [SC'24] PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters
- [OSDI'24] MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
- [ASPLOS'24] Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters
- [Middleware'24] Optimal Resource Efficiency with Fairness in Heterogeneous GPU Clusters
- [IPDPS'24] Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster
- [EuroSys'24] Blox: A Modular Toolkit for Deep Learning Schedulers
- [NSDI'24] Swing: Short-cutting Rings for Higher Bandwidth Allreduce
- [NSDI'24] Towards Domain-Specific Network Transport for Distributed DNN Training
- [NSDI'24] Vulcan: Automatic Query Planning for Live ML Analytics
- [NSDI'24] CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
- [Survey 🔍] [ACM CSUR'23] Deep Learning Workload Scheduling in GPU Datacenters: A Survey
- [arxiv'23] Energy-Efficient GPU Clusters Scheduling for Deep Learning
- [SC'23] EasyScale: Accuracy-consistent Elastic Training for Deep Learning
- [ICPP'23] CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel
- [ICPP'23] Embracing Uncertainty for Equity in Resource Allocation in ML Training
- [SOSP'23] Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
- [NSDI'23] Shockwave: Proactive, Fair, and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 1.2]
- [EuroSys'23] Lyra: Elastic Scheduling for Deep Learning Clusters
- [EuroSys'23] ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
- [ASPLOS'23] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
- [arxiv'22] Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads
- [Survey 🔍] [arxiv'22] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
- [SoCC'22] ESCHER: Expressive Scheduling with Ephemeral Resources
- [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (Synergy)
- [SIGCOMM'22] Multi-resource interleaving for deep learning training (Muri)
- [MLSys'21] Wavelet: Efficient DNN Training with Tick-Tock Scheduling
- [SoCC'21] Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
- [SC'21] Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (Helios)
- [OSDI'21] Privacy Budget Scheduling (DPF)
- [NSDI'21] Elastic Resource Sharing for Distributed Deep Learning (AFS)
- [OSDI'21] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
- [EuroSys'20] Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning (GandivaFair)
- [NSDI'20] Themis: Fair and Efficient GPU Cluster Scheduling
- [OSDI'20] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
- [OSDI'20] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (Gavel)
- [EuroSys'20] AlloX: Compute Allocation in Hybrid Clusters
- [MLSys'20] Resource Elasticity in Distributed Deep Learning
- [NSDI'19] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
- [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
- [EuroSys'18] Optimus: an efficient dynamic resource scheduler for deep learning clusters
- [OSDI'18] Gandiva: Introspective Cluster Scheduling for Deep Learning
- [ASPLOS'26] SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
- [NeurIPS'25] Synergistic Tensor and Pipeline Parallelism
- [arxiv'25] AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training
- [NeurIPS'25] First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training
- [arxiv'25] A Flexible Programmable Pipeline Parallelism Framework for Efficient DNN Training
- [arxiv'25] SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training
- [arxiv'25] AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models
- [arxiv'25] HAPT: Heterogeneity-Aware Automated Parallel Training on Heterogeneous Clusters
- [arxiv'25] Scaling Up Data Parallelism in Decentralized Deep Learning
- [arxiv'25] Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters
- [arxiv'25] TrainVerify: Equivalence-Based Verification for Distributed LLM Training
- [arxiv'25] Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage
- [arxiv'25] ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
- [arxiv'25] Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization
- [arxiv'25] H2: Towards Efficient Large-Scale LLM Training on Hyper-Heterogeneous Cluster over 1,000 Chips
- [arxiv'25] Balanced and Elastic End-to-end Training of Dynamic LLMs
- [arxiv'25] Parallel Scaling Law for Language Models
- [arxiv'25] Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters
- [arxiv'25] You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models
- [arxiv'25] WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
- [arxiv'25] Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
- [arxiv'25] Cornstarch: Distributed Multimodal Training Must Be Multimodality-Aware
- [arxiv'25] PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
- [arxiv'25] AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs
- [arxiv'25] Astra: Efficient and Money-saving Automatic Parallel Strategies Search on Heterogeneous GPUs
- [arxiv'25] Scaling Inference-Efficient Language Models
- [arxiv'25] MiniMax-01: Scaling Foundation Models with Lightning Attention
- [SC'25] Hypertron: Efficiently Scaling Large Models by Exploring High-Dimensional Parallelization Space
- [CLUSTER'25] BMPipe: Bubble-Memory Co-Optimization Strategy Planner for Very-Large DNN Training
- [OSDI'25] WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
- [ISCA'25] FRED: A Wafer-scale Fabric for 3D Parallel DNN Training
- [ISCA'25] MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training
- [ISCA'25] Scaling Llama 3 Training with Efficient Parallelism Strategies
- [MLSys'25] Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training
- [ICLR'25] TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
- [INFOCOM'25] Espresso: Cost-Efficient Large Model Training by Exploiting GPU Heterogeneity in the Cloud
- [ASPLOS'25] GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism
- [ASPLOS'25] FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
- [ASPLOS'25] Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling
- [EuroSys'25] JABAS: Joint Adaptive Batching and Automatic Scaling for DNN Training on Heterogeneous GPUs
- [arxiv'24] Automatically Planning Optimal Parallel Strategy for Large Language Models
- [arxiv'24] Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters
- [arxiv'24] Scaling Deep Learning Training with MPMD Pipeline Parallelism
- [arxiv'24] Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences
- [arxiv'24] HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models
- [arxiv'24] Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training
- [arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
- [arxiv'24] BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
- [arxiv'24] Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models
- [arxiv'24] SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
- [arxiv'24] FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
- [arxiv'24] PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
- [arxiv'24] Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters
- [arxiv'24] Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management
- [arxiv'24] FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
- [arxiv'24] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
- [arxiv'24] Unicron: Economizing Self-Healing LLM Training at Scale
- [arxiv'24] TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading
- [arxiv'24] Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
- [Survey 🔍] [arxiv'24] Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
- [arxiv'24] LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
- [arxiv'24] PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
- [arxiv'24] BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
- [arxiv'24] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- [arxiv'24] Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
- [arxiv'24] GRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning Models
- [arxiv'24] BitDelta: Your Fine-Tune May Only Be Worth One Bit
- [arxiv'24] NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
- [arxiv'24] Accelerating Parallel Sampling of Diffusion Models
- [arxiv'24] Training DNN Models over Heterogeneous Clusters with Optimal Performance
- [arxiv'24] Breaking MLPerf Training: A Case Study on Optimizing BERT
- [arxiv'24] LocMoE: A Low-overhead MoE for Large Language Model Training
- [arxiv'24] Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
- [TPDS'24] UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training
- [Survey 🔍] [ACM CSUR'24] Resource-efficient Algorithms and Systems of Foundation Models: A Survey
- [SOSP'24] Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
- [SOSP'24] Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
- [TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
- [NeurIPS'24] Rethinking Memory and Communication Costs for Efficient Data Parallel Training of Large Language Models
- [NeurIPS'24] SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation
- [SC'24] Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching
- [SC'24] Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
- [SoCC'24] Distributed training of large language models on AWS Trainium
- [TPDS'24] AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost
- [SOSP'24] TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections
- [ICPP'24] AutoPipe: Automatic Configuration of Pipeline Parallelism in Shared GPU Cluster
- [COLM'24] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
- [OSDI'24] nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
- [ATC'24] Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
- [ATC'24] FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
- [ATC'24] OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model
- [HPDC'24] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
- [ICML'24] Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
- [ICML'24] Integrated Hardware Architecture and Device Placement Search
- [MLSys'24] DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines
- [MobiCom'24] Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices
- [EuroSys'24] DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines
- [EuroSys'24] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
- [EuroMLSys@EuroSys'24] ML Training with Cloud GPU Shortages: Is Cross-Region the Answer?
- [ASPLOS'24] AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
- [ASPLOS'24] PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
- [EuroSys'24] Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation
- [NSDI'24] MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
- [NSDI'24] DISTMM: Accelerating Distributed Multi-modal Model Training
- [NSDI'24] Accelerating Neural Recommendation Training with Embedding Scheduling
- [NSDI'24] Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer
- [NSDI'24] QuickUpdate: a Real-Time Personalization System for Large-Scale Recommendation Models
- [NSDI'24] Scaling Large Language Model Training to More Than 10,000 GPUs
- [TKDE'24] Improving Automatic Parallel Training via Balanced Memory Workload Optimization
- extended version of Galvatron (VLDB'23)
- arxiv version (2023): link
- [ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [AAMAS'24] Holonic Learning: A Flexible Agent-based Distributed Machine Learning Framework
- [VLDB'24] Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
- [HPCA'24] Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search
- [NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
- [EuroSys'24] HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
- [arxiv'23] ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- [arxiv'23] FlexModel: A Framework for Interpretability of Distributed Large Language Models
- [arxiv'23] Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
- [arxiv'23] RTP: Rethinking Tensor Parallelism with Memory Deduplication
- [arxiv'23] FP8-LM: Training FP8 Large Language Models
- [arxiv'23] Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
- [arxiv'23] FLM-101B: An Open LLM and How to Train It with $100K Budget
- [arxiv'23] UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
- [arxiv'23] Modeling Parallel Programs using Large Language Models
- [arxiv'23] Proteus: Simulating the Performance of Distributed DNN Training
- [arxiv'23] Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
- [arxiv'23] Decoupled Model Schedule for Deep Learning Training
- [arxiv'23] RAF: Holistic Compilation for Deep Learning Model Training
- [arxiv'23] Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches
- [arxiv'23] Does compressing activations help model parallel training?
- [arxiv'23] Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
- [arxiv'23] Scaling Vision Transformers to 22 Billion Parameters
- [arxiv'23] Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform
- [arxiv'23] TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation
- [arxiv'23] SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction
- [arxiv'23] ATP: Adaptive Tensor Parallelism for Foundation Models
- [ICPP'23] Mercury: Fast and Optimal Device Placement for Large Deep Learning Models
- [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [CLUSTER'23] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
- [NeurIPS'23] ASPEN: Breaking Operator Barriers for Efficient Parallelization of Deep Neural Networks
- [NeurIPS'23] DeepPCR: Parallelizing Sequential Operations in Neural Networks
- [DAC'23] MixPipe: Efficient Bidirectional Pipeline Parallelism for Training Large-Scale Models
- [SC'23] Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency
- [SOSP'23] PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
- [SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
- [MICRO'23] Grape: Practical and Efficient Graphed Execution for Dynamic Deep Neural Networks on GPUs
- [HPCA'23] Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism
- [ACL'23] Sequence Parallelism: Long Sequence Training from System Perspective
- [CCGrid'23] A Deep Learning Pipeline Parallel Optimization Method
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [ATC'23] Accelerating Distributed MoE Training and Inference with Lina
- [ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
- [ATC'23] MSRL: Distributed Reinforcement Learning with Dataflow Fragments
- [Survey 🔍] [TPDS'23] A Survey on Auto-Parallelism of Large-Scale Deep Learning Training
- [ICML'23] SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
- [ICML'23] BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
- [ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- [NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- [NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- [NSDI'23] ARK: GPU-driven Code Execution for Distributed Deep Learning
- [SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [MLSys'23] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- [MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
- [TPDS'23] Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
- [PPoPP'23] Elastic Averaging for Efficient Pipelined DNN Training
- [PPoPP'23] Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems
- [VLDB'23] MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
- [VLDB'23] Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
- [ASPLOS'23] Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers
- [ASPLOS'23] Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression
- [arxiv'22] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- [arxiv'22] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- [ICPP'22] Tesseract: Parallelize the Tensor Parallelism Efficiently
- [NeurIPS'22] Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees
- [SoCC'22] Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
- [MLSys'22] Pathways: Asynchronous distributed dataflow for ML
- [MLSys'22] SRIFTY: Swift and Thrifty Distributed Neural Network Training on the Cloud
- [MLSys'22] Efficient Strong Scaling Through Burst Parallel Training
- [EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
- [ATC'22] Whale: Efficient Giant Model Training over Heterogeneous GPUs
- [NeurIPS'22] AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness
- [PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- [ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- [ICML'22] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- [HPDC'22] Hare: Exploiting Inter-job and Intra-job Parallelism of Distributed Machine Learning on Heterogeneous GPUs
- [OSDI'22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- [NSDI'22] Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks
- [arxiv'21] Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
- [arxiv'21] GSPMD: General and Scalable Parallelization for ML Computation Graphs
- [JMLR'21] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- [TPDS'21] TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism
- [ATC'21] Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.10]
- [MLSys'21] PipeMare: Asynchronous Pipeline Parallel DNN Training
- [ICLR'21] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- [NeurIPS'21] Piper: Multidimensional Planner for DNN Parallelization
- [ICML'21] Memory-Efficient Pipeline-Parallel DNN Training
- [ICML'21] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
- [ICML'21] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- [SC'21] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- [SC'21] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (PTD-P or Megatron-LM v2)
- [FAST'21] Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
- [PPoPP'21] DAPPLE: a pipelined data parallel approach for training large models
- [VLDB'21] Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
- [HPCA'20] AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
- [NeurIPS'20] Efficient Algorithms for Device Placement of DNN Graph Operators
- [arxiv'20] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- [KDD'20 Tutorial] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
- [VLDB'20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
- [SOSP'19] PipeDream: Generalized Pipeline Parallelism for DNN Training
- [NeurIPS'20] Language Models are Few-Shot Learners [From OpenAI]
- [arxiv'20] Scaling Laws for Neural Language Models [From OpenAI]
- [HPCA'19] HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
- [IEEE MICRO'19] Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
- [MLSys'19] Beyond data and model parallelism for deep neural networks (FlexFlow)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [EuroSys'19] Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks
- [EuroSys'19] Supporting Very Large Models using Automatic Dataflow Graph Partitioning (Tofu)
- [SOSP'19] A Generic Communication Scheduler for Distributed DNN Training Acceleration
- [NeurIPS'19] Mesh-TensorFlow: Deep Learning for Supercomputers
- [NeurIPS'19] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- [ICML'18] Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
- [Survey 🔍] [IJCAI'22] Survey on Efficient Training of Large Neural Networks
- [Survey 🔍] [ACM CSUR'19] Demystifying Parallel and Distributed Deep Learning
- [Survey 🔍] [ACM CSUR'19] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools
- [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
- [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
- [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework
For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.
- [SC'25] Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training
- [SIGMOD'25] NeutronHeter: Optimizing Distributed Graph Neural Network Training for Heterogeneous Clusters
- [ICDE'25] CaliEX: A Disk-Based Large-Scale GNN Training System with Joint Design of Caching and Execution
- [arxiv'25] Plexus: Taming Billion-edge Graphs with 3D Parallel GNN Training
- [HPCA'25] Mithril: A Scalable System for Deep GNN Training
- [arxiv'25] Armada: Memory-Efficient Distributed Training of Large-Scale Graph Neural Networks
- [VLDB'25] NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism
- [arxiv'24] FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
- [ICPP'24] GNNDrive: Reducing Memory Contention and I/O Congestion for Disk-based GNN Training
- [VLDB'24] NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams
- [arxiv'23] ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training
- [arxiv'23] Helios: An Efficient Out-of-core GNN Training System on Terabyte-scale Graphs with In-memory Performance
- [arxiv'23] GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism
- [MLSys'23] Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
- [SIGMOD'23] DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [EuroSys'23] MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks
- [KDD'22] Distributed Hybrid CPU and GPU training for Graph Neural Networks on Billion-Scale Heterogeneous Graphs
- [VLDB'22] TGL: a general framework for temporal GNN training on billion-scale graphs
- [OSDI'21] P3: Distributed Deep Graph Learning at Scale
-
[AAAI'26] Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving
-
[EuroSys'26] FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
-
[EuroSys'26] KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving
-
[EuroSys'26] TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling
-
[arxiv'25] TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
-
[arxiv'25] AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
-
[arxiv'25] Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
-
[arxiv'25] OmniInfer: System-Wide Acceleration Techniques for Optimizing LLM Serving Throughput and Latency
-
[arxiv'25] OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving
-
[arxiv'25] Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
-
[arxiv'25] CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
-
[arxiv'25] FengHuang: Next-Generation Memory Orchestration for AI Inferencing
-
[arxiv'25] Synera: Synergistic LLM Serving across Device and Cloud at Scale
-
[arxiv'25] DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
-
[Middleware'25] Argus: Quality-Aware High-Throughput Text-to-Image Inference Serving System
-
[arxiv'25] From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models
-
[arxiv'25] TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
-
[NeurIPS'25] SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
-
[arxiv'25] FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
-
[EMNLP'25] Distributed LLM Serving on Consumer-Grade GPUs by Reconciling Computation and Communication
-
[arxiv'25] Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
-
[MICRO'25] MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving
-
[MICRO'25] Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
-
[arxiv'25] SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
-
[arxiv'25] From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill
-
[CLUSTER'25] Scalable and Fast Inference Serving via Hybrid Communication Scheduling on Heterogeneous Networks
-
[arxiv'25] TridentServe: A Stage-level Serving System for Diffusion Pipelines
-
[arxiv'25] MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment
-
[Survey 🔍] [ACM CSUR'25] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
-
[SOSP'25] Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market
-
[SOSP'25] IC-Cache: Efficient Large Language Model Serving via In-context Caching
-
[SOSP'25] DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction
-
[arxiv'25] TetriServe: Efficient DiT Serving for Heterogeneous Image Generation
-
[arxiv'25] Parallax: Efficient LLM Inference Service over Decentralized Environment
-
[arxiv'25] RServe: Overlapping Encoding and Prefill for Efficient LMM Inference
-
[arxiv'25] Cronus: Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
-
[arxiv'25] Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
-
[COLM'25] OverFill: Two-Stage Models for Efficient Language Model Decoding
-
[ACM MM'25] TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
-
[SC'25] Hetis: Serving LLMs in Heterogeneous GPU Clusters with Fine-grained and Dynamic Parallelism
-
[arxiv'25] FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
-
[arxiv'25] AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving
-
[arxiv'25] Predictable LLM Serving on GPU Clusters
-
[SIGCOMM'25] SCX: Stateless KV-Cache Encoding for Cloud-Scale Confidential Transformer Serving
-
[arxiv'25] Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
-
[arxiv'25] Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics
-
[OSDI'25] BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching
-
[OSDI'25] WaferLLM: Large Language Model Inference at Wafer Scale
-
[OSDI'25] NanoFlow: Towards Optimal Large Language Model Serving Throughput
-
[arxiv'25] HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling
-
[arxiv'25] Equinox: Holistic Fair Scheduling in Serving Large Language Models
-
[arxiv'25] Efficient Mixed-Precision Large Language Model Inference with TurboMind
-
[ICML'25] Packrat: Automatic Reconfiguration for Latency Minimization in CPU-based DNN Serving
-
[arxiv'25] Kairos: Low-latency Multi-Agent Serving with Shared LLMs and Excessive Loads in the Public Cloud
-
[arxiv'25] Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
-
[arxiv'25] Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving
-
[arxiv'25] Unlock the Potential of Fine-grained LLM Serving via Dynamic Module Scaling
-
[ACL'25] SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation
-
[arxiv'25] Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding
-
[arxiv'25] Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
-
[arxiv'25] MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving
-
[CODEML @ ICML'25] TorchAO: PyTorch-Native Training-to-Serving Model Optimization
-
[arxiv'25] On Evaluating Performance of LLM Inference Serving Systems
-
[arxiv'25] PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications
-
[ICML'25] EPIC: Efficient Position-Independent Caching for Serving Large Language Models
-
[arxiv'25] SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference
-
[arxiv'25] Utility-Driven Speculative Decoding for Mixture-of-Experts
-
[ATC'25] DEEPSERVE: Serverless Large Language Model Serving at Scale
-
[ISCA'25] WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
-
[ISCA'25] Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
-
[ICLR'25] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
-
[arxiv'25] Cascadia: A Cascade Serving System for Large Language Models
-
[arxiv'25] Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing
-
[arxiv'25] SkyLB: A Locality-Aware Cross-Region Load Balancer for LLM Inference
-
[arxiv'25] EmbAdvisor: Adaptive Cache Management for Sustainable LLM Serving
-
[arxiv'25] SCORPIO: Serving the Right Requests at the Right Time for Heterogeneous SLOs in LLM Inference
-
[arxiv'25] HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing
-
[arxiv'25] ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs
-
[arxiv'25] TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
-
[arxiv'25] Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving
-
[OSDI'25] Clover: Exploiting Intra-device Parallelism for High Throughput Large Language Model Serving
-
[arxiv'25] ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
-
[arxiv'25] ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor
-
[arxiv'25] Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
-
[arxiv'25] Tempo: Application-aware LLM Serving with Mixed SLO Requirements
-
[arxiv'25] Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
-
[arxiv'25] Streaming, Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving
-
[arxiv'25] Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration
-
[Survey 🔍] [arxiv'25] Taming the Titans: A Survey of Efficient LLM Inference Serving
-
[MLSys'25] SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling
-
[MLSys'25] Marconi: Prefix Caching for the Era of Hybrid LLMs
-
[arxiv'25] PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
-
[arxiv'25] Circinus: Efficient Query Planner for Compound ML Serving
-
[arxiv'25] HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
-
[Mobicom'25] D2MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
-
[arxiv'25] SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference
-
[arxiv'25] gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
-
[arxiv'25] Optimizing SLO-oriented LLM Serving with PD-Multiplexing
-
[arxiv'25] SLO-Aware Scheduling for Large Language Model Inferences
-
[arxiv'25] Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
-
[ISPASS'25] Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
-
[arxiv'25] HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
-
[arxiv'25] DynaServe: Unified and Elastic Tandem-Style Execution for Dynamic Disaggregated LLM Serving
-
[arxiv'25] Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
-
[arxiv'25] Understanding and Optimizing Multi-Stage AI Inference Pipelines
-
[arxiv'24] Fast and Live Model Auto Scaling with O(1) Host Caching
-
[SIGMOD'25] Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
-
[EuroMLSys'25] Performance Aware LLM Load Balancer for Mixed Workloads
-
[MLSys'25] Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
-
[arxiv'25] WaferLLM: A Wafer-Scale LLM Inference System
-
[HPCA'25] PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
-
[HPCA'25] throttLL'eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
-
[arxiv'25] Niyama : Breaking the Silos of LLM Inference Serving
-
[ASPLOS'25] Aqua: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains
-
[ASPLOS'25] Past-Future Scheduler for LLM Serving under SLA Guarantees
-
[ASPLOS'25] Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management
-
[EuroSys'25] SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
-
[EuroSys'25] Multiplexing Dynamic Deep Learning Workloads with SLO-awareness in GPU Clusters
-
[EuroSys'25] NeuStream: Bridging Deep Learning Serving and Stream Processing
-
[arxiv'25] ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving
-
[arxiv'25] PipeBoost: Resilient Pipelined Architecture for Fast Serverless LLM Scaling
-
[ISCA'25] Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
-
[arxiv'25] Jenga: Effective Memory Management for Serving LLM with Heterogeneity
-
[arxiv'25] Collaborative Speculative Inference for Efficient LLM Inference Serving
-
[NSDI'25] SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
-
[arxiv'25] Seesaw: High-throughput LLM Inference via Model Re-sharding
-
[arxiv'25] SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding
-
[arxiv'25] ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput
-
[arxiv'25] Long-Context Inference with Retrieval-Augmented Speculative Decoding
-
[arxiv'25] Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
-
[arxiv'25] KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
-
[arxiv'25] Serving Models, Fast and Slow:Optimizing Heterogeneous LLM Inferencing Workloads at Scale
-
[arxiv'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
-
[arxiv'25] HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
-
[arxiv'25] Autellix: An Efficient Serving Engine for LLM Agents as General Programs
-
[MLSys'25] ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
-
[ICLR'25] HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment
-
[arxiv'25] Memory Offloading for Large Language Model Inference with Latency SLO Guarantees
-
[EuroSys'25] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
-
[ASPLOS'25] Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow
-
[ASPLOS'25] Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
-
[arxiv'25] MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving
-
[arxiv'25] Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
-
[arxiv'25] HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
-
[arxiv'25] DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs
-
[arxiv'25] DeepFlow: Serverless Large Language Model Serving at Scale
-
[arxiv'25] AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
-
[arxiv'25] EchoLM: Accelerating LLM Serving with Real-time Knowledge Distillation
-
[arxiv'25] OMEGA: A Low-Latency GNN Serving System for Large Graphs
-
[arxiv'25] PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
-
[arxiv'25] Hierarchical Autoscaling for Large Language Model Serving with Chiron
-
[arxiv'25] Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management
-
[arxiv'25] Accelerated Diffusion Models via Speculative Sampling
-
[MLSys'25] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
-
[EuroSys'25] A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro
-
[arxiv'24] LLM Inference Unveiled: Survey and Roofline Model Insights
-
[arxiv'24] Efficiently Serving LLM Reasoning Programs with Certaindex
-
[arxiv'24] LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System
-
[arxiv'24] TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications
-
[arxiv'24] Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
-
[arxiv'24] SYMPHONY: Improving Memory Management for LLM Inference Workloads
-
[arxiv'24] A System for Microserving of LLMs
-
[arxiv'24] HashAttention: Semantic Sparsity for Faster Inference
-
[arxiv'24] SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
-
[arxiv'24] Unifying KV Cache Compression for Large Language Models with LeanKV
-
[arxiv'24] PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
-
[Survey 🔍] [ACM CSUR'24] Resource-efficient Algorithms and Systems of Foundation Models: A Survey
-
[ICML'25] SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization [Code]
-
[ICLR'25] SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration [Code]
-
[ICML'25] SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference [Code]
-
[arxiv'24] Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
-
[ACL'24] LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
-
[ACL'24] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
-
[arxiv'24] EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
-
[IPDPS'24] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
-
[arxiv'24] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
-
[NeurIPS'24] Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting
-
[NeurIPS'24] Toward Efficient Inference for Mixture of Experts
-
[NeurIPS'24] Sequoia: Scalable and Robust Speculative Decoding
-
[arxiv'24] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
-
[SC'24] PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
-
[SC'24] SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing
-
[arxiv'24] SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference
-
[arxiv'24] V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM
-
[SenSys'24] LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning
-
[arxiv'24] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
-
[arxiv'24] NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
-
[MICRO'24] Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
-
[arxiv'24] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
-
[arxiv'24] Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs
-
[arxiv'24] POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
-
[PML4LRS @ ICLR2024] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
-
[arxiv'24] MagicPIG: LSH Sampling for Efficient LLM Generation
-
[arxiv'24] Revisiting SLO and Goodput Metrics in LLM Serving
-
[arxiv'24] EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
-
[arxiv'24] ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
-
[EuroSys'25] Fast State Restoration in LLM Serving with HCache
-
[arxiv'24] SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
-
[arxiv'24] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
-
[arxiv'24] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
-
[HPCA'24] KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers
-
[arxiv'24] Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference
-
[NeurIPS'24] Efficient LLM Scheduling by Learning to Rank
-
[arxiv'24] P/D-Serve: Serving Disaggregated Large Language Model at Scale
-
[arxiv'24] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
-
[SOSP'24] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
-
[SOSP'24] LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
-
[SOSP'24] Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
-
[SOSP'24] Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
-
[arxiv'24] LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
-
[ICPP'24] GMM: An Efficient GPU Memory Management-based Model Serving System for Multiple DNN Inference Models
-
[SIGCOMM'24] CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
-
[ES-FoMO @ ICML'24] CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Models
-
[OSDI'24] dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
-
[OSDI'24] Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
-
[OSDI'24] USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
-
[OSDI'24] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
-
[OSDI'24] ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
-
[OSDI'24] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
-
[OSDI'24] Llumnix: Dynamic Scheduling for Large Language Model Serving
-
[OSDI'24] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
-
[ATC'24] Power-aware Deep Learning Model Serving with μ-Serve
-
[ATC'24] Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
-
[ATC'24] PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch
-
[TPDS'24] ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG
-
[Survey 🔍] [arxiv'24] LLM Inference Serving: Survey of Recent Advances and Opportunities
-
[arxiv'24] Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
-
[arxiv'24] Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
-
[arxiv'24] One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
-
[OSDI'24] Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
- [arxiv'24] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
- [ISCA'24] Splitwise: Efficient generative LLM inference using phase splitting
- [ICML'24] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
- [ICML'24] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- [ICML'24] HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
- [ICML'24] EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- [ICML'24] MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
- [MobiSys'24] ARISE: High-Capacity AR Offloading Inference Serving via Proactive Scheduling
- [MobiSys'24] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs
- [arxiv'24] Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
- [MLSys'24] HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
- [MLSys'24] S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- [MLSys'24] Vidur: A Large-Scale Simulation Framework For LLM Inference
- [arxiv'24] The CAP Principle for LLM Serving
- [WWW'24] λGrapher: A Resource-Efficient Serverless System for GNN Serving through Graph Sharing
- [ICML'24] CLLMs: Consistency Large Language Models
- [arxiv'24] BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- [EuroSys'24] Model Selection for Latency-Critical Inference Serving
- [arxiv'24] Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
- [arxiv'24] Learn To be Efficient: Build Structured Sparsity in Large Language Models
- [arxiv'24] Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
- [ISCA'24] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
- [arxiv'24] ALTO: An Efficient Network Orchestrator for Compound AI Systems
- [ASPLOS'24] ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
- [ASPLOS'24] NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
- [arxiv'24] ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys
- [arxiv'24] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- [ICML'24] DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
- [ICLR'24] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
- [arxiv'24] FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
- [arxiv'24] Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
- [arxiv'24] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
- [arxiv'24] LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
- [NSDI'24] Approximate Caching for Efficiently Serving Diffusion Models
- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [arxiv'24] ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
- [arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
- [arxiv'24] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- [arxiv'24] Accelerating Retrieval-Augmented Language Model Serving with Speculation
- [arxiv'24] CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
- [arxiv'24] Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
- [arxiv'24] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- [Survey 🔍] [arxiv'24] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
- [arxiv'24] Learned Best-Effort LLM Serving
- [arxiv'24] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- [ASPLOS'24] SpotServe: Serving Generative Large Language Models on Preemptible Instances
- [arxiv'23] DeltaZip: Multi-Tenant Language Model Serving via Delta Compression
- [arxiv'23] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
- [arxiv'23] Fairness in Serving Large Language Models
- [arxiv'23] Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices
- [arxiv'23] Punica: Multi-Tenant LoRA Serving
- [arxiv'23] Pipeline Parallelism for DNN Inference with Practical Performance Guarantees
- [arxiv'23] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- [arxiv'23] High-throughput Generative Inference of Large Language Models with a Single GPU
- [NeurIPS'23] SpecTr: Fast Speculative Decoding via Optimal Transport
- [HPDC'23] Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
- [SOSP'23] Paella: Low-latency Model Serving with Virtualized GPU Scheduling
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [MLSys'23] Efficiently Scaling Transformer Inference
- [EuroSys'23] Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access
- [EuroSys'23] Tabi: An Efficient Multi-Level Inference System for Large Language Models
- [EuroSys'23] Pocket: ML Serving from the Edge
- [OSDI'23] AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- [NSDI'23] SHEPHERD: Serving DNNs in the Wild
- [VLDB'23] Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
- [ICML'23] Fast Inference from Transformers via Speculative Decoding
- [SIGMOD'22] Serverless Data Science - Are We There Yet? A Case Study of Model Serving
- [OSDI'22] Orca: A Distributed Serving System for Transformer-Based Generative Models
- [OSDI'22] Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences
- [ATC'22] SOTER: Guarding Black-box Inference for General Neural Networks at the Edge
- [ATC'22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
- [ATC'22] Tetris: Memory-efficient Serverless Inference through Tensor Sharing
- [ATC'22] PetS: A Unified Framework for Parameter-Efficient Transformers Serving
- [ATC'21] INFaaS: Automated Model-less Inference Serving
- [SoCC'21] Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving
- [arxiv'21] Supporting Massive DLRM Inference through Software Defined Memory
- [MobiCom'20] SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud
- [SC'25] UltraAttn: Efficiently Parallelizing Attention through Hierarchical Context-Tiling
- [SC'25] RingX: Scalable Parallel Attention for Long-Context Learning on HPC
- [NeurIPS'25 Spotlight] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training [Code]
- [arxiv'25] SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention [Code]
- [MLSys'25] FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference
- [MLSys'25] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- [NeurIPS'24] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- [ICLR'24] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- [NeurIPS'22] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- [EuroSys'26] Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading
- [EuroSys'26] MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
- [arxiv'25] Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems
- [arxiv'25] MicroMoE: Fine-Grained Load Balancing for Mixture-of-Experts with Token Scheduling
- [arxiv'25] MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
- [arxiv'25] Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models
- [arxiv'25] FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
- [arxiv'25] DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction
- [arxiv'25] BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference
- [SC'25] Diff-MoE: Efficient Batched MoE Inference with Priority-Driven Differential Expert Caching
- [SC workshop'25] Compression Error Sensitivity Analysis for Different Experts in MoE Model Inference
- [SC workshop'25] Batch Tiling on Attention: Efficient Mixture of Experts Training on Wafer-Scale Processors
- [arxiv'25] Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining
- [arxiv'25] MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
- [arxiv'25] ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts
- [arxiv'25] MergeMoE: Efficient Compression of MoE Models via Expert Output Merging
- [MICRO'25] Optimizing All-to-All Collective Communication with Fault Tolerance on Torus Networks
- [arxiv'25] GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models
- [arxiv'25] Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting
- [arxiv'25] ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models
- [SOSP'25] KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models
- [arxiv'25] MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE
- [arxiv'25] DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning
- [arxiv'25] Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts
- [NeurIPS'25] BrainMoE: Cognition Joint Embedding via Mixture-of-Expert Towards Robust Brain Foundation Model
- [NeurIPS'25] S’MoRE: Structural Mixture of Residual Experts for Parameter-Efficient LLM Fine-tuning
- [NeurIPS'25] The Omni-Expert: A Computationally Efficient Approach to Achieve a Mixture of Experts in a Single Expert Model
- [NeurIPS'25] MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE
- [NeurIPS'25] FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts
- [NeurIPS'25] FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed Mixture-of-Experts Training
- [NeurIPS'25] FlashMoE: Fast Distributed MoE in a Single Kernel [Code]
- [arxiv'25] Steering MoE LLMs via Expert (De)Activation
- [arxiv'25] HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
- [arxiv'25] LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference
- [SC'25] MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?
- [arxiv'25] LongCat-Flash Technical Report
- [arxiv'25] Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
- [arxiv'25] HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference
- [arxiv'25] MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models
- [SIGCOMM'25] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
- [ICLR'25] Ada-K Routing: Boosting the Efficiency of MoE-based LLMs
- [arxiv'25] Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
- [ICML'25] I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts
- [arxiv'25] Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
- [SC'25] X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms
- [SIGCOMM'25] MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training
- [arxiv'25] HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
- [arxiv'25] PiKV: KV Cache Management System for Mixture of Experts
- [arxiv'25] BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs
- [arxiv'25] Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
- [ACL'25] EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
- [ACL'25] FOLDMOE: Efficient Long Sequence MoE Training via Attention-MoE Pipelining
- [arxiv'25] The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts
- [arxiv'25] Muon is Scalable for LLM Training
- [arxiv'25] Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model
- [arxiv'25] Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
- [arxiv'25] HarMoEny: Efficient Multi-GPU Inference of MoE Models
- [arxiv'25] Load Balancing Mixture of Experts with Similarity Preserving Routers
- [arxiv'25] MoE-GPS: Guidlines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing
- [arxiv'25] EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models
- [arxiv'25] CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning
- [arxiv'25] PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval
- [arxiv'25] Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
- [arxiv'25] Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony
- [ICML'25] FloE: On-the-Fly MoE Inference on Memory-constrained GPU
- [arxiv'25] PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning
- [arxiv'25] Faster MoE LLM Inference for Extremely Large Models
- [arxiv'25] Accelerating Mixture-of-Experts Training with Adaptive Expert Replication
- [NAACL'25] Marrying LLMs with Dynamic Forecasting: A Graph Mixture-of-expert Perspective
- [NAACL'25] Sparser Mixture-of-Adapters with Cross-Layer Generalization
- [NAACL'25] SimSMoE: Toward Efficient Training Mixture of Experts via Solving Representational Collapse
- [Mobicom'25] D2MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
- [arxiv'25] MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
- [arxiv'25] Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
- [arxiv'25] Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models
- [arxiv'25] Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
- [arxiv'25] MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
- [arxiv'25] C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
- [arxiv'25] Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models
- [arxiv'25] S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning
- [DAC'25] HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
- [arxiv'25] Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations
- [arxiv'25] HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs
- [TKDE'25] A Survey on Mixture of Experts
- [ICLR'25] NetMoE: Accelerating MoE Training through Dynamic Sample Placement
- [arxiv'25] ProMoE: Fast MoE-based LLM Serving using Proactive Caching
- [arxiv'25] Mixture of Lookup Experts
- [EuroSys'25] Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores
- [EuroMLSys'25] Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
- [EuroMLSys'25] Accelerating MoE Model Inference with Expert Sharding
- [arxiv'25] eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
- [KDD'25] ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
- [arxiv'25] Continual Pre-training of MoEs: How robust is your router?
- [arxiv'25] Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
- [arxiv'25] Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
- [MLSys'25] Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- [arxiv'25] CoSMoEs: Compact Sparse Mixture of Experts
- [CVPR'25] DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models
- [ASPLOS'25] CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory
- [arxiv'25] Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- [arxiv'25] BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference
- [arxiv'25] DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs
- [arxiv'25] MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing
- [arxiv'25] Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
- [arxiv'25] Fair-MoE: Fairness-Oriented Mixture of Experts in Vision-Language Models
- [arxiv'25] fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving
- [TPDS'25] EfficientMoE: Optimizing Mixture-of-Experts Model Training with Adaptive Load Balance
- [arxiv'25] Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism
- [NAACL'25] MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs
- [arxiv'25] BTS: Harmonizing Specialized Experts into a Generalist LLM
- [ASPLOS'25] FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
- [arxiv'25] Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
- [MICRO'24] SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
- [TPDS'24] MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
  - Journal version of [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [arxiv'24] DeepSeek-V3 Technical Report
- [arxiv'24] HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy
- [arxiv'24] Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
- [arxiv'24] ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
- [Survey 🔍] [arxiv'24] A Survey on Inference Optimization Techniques for Mixture of Experts Models
- [arxiv'24] DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
- [arxiv'24] Llama 3 Meets MoE: Efficient Upcycling
- [arxiv'24] Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
- [arxiv'24] Mixture of A Million Experts
- [arxiv'24] MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems
- [arxiv'24] Toward Inference-optimal Mixture-of-Expert Large Language Models
- [arxiv'24] Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection
- [MLArchSys'24 @ ISCA'24] MoE-ERAS: Expert Residency Aware Selection
- [arxiv'24] MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
- [arxiv'24] Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing
- [arxiv'24] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- [COLM'24] Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
- [ME-FoMo @ ICLR'24] Scaling Laws for Fine-Grained Mixture of Experts
- [arxiv'24] UOE: Unlearning One Expert Is Enough For Mixture-of-experts LLMS
- [ML for Sys workshop @ NeurIPS'24] IFMoE: An Inference Framework Design for Fine-grained MoE
- [ML for Sys workshop @ NeurIPS'24] TurboMoE: Enhancing MoE Model Training with Smart Kernel-Fusion and Data Transformation
- [arxiv'24] Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts
- [arxiv'24] MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
- [EMNLP'24] MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning
- [EMNLP'24] Mixture of Diverse Size Experts
- [EMNLP'24] AdaMOE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
- [ACL'24] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
- [SoCC'24] MoEsaic: Shared Mixture of Experts
- [KDD'24] Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing
- [arxiv'24] Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
- [IPDPS'24] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- [arxiv'24] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
- [arxiv'24] Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
- [NeurIPS'24] Toward Efficient Inference for Mixture of Experts
- [arxiv'24] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
- [SC'24] APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes
- [NeurIPS'24] GraphMETRO: Mitigating Complex Graph Distribution Shifts via Mixture of Aligned Experts
- [arxiv'24] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
- [arxiv'24] Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
- [NeurIPS'24] LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
- [arxiv'24] Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
- [NeurIPS'24] Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
- [arxiv'24] Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
- [PML4LRS @ ICLR'24] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- [arxiv'24] Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
- [arxiv'24] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
- [arxiv'24] Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
- [arxiv'24] MoH: Multi-Head Attention as Mixture-of-Head Attention
- [arxiv'24] AT-MoE: Adaptive Task-planning Mixture of Experts via LoRA Approach
- [NeurIPS'24 (Spotlight)] Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
- [arxiv'24] Aria: An Open Multimodal Native Mixture-of-Experts Model
- [arxiv'24] MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More
- [arxiv'24] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
- [arxiv'24] Upcycling Large Language Models into Mixture of Experts
- [arxiv'24] No Need to Talk: Asynchronous Mixture of Language Models
- [arxiv'24] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
- [arxiv'24] HMoE: Heterogeneous Mixture of Experts for Language Modeling
- [arxiv'24] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
- [arxiv'24] AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies
- [arxiv'24] Layerwise Recurrent Router for Mixture-of-Experts
- [arxiv'24] Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
- [SRW @ ACL'24] MoExtend: Tuning New Experts for Modality and Task Extension
- [arxiv'24] MoDE: Effective Multi-task Parameter Efficient Fine-Tuning with a Mixture of Dyadic Experts
- [arxiv'24] Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
- [arxiv'24] Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
- [ICML'24] Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
- [MLSys'24] QMoE: Sub-1-Bit Compression of Trillion-Parameter Models
- [arxiv'24] CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
- [arxiv'24] AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts
- [SIGIR'24] M3oE: Multi-Domain Multi-Task Mixture-of-Experts Recommendation Framework
- [EuroSys'24] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
- [arxiv'24] MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts
- [ICLR'24] Mixture of LoRA Experts
- [arxiv'24] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- [arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
- [IJCAI'24] LocMoE: A Low-overhead MoE for Large Language Model Training
- [ISCA'24] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
- [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [EMNLP'23] Adaptive Gating in Mixture-of-Experts based Language Models
- [ICLR'23] Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
- [arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
- [arxiv'23] Fast Inference of Mixture-of-Experts Language Models with Offloading
- [ATC'23] Accelerating Distributed MoE Training and Inference with Lina
- [OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm
- [SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
- [ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- [MLSys'23] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- [MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
- [arxiv'22] ST-MoE: Designing Stable and Transferable Sparse Expert Models
- [PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- [SustaiNLP @ EMNLP'22] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production
- [NeurIPS'22] Mixture-of-Experts with Expert Choice Routing
- [ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- [ICML'22] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- [JMLR'22] Switch transformers: scaling to trillion parameter models with simple and efficient sparsity
- [EMNLP'21] Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
- [ICLR'17] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- [EuroSys'26] Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering
- [arxiv'25] Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
- [arxiv'25] FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
- [arxiv'25] GPU-Initiated Networking for NCCL
- [SC'25] CPU- and GPU-initiated Communication Strategies for Conjugate Gradient Methods on Large GPU Clusters
- [SC'25] SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication
- [HotNets'25] Photonic Rails in ML Datacenters
- [arxiv'25] DMA Collectives for Efficient ML Communication Offloads
- [arxiv'25] Collective Communication for 100k+ GPUs
- [arxiv'25] Uno: A One-Stop Solution for Inter- and Intra-Datacenter Congestion Control and Reliable Connectivity
- [SOSP'25] Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
- [MICRO'25] SuperMesh: Energy-Efficient Collective Communications for Accelerators
- [MICRO'25] SkipReduce: (Interconnection) Network Sparsity to Accelerate Distributed Machine Learning
- [MICRO'25] Optimizing All-to-All Collective Communication with Fault Tolerance on Torus Networks
- [arxiv'25] MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications
- [arxiv'25] Toward Co-adapting Machine Learning Job Shape and Cluster Topology
- [APNET'25] Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization
- [arxiv'25] Efficient AllReduce with Stragglers
- [arxiv'25] TASP: Topology-aware Sequence Parallelism
- [NAIC @ SIGCOMM'25] Chronos: Prescheduled circuit switching for LLM training
- [arxiv'25] Bine Trees: Enhancing Collective Operations by Optimizing Communication Locality
- [SIGCOMM'25] Falcon: A Reliable, Low Latency Hardware Transport
- [SIGCOMM'25] ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs
- [SIGCOMM'25] From ATOP to ZCube: Automated Topology Optimization Pipeline and A Highly Cost-Effective Network Topology for Large Model Training
- [SIGCOMM'25] Astral: A Datacenter Infrastructure for Large Language Model Training at Scale
- [SIGCOMM'25] ResCCL: Resource-Efficient Scheduling for Collective Communication
- [OSDI'25] ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization
- [OSDI'25] Enabling Efficient GPU Communication over Multiple NICs with FuseLink
- [arxiv'25] RoCE BALBOA: Service-enhanced Data Center RDMA for SmartNICs
- [arxiv'25] RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems
- [arxiv'25] Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
- [APNET'25] Congestion Control for AI Workloads with Message-Level Signaling
- [ASPLOS'25] Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning
- [ISCA'25] Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models
- [arxiv'25] NoLoCo: No-all-reduce Low Communication Training Method for Large Models
- [arxiv'25] TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
- [arxiv'25] FLASH: Fast All-to-All Communication in GPU Clusters
- [arxiv'25] MCMComm: Hardware-Software Co-Optimization for End-to-End Communication in Multi-Chip-Modules
- [arxiv'25] GenTorrent: Scaling Large Language Model Serving with An Overlay Network
- [arxiv'25] Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
- [arxiv'25] FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation
- [arxiv'25] An Extensible Software Transport Layer for GPU Networking (UCCL) [Code]
- [HPCA'25] Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization
- [arxiv'25] HeteroPod: XPU-Accelerated Infrastructure Offloading for Commodity Cloud-Native Applications
- [Survey 🔍] [arxiv'25] GPU-centric Communication Schemes for HPC and ML Applications
- [EuroMLSys'25] TAGC: Optimizing Gradient Communication in Distributed Transformer Training
- [arxiv'25] UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture
- [MLSys'25] TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
- [arxiv'25] Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
- [NSDI'25] AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training
- [NSDI'25] Efficient Direct-Connect Topologies for Collective Communications
- [arxiv'25] InfinitePOD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers
- [IEEE MICRO'25] Understanding and Characterizing Communication Characteristics for Distributed Transformer Models
- [arxiv'25] In-Network Preprocessing of Recommender Systems on Multi-Tenant SmartNICs
- [arxiv'25] Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
- [arxiv'25] The Power of Negative Zero: Datatype Customization for Quantized Large Language Models
- [arxiv'25] mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training
- [NSDI'25] OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
- [APNET'24] Understanding Communication Characteristics of Distributed Training
- [arxiv'24] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication
- [arxiv'24] The Landscape of GPU-Centric Communication
- [arxiv'24] Revisiting the Time Cost Model of AllReduce
- [arxiv'24] LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
- [HotInfra'24] Immediate Communication for Distributed AI Tasks
- [NeurIPS'24] SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training
- [SC'24] Optimizing Distributed ML Communication with Fused Computation-Collective Operations
- [SC'24] Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
- [NeurIPS'24] LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
- [arxiv'24] LumosCore: Highly Scalable LLM Clusters with Optical Interconnect
- [TPDS'24] AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost
- [HOTI'24] Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives
- [HOTI'24] Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters
- [SC'24] Switch-Less Dragonfly on Wafers: A Scalable Interconnection Architecture based on Wafer-Scale Integration
- [HPDC'24] Near-Optimal Wafer-Scale Reduce
- [HPDC'24] Efficient all-to-all Collective Communication Schedules for Direct-connect Topologies
- [arxiv'24] HiCCL: A Hierarchical Collective Communication Library
- [ICS'24] gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
- [ICS'24] Snoopie: A Multi-GPU Communication Profiler and Visualizer
- [arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
- [arxiv'24] Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
- [arxiv'24] Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
- [arxiv'24] Demystifying the Communication Characteristics for Distributed Transformer Models
- [ICPP'24] Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning
- [NAIC @ SIGCOMM'24] Proof-of-Concept of a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
- [NAIC @ SIGCOMM'24] Eloquent: A More Robust Transmission Scheme for LLM Token Streaming
- [NAIC @ SIGCOMM'24] OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs
- [HotNets'24] I've Got 99 Problems But FLOPS Ain't One
- [HotNets'24] MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning
- [HotNets'22] Congestion Control in Machine Learning Clusters
- [SIGCOMM'24] Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
- [SIGCOMM'24] RDMA over Ethernet for Distributed Training at Meta Scale
- [SIGCOMM'24] Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
- [SIGCOMM'24] MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud
- [SIGCOMM'24] Crux: GPU-Efficient Communication Scheduling for Deep Learning Training
- [arxiv'24] MLTCP: Congestion Control for DNN Training
- [arxiv'24] ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics
- [ICLR'24] ZeRO++: Extremely Efficient Collective Communication for Large Model Training
- [ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [arxiv] [openreview]
- [MLSys'24] L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning
- [MLSys'24] Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
- [ASPLOS'24] T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
- [ASPLOS'24] TCCL: Discovering Better Communication Paths for PCIe GPU Clusters
- [ASPLOS'24] Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
- [ASPLOS'24] Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM
- [NSDI'24] THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
- [Survey 🔍] [arxiv'23] Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
- [arxiv'23] Optimized Network Architectures for Large Language Model Training with Billions of Parameters
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Zen: Near-Optimal Sparse Tensor Synchronization for Distributed DNN Training
- [arxiv'23] TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training
- [INFOCOM'23] Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks
- [ICDCS'23] bbTopk: Bandwidth-Aware Sparse Allreduce with Blocked Sparsification for Efficient Distributed Training
- [ICML'23] CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
- Related to DT-FM (NeurIPS'22)
- [IPDPS'23] MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
- [ASPLOS'23] MSCCLang: Microsoft Collective Communication Language
- [ASPLOS'23] Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
- [EuroSys'23] A2TP: Aggregator-aware In-network Aggregation for Multi-tenant Learning
- [MLSys'23] Cupcake: A Compression Optimizer for Scalable Communication-Efficient Distributed Training
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- [NSDI'23] Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
- [NSDI'23] TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
- [NSDI'23] ARK: GPU-driven Code Execution for Distributed Deep Learning
- [EuroSys'22] Out-of-order backprop: an effective scheduling technique for deep learning
- [ISCA'22] Themis: a network bandwidth-aware collective scheduling policy for distributed training of DL models
- [ISCA'22] Software-hardware co-design for fast and scalable training of deep learning recommendation models
- [SC'22] HammingMesh: A Network Topology for Large-Scale Deep Learning
- [PPoPP'22] Near-optimal sparse allreduce for distributed deep learning
- [MLSys'22] Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning (P^2)
- [ASPLOS'22] Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads (CoCoNET)
- [EuroSys'21] DGCL: an efficient communication library for distributed GNN training
- [ICLR'21] Multi-Level Local SGD for Heterogeneous Hierarchical Networks
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.5]
- [SC'21] Flare: flexible in-network allreduce
- [NSDI'21] Scaling Distributed Machine Learning with In-Network Aggregation
- [ISCA'21] Enabling compute-communication overlap in distributed deep learning training platforms
- [PPoPP'21] Synthesizing optimal collective algorithms (SCCL)
- [SIGCOMM'21] SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training
- [ISCA'20] An in-network architecture for accelerating shared-memory multiprocessor collectives
- [NeurIPS'20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- [PPoPP'20] Taming unbalanced training workloads in deep learning with partial collective operations
- [MLSys'20] Blink: Fast and Generic Collectives for Distributed ML
- [MLSys'20] PLink: Discovering and Exploiting Datacenter Network Locality for Efficient Cloud-based Distributed Training
- [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
- [MLSys'19] Priority-based Parameter Propagation for Distributed DNN Training (P3)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [SOSP'19] A generic communication scheduler for distributed DNN training acceleration (ByteScheduler)
- [ATC'17] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
- [arxiv'25] FFTrainer: Fast Failover in Large-Language Model Training with Almost-Free State Management
- [arxiv'25] FailSafe: High-performance Resilient Serving
- [arxiv'25] GoCkpt: Gradient-Assisted Multi-Step overlapped Checkpointing for Efficient LLM Training
- [MICRO'25] Optimizing All-to-All Collective Communication with Fault Tolerance on Torus Networks
- [APSys'25] Indispensable CPU-centric Checkpointing for GPUs
- [CLUSTER'25] Capricorn: Efficient In-Memory Checkpointing for MoE Model Training with Dynamicity Awareness
- [arxiv'25] MoE-PHDS: One MoE checkpoint for flexible runtime sparsity
- [arxiv'25] ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training
- [arxiv'25] Efficient AllReduce with Stragglers
- [SOSP'25] Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
- [SOSP'25] Robust LLM Training Infrastructure at ByteDance
- [SC'25] LowDiff: Efficient Frequent Checkpointing via Low-Cost Differential for High-Performance Distributed Training Systems
- [OSDI'25] Understanding Stragglers in Large Model Training Using What-if Analysis
- [SIGMOD'25] Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
- [arxiv'25] Checkmate: Zero-Overhead Model Checkpointing via Network Gradient Replication
- [ATC'25] SAVE: Software-Implemented Fault Tolerance for Model Inference against GPU Memory Bit Flips
- [ATC'25] Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism
- [arxiv'25] Adaptra: Straggler-Resilient Hybrid-Parallel Training with Pipeline Adaptation
- [arxiv'25] Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
- [arxiv'25] Characterizing GPU Resilience and Impact on AI/HPC Systems
- [NSDI'25] BCP: A Unified Checkpointing System for Large Foundation Model Development
- [NSDI'25] Minder: Faulty Machine Detection for Large-scale Distributed Model Training
- [EuroSys'25] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
- [ASPLOS'25] PCcheck: Persistent Concurrent Checkpointing for ML
- [arxiv'24] FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
- [arxiv'24] MoEtion: Efficient and Reliable Checkpointing for Mixture-of-Experts Models at Scale
- [arxiv'24] MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
- [arxiv'24] TrainMover: Efficient ML Training Live Migration with No Memory Overhead
- [arxiv'24] Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
- [arxiv'24] ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
- [arxiv'24] Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
- [arxiv'24] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
- [arxiv'24] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
- [SOSP'24] ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
- [HPDC'24] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
- [EuroSys'24] Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures
- [NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
- [arxiv'23] Unicron: Economizing Self-Healing LLM Training at Scale
- [VLDB'23] Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
- [SOSP'23] GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
- [SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
- [NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- [EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
- [ATC'22] Sibylla: To Retry or Not To Retry on Deep Learning Job Failure
- [MLSys'21] Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
- [FAST'21] CheckFreq: Frequent, Fine-Grained DNN Checkpointing
- [ICSE'20] An Empirical Study on Program Failures of Deep Learning Jobs
- [SC'25] HELM: Characterizing Unified Memory Accesses to Improve GPU Performance under Memory Oversubscription
- [SC'25] MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
- [arxiv'25] CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator
- [arxiv'25] Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training
- [ISCA'25] Forest: Access-aware GPU UVM Management
- [EuroSys'25] MEPipe: Democratizing LLM Training with Memory-Efficient Slice-Level Pipeline Scheduling on Cost-Effective Accelerators
- [EuroSys'25] Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
- [FAST'25 WiP] Baton: Orchestrating GPU Memory for LLM Training on Heterogeneous Cluster
- [CGO'25] IntelliGen: Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization
- [arxiv'25] Memory Analysis on the Training Course of DeepSeek Models
- [IJCAI'24] LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs
- [MICRO'24] SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
- [arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
- [TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
- [ICML'24] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- [ASPLOS'24] GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Quantized Distributed Training of Large Models with Convergence Guarantees (QSDP)
- [arxiv'23] Does compressing activations help model parallel training?
- [SoCC'23] Towards GPU Memory Efficiency for Distributed Training at Scale
- [VLDB'23] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [HPCA'23] MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism
- [HPCA'23] Tensor Movement Orchestration in Multi-GPU Training Systems
- [IJCAI'23] OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
- [ICLR'22] LoRA: Low-Rank Adaptation of Large Language Models
- algorithmic method for memory efficiency
- [VLDB'22] Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
- [ATC'21] ZeRO-Offload: Democratizing Billion-Scale Model Training
- [ICLR'21] ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
- [ICLR'21] Dynamic Tensor Rematerialization
- [SC'21] ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning
- [HPCA'21] Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning
- [MLSys'20] Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
- [ASPLOS'20] Capuchin: Tensor-based GPU Memory Management for Deep Learning
- [ASPLOS'20] SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
- [ESEC/FSE'20] Estimating GPU memory consumption of deep learning models
- [SC'20] ZeRO: memory optimizations toward training trillion parameter models
- [ISCA'18] Gist: Efficient Data Encoding for Deep Neural Network Training
- [PPoPP'18] Superneurons: dynamic GPU memory management for training deep neural networks
- [MICRO'16] vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
- [arxiv'16] Training Deep Nets with Sublinear Memory Cost
- [SC workshop'25] WAGES: Workload-Aware GPU Sharing System for Energy-Efficient Serverless LLM Serving
- [SOSP'25] LithOS: An Operating System for Efficient Machine Learning on GPUs
- [arxiv'25] Towards Efficient and Practical GPU Multitasking in the Era of LLM
- [arxiv'25] Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
- [OSDI'25] XSched: Preemptive Scheduling for Diverse XPUs
- [EuroSys'25] Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing
- [PPOPP'25] SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs
- [arxiv'24] PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
- [SC'24] ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments
- [arxiv'24] Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads
- [ICPP'24] MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters
- [ASPLOS'24] RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing
- [EuroSys'24] Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
- [ATC'23] Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent
- [NSDI'23] Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
- [ICPP'23] FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference
- [arxiv'23] GACER: Granularity-Aware ConcurrEncy Regulation for Multi-Tenant Deep Learning
- [arxiv'23] MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
- [SoCC'22] MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
- [PACT'22] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
- [ATC'21] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
- [MLSys'20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
- [OSDI'20] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
- [OSDI'20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
- [RTAS'19] Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs
- [arxiv'25] Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References
- [arxiv'25] Dato: A Task-Based Programming Model for Dataflow Accelerators
- [arxiv'25] Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
- [NeurIPS'25] REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
- [SOSP'25] Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling
- [MICRO'25] StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs
- [OSDI'25] PipeThreader: Software-Defined Pipelining for Efficient DNN Execution
- [OSDI'25] QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach
- [OSDI'25] Mirage: A Multi-Level Superoptimizer for Tensor Programs
- [OSDI'25] KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads
- [arxiv'25] TileLang: A Composable Tiled Programming Model for AI Systems
- [arxiv'25] Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis
- [arxiv'25] DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training
- [ASPLOS'25] Mosaic: Exploiting Instruction-Level Parallelism on Deep Learning Accelerators with iTex Tessellation
- [ASPLOS'25] Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning
- [arxiv'25] Hercules: A Compiler for Productive Programming of Heterogeneous Systems
- [CC'25] LLM Compiler: Foundation Language Models for Compiler Optimization
- [CGO'25] IntelliGen: Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization
- [SOSP'24] Scaling Deep Learning Computation over the Inter-core Connected Intelligence Processor with T10
- [OSDI'23] Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning
- [OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
- [OSDI'23] Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
- [OSDI'23] EINNET: Optimizing Tensor Programs with Derivation-Based Transformations
- [OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm
- [OSDI'22] ROLLER: Fast and Efficient Tensor Compilation for Deep Learning
- [OSDI'20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
- [OSDI'20] Ansor: Generating High-Performance Tensor Programs for Deep Learning
- [ASPLOS'20] FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
- [OSDI'18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
- [EuroSys'26] Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering
- [arxiv'25] Flash Multi-Head Feed-Forward Network
- [arxiv'25] Iris: First-Class Multi-GPU Programming Experience in Triton
- [arxiv'25] AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
- [arxiv'25] ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
- [SC'25] HyTiS: Hybrid Tile Scheduling for GPU GEMM with Enhanced Wave Utilization and Cache Locality
- [SC'25] UltraAttn: Efficiently Parallelizing Attention through Hierarchical Context-Tiling
- [arxiv'25] HipKittens: Fast and Furious AMD Kernels
- [TACO'25] HuntKTm: Hybrid Scheduling and Automatic Management for Efficient Kernel Execution on Modern GPUs
- [NeurIPS'25] FlashMoE: Fast Distributed MoE in a Single Kernel
- [MLSys'25] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- [arxiv'25] LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving
- [arxiv'25] TileLang: A Composable Tiled Programming Model for AI Systems
- [PLDI'25] Task-Based Tensor Computations on Modern GPUs
- [TACO'25] Kitsune: Enabling Dataflow Execution on GPUs
- [ICLR'25] ThunderKittens: Simple, Fast, and Adorable Kernels
- [ASPLOS'25] Composing Distributed Computations Through Task and Kernel Fusion
- [MLSys'25] FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference
- [arxiv'24] ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs
- [arxiv'24] Flex Attention: A Programming Model for Generating Optimized Attention Kernels
- [NeurIPS'24] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- [ICLR'24] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- [CGO'24] A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
- [RTAS'24] Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management
- slides: link
- [arxiv'23] Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
- [OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
- [arxiv'21] Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads
- [SIGMETRICS'21] Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels
- [NeurIPS'20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- [NeurIPS'22] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- [RTSS'17] GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed
- [SC'25] UltraAttn: Efficiently Parallelizing Attention through Hierarchical Context-Tiling
- [SC'25] RingX: Scalable Parallel Attention for Long-Context Learning on HPC
- [arxiv'25] Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism
- [NeurIPS'25] StarTrail: Concentric Ring Sequence Parallelism for Efficient Near-Infinite-Context Transformer Model Training
- [arxiv'25] Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
- [arxiv'25] Efficient Long-context Language Model Training by Core Attention Disaggregation
- [SOSP'25] DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism
- [arxiv'25] Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
- [arxiv'25] Strata: Hierarchical Context Caching for Long Context Language Model Serving
- [arxiv'25] TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving
- [ACL'25] MiniKV: Pushing the Limits of 2-Bit KV Cache via Compression and System Co-Design for Efficient Long Context Inference
- [arxiv'25] HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
- [arxiv'25] SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
- [arxiv'25] Training Long-Context LLMs Efficiently via Chunk-wise Optimization
- [arxiv'25] SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training
- [ASPLOS'25] FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
- [arxiv'25] XAttention: Block Sparse Attention with Antidiagonal Scoring
- [arxiv'25] SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading
- [arxiv'25] ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
- [arxiv'25] Long-Context Inference with Retrieval-Augmented Speculative Decoding
- [PODC'25] System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
- [arxiv'25] ParallelComp: Parallel Long-Context Compressor for Length Extrapolation
- [arxiv'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
- [arxiv'25] MoBA: Mixture of Block Attention for Long-Context LLMs
- [arxiv'25] Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
- [arxiv'25] APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
- [SIGMOD'25] MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
- [arxiv'25] Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning
- [arxiv'25] Adjoint sharding for very long context training of state space models
- [arxiv'24] LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System
- [arxiv'24] Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training
- [ICLR'24] Efficient Streaming Language Models with Attention Sinks [Code]
- [SOSP'24] LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
- [arxiv'24] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
- [arxiv'24] Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
- [NeurIPS'24 Workshop] Long Context RAG Performance of Large Language Models
- [arxiv'24] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
- [arxiv'24] Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
- [arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
- [COLM'24] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
- [arxiv'24] FocusLLM: Scaling LLM's Context by Parallel Decoding
- [Survey 🔍] [IJCAI'24] X-former Elucidator: Reviving Efficient Attention for Long Context Language Modeling
For a comprehensive list of quantization papers, refer to https://github.com/Efficient-ML/Awesome-Model-Quantization.
- [arxiv'25] Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
- [EMNLP'25] Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems
- [NeurIPS'25] 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
- [arxiv'25] MergeMoE: Efficient Compression of MoE Models via Expert Output Merging
- [CLUSTER'25] SplitQuant: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and Adaptive Quantization
- [JMLR'25] BitNet: 1-bit Pre-training for Large Language Models
- [OSDI'25] DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization
- [arxiv'25] TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network
- [arxiv'25] DECA: A Near-Core LLM Decompression Accelerator Supporting Out-of-Order Invocation
- [arxiv'25] ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition
- [ISCA'25] Transitive Array: An Efficient GEMM Accelerator with Result Reuse
- [arxiv'24] Accelerating Distributed Deep Learning using Lossless Homomorphic Compression
- [ICML'24] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
- [ACL'23] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
- [ICLR'23] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- [OSDI'23] AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models
- [EuroSys'23] Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies
- [ICML'22] TSPipe: Learn from Teacher Faster with Pipelines
- [VLDB'25] PS-MI: Accurate, Efficient, and Private Data Valuation in Vertical Federated Learning
- [arxiv'24] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
- [MLSys'24] LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning
- [arxiv'24] FedEx: Expediting Federated Learning over Heterogeneous Mobile Devices by Overlapping and Participant Selection
- [KDD'24] FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model
- [CCGrid'24] Apodotiko: Enabling Efficient Serverless Federated Learning in Heterogeneous Environments
- [EuroSys'24] Dordis: Efficient Federated Learning with Dropout-Resilient Differential Privacy
- [arxiv'24] Decoupled Vertical Federated Learning for Practical Training on Vertically Partitioned Data
- [SAC'24] Training Heterogeneous Client Models using Knowledge Distillation in Serverless Federated Learning
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [arxiv'23] Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization
- [IMWUT'23] AttFL: A Personalized Federated Learning Framework for Time-series Mobile and Embedded Sensor Data Processing
- [Survey 🔍] [FGCS'23] Model aggregation techniques in federated learning: A comprehensive survey
- [SoCC'23] Auxo: Heterogeneity-Mitigating Federated Learning via Scalable Client Clustering
- [MLSys'23] GlueFL: Reconciling Client Sampling and Model Masking for Bandwidth Efficient Federated Learning
- [WWW'23] To Store or Not? Online Data Selection for Federated Learning with Limited Storage
- [EuroSys'23] REFL: Resource-Efficient Federated Learning
- [VLDB'23] FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
- [RecSys'22] Towards Fair Federated Recommendation Learning: Characterizing the Inter-Dependence of System and Data Heterogeneity
- [TMLR'22] Optimal Client Sampling for Federated Learning
- [ICML'22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale
- [MobiSys'22] FedBalancer: data and pace control for efficient federated learning on heterogeneous clients
- [MobiCom'22] PyramidFL: A Fine-grained Client Selection Framework for Efficient Federated Learning
- [MLSys'22] PAPAYA: Practical, Private, and Scalable Federated Learning
- [AISTATS'22] Federated Learning with Buffered Asynchronous Aggregation
- [NeurIPS'21] Federated Reconstruction: Partially Local Federated Learning
- [NeurIPS'21] FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout
- [OSDI'21] Oort: Efficient Federated Learning via Guided Participant Selection
- [MICRO'21] AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning
- [MLSys'19] Towards Federated Learning at Scale: System Design
- [Survey 🔍] [ACM CSUR'22] Federated Learning for Smart Healthcare: A Survey
- [CCS'25] MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs
- [USENIX Security'25] Phantom: Privacy-Preserving Deep Neural Network Model Obfuscation in Heterogeneous TEE and GPU System
- [ASPLOS'24] LazyDP: Co-Designing Algorithm-Software for Scalable Training of Differentially Private Recommendation Models
- [NeurIPS'24] Nimbus: Secure and Efficient Two-Party Inference for Transformers
- [ACL'24] SecFormer: Fast and Accurate Privacy-Preserving Inference for Transformer Models via SMPC
- [S&P'24] BOLT: Privacy-Preserving, Accurate and Efficient Inference for Transformers
- [DAC'23] Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators
- [ICLR'23] MPCFormer: fast, performant and private Transformer inference with MPC
- [NeurIPS'22] Iron: Private Inference on Transformers
- [ASPLOS'25] Towards End-to-End Optimization of LLM-based Applications with Ayo
- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [OSDI'24] ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications
- [ICML'22] Efficient Online ML API Selection for Multi-Label Classification Tasks (FrugalMCT)
- [NeurIPS'20] FrugalML: How to use ML Prediction APIs more accurately and cheaply
- [arxiv'25] AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
- [arxiv'25] ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training
- [NeurIPS'25] REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
- [arxiv'25] Barbarians at the Gate: How AI is Upending Systems Research [Code]
- [arxiv'25] SuperCoder: Assembly Program Superoptimization with Large Language Models
- [HotOS'25] How I learned to stop worrying and love learned OS policies
- [VLDB'25] E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model
- [SenSys'25] CheckMate: LLM-Powered Approximate Intermittent Computing
- [ICSE'25] Large Language Models as Configuration Validators
- [NeurIPS'24] IaC-Eval: A code generation benchmark for Infrastructure-as-Code programs
- [arxiv'24] Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
- [arxiv'24] LLMTune: Accelerate Database Knob Tuning with Large Language Models
- [SIGCOMM'24] NetLLM: Adapting Large Language Models for Networking
- [arxiv'24] LLM-Enhanced Data Management
- [arxiv'24] MPIrigen: MPI Code Generation through Domain-Specific Language Models
- [arxiv'24] Can Large Language Models Write Parallel Code?
- [arxiv'23] LLM-Assisted Code Cleaning For Training Accurate Code Generators
- [arxiv'23] Large Language Models for Compiler Optimization
- [VLDB'23] How Large Language Models Will Disrupt Data Management
- [NeurIPS'25] CATransformers: Carbon Aware Transformers Through Joint Model-Hardware Optimization
- [MICRO'25] SuperMesh: Energy-Efficient Collective Communications for Accelerators
- [MICRO'25] Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
- [arxiv'25] VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
- [arxiv'25] GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving
- [arxiv'25] Power Stabilization for AI Training Datacenters
- [arxiv'25] The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization
- [arxiv'25] EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
- [NSDI'25] GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters
- [HPCA'25] throttLL'eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
- [arxiv'25] EcoServe: Designing Carbon-Aware AI Inference Systems
- [arxiv'25] Life-Cycle Emissions of AI Hardware: A Cradle-To-Grave Approach and Generational Trends
- [arxiv'24] GreenLLM: Disaggregating Large Language Model Serving on Heterogeneous GPUs for Lower Carbon Emissions
- [arxiv'24] EaCO: Resource Sharing Dynamics and Its Impact on Energy Efficiency for DNN Training
- [arxiv'24] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- [SOSP'24] Perseus: Removing Energy Bloat from Large Model Training
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [ATC'23] EnvPipe: Performance-preserving DNN Training Framework for Saving Energy
- [NSDI'23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
- [ICDE'25] SAGE: A Framework of Precise Retrieval for RAG
- [SOSP'25] HedraRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows
- [ISCA'25] HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
- [arxiv'25] Patchwork: A Unified Framework for RAG Serving
- [arxiv'25] RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
- [arxiv'25] Long-Context Inference with Retrieval-Augmented Speculative Decoding
- [VLDB'25] Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
- [arxiv'24] Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference
- [arxiv'24] RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation
- [arxiv'24] Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation
- [arxiv'24] Accelerating Retrieval-Augmented Language Model Serving with Speculation
- [NeurIPS'24 Workshop] Long Context RAG Performance of Large Language Models
- [arxiv'25] Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs
- [MICRO'25] PyTorchSim: A Comprehensive, Fast, and Accurate NPU Simulation Framework
- [MICRO'25] Swift and Trustworthy Large-Scale GPU Simulation with Fine-Grained Error Modeling and Hierarchical Clustering
- [arxiv'25] Frontier: Simulating the Next Generation of LLM Inference Systems
- [NAIC @ SIGCOMM'25] MLSynth: Towards Synthetic ML Traces
- [NAIC @ SIGCOMM'25] Simulating LLM training workloads for heterogeneous compute and network infrastructure
- [arxiv'25] Maya: Optimizing Deep Learning Training Workloads using Emulated Virtual Accelerators
- [NSDI'25] Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation
- [ASPLOS'25] Forecasting GPU Performance for Deep Learning Training and Inference
- [MLSys'24] Vidur: A Large-Scale Simulation Framework For LLM Inference
- [arxiv'25] Measuring Agents in Production
- [arxiv'25] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
- [arxiv'25] Aragog: Just-in-Time Model Routing for Scalable Serving of Agentic Workflows
- [arxiv'25] AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
- [ML for Systems @ NeurIPS'25] Agentic Bridge Framework: Closing the Gap Between Agentic Capability and Performance Benchmarks
- [arxiv'25] Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
- [arxiv'25] Sherlock: Reliable and Efficient Agentic Workflow Execution
- [arxiv'25] A CPU-Centric Perspective on Agentic AI
- [SAA'25] Useful Agentic AI: A Systems Outlook
- [SAA'25] Toward Systems Foundations for Agentic Exploration
- [SAA'25] Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First
- [SAA'25] Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving
- [SAA'25] Tetris: Efficient and Predictive KV Cache Offloading for Agentic and Reasoning Workloads
- [SAA'25] GPU Memory Prediction for Multimodal Model Training
- [SAA'25] DMAS-Forge: A Framework for Transparent Deployment of AI Applications as Distributed Systems
- [SAA'25] Automated Annotation Inference for MCP-based Agents
- [SAA'25] EARL: Efficient Agentic Reinforcement Learning Systems for Large Language Models
- [SAA'25] Unified Agentic Interfaces is All You Need for AI Agent Observability
- [arxiv'25] Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution
- [arxiv'25] MobiAgent: A Systematic Framework for Customizable Mobile Agents
- [ICML'25] The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models
- [SIGCOMM'25] Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework
- [arxiv'25] rStar2-Agent: Agentic Reasoning Technical Report
- [COLM'25] R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents
- [arxiv'25] Efficient and Scalable Agentic AI with Heterogeneous Systems
- [arxiv'25] Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC
- [arxiv'25] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
- [ASPLOS'25] ReCA: Integrated Acceleration for Real-Time and Efficient Cooperative Embodied Autonomous Agents
- [arxiv'25] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
- [arxiv'24] AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
- [ICML'24] AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls
- [arxiv'25] ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models
- [arxiv'25] RLHFSpec: Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting
- [arxiv'25] Fast LLM Post-training via Decoupled and Best-of-N Speculation
- [arxiv'25] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
- [arxiv'25] Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
- [arxiv'25] WeChat-YATT: A Scalable, Simple, Efficient, and Production Ready Training Library
- [arxiv'25] The Path Not Taken: RLVR Provably Learns Off the Principals
- [arxiv'25] AReaL-Hex: Accommodating Asynchronous RL Training over Heterogeneous GPUs
- [NeurIPS'25] Greedy Sampling Is Provably Efficient for RLHF
- [arxiv'25] Ask a Strong LLM Judge when Your Reward Model is Uncertain
- [arxiv'25] RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
- [arxiv'25] Laminar: A Scalable Asynchronous RL Post-Training Framework
- [arxiv'25] The Art of Scaling Reinforcement Learning Compute for LLMs
- [arxiv'25] xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning
- [arxiv'25] Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
- [arxiv'25] Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
- [arxiv'25] Spurious Rewards: Rethinking Training Signals in RLVR
- [arxiv'25] Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
- [arxiv'25] RL in the Wild: Characterizing RLVR Training in LLM Deployment
- [arxiv'25] APRIL: Active Partial Rollouts in Reinforcement Learning to tame long-tail generation
- [NeurIPS'25] AReaL: Asynchronous Reinforcement Learning for Efficient and Scalable Language Reasoning
- [arxiv'25] ToRL: Scaling Tool-Integrated RL
- [arxiv'25] VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
- [arxiv'25] Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
- [Survey 🔍] [arxiv'25] A Survey of Reinforcement Learning for Large Reasoning Models
- [arxiv'25] RewardDance: Reward Scaling in Visual Generation
- [arxiv'25] floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL
- [arxiv'25] ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
- [arxiv'25] History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL
- [COLM'25] Sample Efficient Preference Alignment in LLMs via Active Exploration
- [COLM'25] Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
- [arxiv'25] SeamlessFlow: A Trainer Agent Isolation RL Framework Achieving Bubble-Free Pipelines via Tag Scheduling
- [arxiv'25] SPECS: Faster Test-Time Scaling through Speculative Drafts
- [arxiv'25] Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models
- [COLM'25] Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback
- [arxiv'25] ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
- [IPDPS'25] FlexRLHF: A Flexible Placement and Parallelism Framework for Efficient RLHF Training
- [arxiv'25] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
- [ACL'25] RLKGF: Reinforcement Learning from Knowledge Graph Feedback Without Human Annotations
- [arxiv'25] Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
- [arxiv'25] Scaling RL to Long Videos
- [arxiv'25] Test-Time Training Done Right
- [arxiv'25] LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training
- [arxiv'25] On-Policy RL with Optimal Reward Baseline
- [arxiv'25] StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
- [arxiv'25] DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- [MLSys'25] ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
- [arxiv'25] Reward Reasoning Model
- [arxiv'24] Optimizing RLHF Training for Large Language Models with Stage Fusion
For a comprehensive list of multimodal papers, refer to https://github.com/friedrichor/Awesome-Multimodal-Papers.
- [arxiv'25] MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
- [SoCC'25] ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
- [arxiv'25] FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
- [arxiv'25] OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
- [arxiv'25] Fast-dLLM v2: Efficient Block-Diffusion LLM
- [arxiv'25] Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
- [arxiv'25] Mordal: Automated Pretrained Model Selection for Vision Language Models
- [arxiv'25] Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
- [arxiv'24] LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
- [Survey 🔍] [arxiv'24] A Survey of Resource-efficient LLM and Multimodal Foundation Models
- [MICRO'25] HLX: A Unified Pipelined Architecture for Optimized Performance of Hybrid Transformer-Mamba Language Models
- [MLSys'25] Marconi: Prefix Caching for the Era of Hybrid LLMs
- [arxiv'25] Cyclotron: Compilation of Recurrences to Distributed and Systolic Architectures
- [arxiv'25] Streaming Tensor Program: A streaming abstraction for dynamic parallelism
- [arxiv'25] OckBench: Measuring the Efficiency of LLM Reasoning
- [SC workshop'25] Roofline Analysis of Tightly-Coupled CPU-GPU Superchips: A Study on MI300A and GH200
- [NeurIPS'25] Spark Transformer: Reactivating Sparsity in FFN and Attention
- [MICRO'25] ORCHES: Orchestrated Test-Time-Compute-based LLM Reasoning on Collaborative GPU-PIM HEterogeneous System
- [arxiv'25] vAttention: Verified Sparse Attention
- [USENIX ;login:] Wafer-Scale AI Compute: A System Software Perspective
- [arxiv'25] Training Large Language Models To Reason In Parallel With Global Forking Tokens
- [arxiv'25] How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models
- [arxiv'25] Slm-mux: Orchestrating small language models for reasoning
- [arxiv'25] Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
- [arxiv'25] Less is More: Recursive Reasoning with Tiny Networks
- [arxiv'25] ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
- [arxiv'25] Rethinking Thinking Tokens: LLMs as Improvement Operators
- [arxiv'25] Generalized Parallel Scaling with Interdependent Generations
- [arxiv'25] Composer: A Search Framework for Hybrid Neural Architecture Design
- [arxiv'25] dParallel: Learnable Parallel Decoding for dLLMs
- [NeurIPS'25] Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
- [arxiv'25] AI Factories: It's time to rethink the Cloud-HPC divide
- [arxiv'25] Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving
- [arxiv'25] SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences
- [arxiv'25] Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs
- [arxiv'25] LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
- [arxiv'25] DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
- [VLDB'25] Powerful GPUs or Fast Interconnects: Analyzing Relational Workloads on Modern GPUs
- [arxiv'25] Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning
- [arxiv'25] Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
- [arxiv'25] LobRA: Multi-tenant Fine-tuning over Heterogeneous Data
- [arxiv'25] Copilot Arena: A Platform for Code LLM Evaluation in the Wild
- [arxiv'25] ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
- [MICRO'25] Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving
- [CFAgentic @ ICML'25] LLMSELECTOR: Learning to Select Models in Compound AI Systems
- [arxiv'25] Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication
- [arxiv'25] Prompt-to-Leaderboard: Prompt-Adaptive LLM Evaluations [Code]
- [ISCA'25] Meta’s Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
- [ISCA'25] Debunking the CUDA Myth Towards GPU-based AI Systems
- [ISCA'25] UGPU: Dynamically Constructing Unbalanced GPUs for Enhanced Resource Efficiency
- [arxiv'25] SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
- [arxiv'25] Reinforcement Pre-Training
- [arxiv'25] MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models
- [NSDI'25] Optimizing RLHF Training for Large Language Models with Stage Fusion
- [arxiv'25] Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately
- [arxiv'25] Faster Video Diffusion with Trainable Sparse Attention
- [arxiv'25] SSR: Speculative Parallel Scaling Reasoning in Test-time
- [arxiv'25] Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
- [arxiv'25] Think Only When You Need with Large Hybrid-Reasoning Models
- [MLSys'25] Optimizing LLM Queries in Relational Data Analytics Workloads
- [arxiv'25] Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
- [arxiv'25] Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads
- [arxiv'25] Process Reward Models That Think
- [arxiv'25] Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning
- [arxiv'25] Sleep-time Compute: Beyond Inference Scaling at Test-time
- [arxiv'25] SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
- [arxiv'25] Scaling Laws for Native Multimodal Models
- [arxiv'25] OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
- [arxiv'25] NotebookOS: A Notebook Operating System for Interactive Training with On-Demand GPUs
- [arxiv'25] Alchemist: Towards the Design of Efficient Online Continual Learning System
- [arxiv'25] Linear Attention for Efficient Bidirectional Sequence Modeling
- [arxiv'25] S*: Test Time Scaling for Code Generation
- [arxiv'25] Optimizing Model Selection for Compound AI Systems
- [arxiv'25] Efficient-vDiT: Efficient Video Diffusion Transformers with Attention Tile
- [arxiv'25] BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation
- [arxiv'25] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
- [arxiv'25] Adaptive Semantic Prompt Caching with VectorQ
- [EuroSys'25] HybridFlow: A Flexible and Efficient RLHF Framework
- [arxiv'25] Measuring GPU utilization one level deeper
- [ASPLOS'25] PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption
- [arxiv'24] Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
- [arxiv'24] Debunking the CUDA Myth Towards GPU-based AI Systems
- [arxiv'24] XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
- [CPAL'24 (PMLR)] Jaxpruner: A Concise Library for Sparsity Research
- [arxiv'24] Scorch: A Library for Sparse Deep Learning
- [arxiv'24] Drowning in Documents: Consequences of Scaling Reranker Inference
- [arxiv'24] Crafting Interpretable Embeddings for Language Neuroscience by Asking LLMs Questions
- [arxiv'24] Computational Bottlenecks of Training Small-scale Large Language Models
- [Survey 🔍] [arxiv'24] A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness
- [NeurIPS'24] Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
- [arxiv'24] Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
- [arxiv'24] DroidSpeak: Enhancing Cross-LLM Communication
- [arxiv'24] Disaggregating Embedding Recommendation Systems with FlexEMR
- [arxiv'24] JudgeBench: A Benchmark for Evaluating LLM-based Judges
- [arxiv'24] You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- [arxiv'24] Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native
- [ATC'24] Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor
- [arxiv'23] Efficiently Programming Large Language Models using SGLang
- [MICRO'23] Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads
- [arxiv'23] Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- [arxiv'22] Training language models to follow instructions with human feedback
This repository is motivated by:
- https://github.com/HuaizhengZhang/Awesome-System-for-Machine-Learning
- https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
- https://github.com/ganler/ResearchReading
- https://jeongseob.github.io/readings_mlsys.html
- https://github.com/chwan1016/awesome-gnn-systems
- https://github.com/ConnollyLeon/awesome-Auto-Parallelism