Paper list for broad topics in machine learning systems
NOTE: Survey papers are annotated with the [Survey 🔍] prefix.
- Paper List for Machine Learning Systems
- Table of Contents
- Data Processing
- Training System
- Inference System
- Attention Optimization
- Mixture of Experts (MoE)
- Communication Optimization & Network Infrastructure for Distributed ML
- Fault Tolerance & Straggler Mitigation
- GPU Memory Management & Optimization
- GPU Sharing
- Compiler
- GPU Kernel Optimization
- LLM Long Context
- Model Compression
- Federated Learning
- Privacy-Preserving ML
- ML APIs & Application-Side Optimization
- ML for Systems
- Energy Efficiency
- Retrieval-Augmented Generation (RAG)
- Simulation
- Systems for Agentic AI
- RL Post-Training
- Multimodal
- Hybrid LLMs
- Others
- References
General
- [arxiv'25] Scalable and Performant Data Loading
- [arxiv'25] OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training
- [arxiv'25] The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution
- [arxiv'25] In-Network Preprocessing of Recommender Systems on Multi-Tenant SmartNICs
- [VLDB'25] cedar: Composable and Optimized Machine Learning Input Data Pipelines
- [HotInfra'24] Lotus: Characterize Architecture Level CPU-based Preprocessing in Machine Learning Pipelines
- [arxiv'24] TensorSocket: Shared Data Loading for Deep Learning Training
- [arxiv'24] Efficient Tabular Data Preprocessing of ML Pipelines
- [MLSys'22] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
- [ISCA'22] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
- [SIGMOD'22] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
- [VLDB'21] Analyzing and Mitigating Data Stalls in DNN Training
- [VLDB'21] tf.data: A Machine Learning Data Processing Framework
Preprocessing stalls
- [arxiv'24] PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
- [ATC'24] Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement
- [HotStorage'24] A Selective Preprocessing Offloading Framework for Reducing Data Traffic in DL Training
- [VLDB'24] FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
- [arxiv'23] Rinas: Training with Dataset Shuffling Can Be General and Fast
- [CVPR'23] FFCV: Accelerating Training by Removing Data Bottlenecks
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [SIGMOD'23] GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning
- [VLDB'23] FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
- [SoCC'23] tf.data service: A Case for Disaggregating ML Input Data Processing
- [ATC'22] Cachew: Machine Learning Input Data Processing as a Service
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters
- [ICPP'19] DLBooster: Boosting End-to-End Deep Learning Workflows with Offloading Data Preprocessing Pipelines
Fetch stalls (I/O)
- [TACO'23] Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
- [ICPP'22] Lobster: Load Balance-Aware I/O for Distributed DNN Training
- [SC'21] Clairvoyant Prefetching for Distributed Machine Learning I/O
Specific workloads (GNN, DLRM)
- [VLDB'25] Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression
- [ISCA'24] PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models
- [arxiv'23] Towards Data-centric Graph Machine Learning: Review and Outlook
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [MLSys'23] RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
- [ASPLOS'22] RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [arxiv'23] MTrainS: Improving DLRM training efficiency using heterogeneous memories
- [SOSP'23] Bagpipe: Accelerating Deep Recommendation Model Training
- [SOSP'23] gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning
- [NSDI'23] BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
- [DAC'22] A Joint Management Middleware to Improve Training Performance of Deep Recommendation Systems with SSDs
- [VLDB'22] Accelerating Recommendation System Training by Leveraging Popular Choices
- [ATC'25] HyCache: Hybrid Caching for Accelerating DNN Input Preprocessing Pipelines
- [ICDE'25] MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage
- [TPDS'23] High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms
- [SOSP'23] UGACHE: A Unified GPU Cache for Embedding-based Deep Learning
- [ATC'23] Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 2.1]
- [FAST'23] SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
- [HPCA'23] iCACHE: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training
- [NeurIPS'22] A Deep Learning Dataloader with Shared Data Preparation
- [CLUSTER'22] Hvac: Removing I/O Bottleneck for Large-Scale Deep Learning Applications
- [ICDE'22] Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
- [ATC'21] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training
- [FAST'20] Quiver: An Informed Storage Cache for Deep Learning
- [ICPP'20] DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
- [arXiv'19] Faster Neural Network Training with Data Echoing
- [HotCloud'19] The Case for Unifying Data Loading in Machine Learning Clusters
- [SIGMOD'26] Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
- [EMNLP'25] Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
- [ICDE'25] Training Data Distribution Estimation for Optimized Pre-Training Data Management
- [arxiv'25] Mixtera: A Data Plane for Foundation Model Training
Data formats
- [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
- [VLDB'21] Progressive compressed records: Taking a byte out of deep learning data
Data pipeline fairness and correctness
- [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines
Data labeling automation
- [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision
- [ICSE'24] An Empirical Study on Low GPU Utilization of Deep Learning Jobs
- [NSDI'24] Characterization of Large Language Model Development in the Datacenter
- [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
- [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
- [arxiv'25] Semantic-Aware Scheduling for GPU Clusters with Large Language Models
- [arxiv'25] Tesserae: Scalable Placement Policies for Deep Learning Workloads
- [arxiv'25] LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems
- [EuroSys'25] Eva: Cost-Efficient Cloud-Based Cluster Scheduling
- [arxiv'25] TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
- [arxiv'24] Zeal: Rethinking Large-Scale Resource Allocation with "Decouple and Decompose"
- [TACO'24] Taming Flexible Job Packing in Deep Learning Training Clusters
- [SoCC'24] Kale: Elastic GPU Scheduling for Online DL Model Training
- [arxiv'24] Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
- [SC'24] PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters
- [OSDI'24] MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
- [ASPLOS'24] Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters
- [Middleware'24] Optimal Resource Efficiency with Fairness in Heterogeneous GPU Clusters
- [IPDPS'24] Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster
- [EuroSys'24] Blox: A Modular Toolkit for Deep Learning Schedulers
- [NSDI'24] Swing: Short-cutting Rings for Higher Bandwidth Allreduce
- [NSDI'24] Towards Domain-Specific Network Transport for Distributed DNN Training
- [NSDI'24] Vulcan: Automatic Query Planning for Live ML Analytics
- [NSDI'24] CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
- [Survey 🔍] [ACM CSUR'23] Deep Learning Workload Scheduling in GPU Datacenters: A Survey
- [arxiv'23] Energy-Efficient GPU Clusters Scheduling for Deep Learning
- [SC'23] EasyScale: Accuracy-consistent Elastic Training for Deep Learning
- [ICPP'23] CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel
- [ICPP'23] Embracing Uncertainty for Equity in Resource Allocation in ML Training
- [SOSP'23] Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
- [NSDI'23] Shockwave: Proactive, Fair, and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 1.2]
- [EuroSys'23] Lyra: Elastic Scheduling for Deep Learning Clusters
- [EuroSys'23] ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
- [ASPLOS'23] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
- [arxiv'22] Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads
- [Survey 🔍] [arxiv'22] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
- [SoCC'22] ESCHER: Expressive Scheduling with Ephemeral Resources
- [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (Synergy)
- [SIGCOMM'22] Multi-resource interleaving for deep learning training (Muri)
- [MLSys'21] Wavelet: Efficient DNN Training with Tick-Tock Scheduling
- [SoCC'21] Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
- [SC'21] Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (Helios)
- [OSDI'21] Privacy Budget Scheduling (DPF)
- [NSDI'21] Elastic Resource Sharing for Distributed Deep Learning (AFS)
- [OSDI'21] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
- [EuroSys'20] Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning (GandivaFair)
- [NSDI'20] Themis: Fair and Efficient GPU Cluster Scheduling
- [OSDI'20] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
- [OSDI'20] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (Gavel)
- [EuroSys'20] AlloX: Compute Allocation in Hybrid Clusters
- [MLSys'20] Resource Elasticity in Distributed Deep Learning
- [NSDI'19] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
- [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
- [EuroSys'18] Optimus: an efficient dynamic resource scheduler for deep learning clusters
- [OSDI'18] Gandiva: Introspective Cluster Scheduling for Deep Learning
- [ASPLOS'26] SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
- [NeurIPS'25] Synergistic Tensor and Pipeline Parallelism
- [arxiv'25] AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training
- [NeurIPS'25] First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training
- [arxiv'25] A Flexible Programmable Pipeline Parallelism Framework for Efficient DNN Training
- [arxiv'25] SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training
- [arxiv'25] AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models
- [arxiv'25] HAPT: Heterogeneity-Aware Automated Parallel Training on Heterogeneous Clusters
- [arxiv'25] Scaling Up Data Parallelism in Decentralized Deep Learning
- [arxiv'25] Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters
- [arxiv'25] TrainVerify: Equivalence-Based Verification for Distributed LLM Training
- [arxiv'25] Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage
- [arxiv'25] ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
- [arxiv'25] Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization
- [arxiv'25] H2: Towards Efficient Large-Scale LLM Training on Hyper-Heterogeneous Cluster over 1,000 Chips
- [arxiv'25] Balanced and Elastic End-to-end Training of Dynamic LLMs
- [arxiv'25] Parallel Scaling Law for Language Models
- [arxiv'25] Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters
- [arxiv'25] You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models
- [arxiv'25] WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
- [arxiv'25] Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
- [arxiv'25] Cornstarch: Distributed Multimodal Training Must Be Multimodality-Aware
- [arxiv'25] PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
- [arxiv'25] AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs
- [arxiv'25] Astra: Efficient and Money-saving Automatic Parallel Strategies Search on Heterogeneous GPUs
- [arxiv'25] Scaling Inference-Efficient Language Models
- [arxiv'25] MiniMax-01: Scaling Foundation Models with Lightning Attention
- [SC'25] Hypertron: Efficiently Scaling Large Models by Exploring High-Dimensional Parallelization Space
- [CLUSTER'25] BMPipe: Bubble-Memory Co-Optimization Strategy Planner for Very-Large DNN Training
- [OSDI'25] WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
- [ISCA'25] FRED: A Wafer-scale Fabric for 3D Parallel DNN Training
- [ISCA'25] MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training
- [ISCA'25] Scaling Llama 3 Training with Efficient Parallelism Strategies
- [MLSys'25] Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training
- [ICLR'25] TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
- [INFOCOM'25] Espresso: Cost-Efficient Large Model Training by Exploiting GPU Heterogeneity in the Cloud
- [ASPLOS'25] GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism
- [ASPLOS'25] FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
- [ASPLOS'25] Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling
- [EuroSys'25] JABAS: Joint Adaptive Batching and Automatic Scaling for DNN Training on Heterogeneous GPUs
- [arxiv'24] Automatically Planning Optimal Parallel Strategy for Large Language Models
- [arxiv'24] Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters
- [arxiv'24] Scaling Deep Learning Training with MPMD Pipeline Parallelism
- [arxiv'24] Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences
- [arxiv'24] HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models
- [arxiv'24] Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training
- [arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
- [arxiv'24] BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
- [arxiv'24] Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models
- [arxiv'24] SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
- [arxiv'24] FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
- [arxiv'24] PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
- [arxiv'24] Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters
- [arxiv'24] Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management
- [arxiv'24] FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
- [arxiv'24] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
- [arxiv'24] Unicron: Economizing Self-Healing LLM Training at Scale
- [arxiv'24] TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading
- [arxiv'24] Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
- [Survey 🔍] [arxiv'24] Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
- [arxiv'24] LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
- [arxiv'24] PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
- [arxiv'24] BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
- [arxiv'24] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- [arxiv'24] Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
- [arxiv'24] GRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning Models
- [arxiv'24] BitDelta: Your Fine-Tune May Only Be Worth One Bit
- [arxiv'24] NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
- [arxiv'24] Accelerating Parallel Sampling of Diffusion Models
- [arxiv'24] Training DNN Models over Heterogeneous Clusters with Optimal Performance
- [arxiv'24] Breaking MLPerf Training: A Case Study on Optimizing BERT
- [arxiv'24] LocMoE: A Low-overhead MoE for Large Language Model Training
- [arxiv'24] Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
- [TPDS'24] UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training
- [Survey 🔍] [ACM CSUR'24] Resource-efficient Algorithms and Systems of Foundation Models: A Survey
- [SOSP'24] Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
- [SOSP'24] Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
- [TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
- [NeurIPS'24] Rethinking Memory and Communication Costs for Efficient Data Parallel Training of Large Language Models
- [NeurIPS'24] SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation
- [SC'24] Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching
- [SC'24] Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
- [SoCC'24] Distributed training of large language models on AWS Trainium
- [TPDS'24] AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost
- [SOSP'24] TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections
- [ICPP'24] AutoPipe: Automatic Configuration of Pipeline Parallelism in Shared GPU Cluster
- [COLM'24] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
- [OSDI'24] nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
- [ATC'24] Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
- [ATC'24] FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
- [ATC'24] OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model
- [HPDC'24] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
- [ICML'24] Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
- [ICML'24] Integrated Hardware Architecture and Device Placement Search
- [MLSys'24] DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines
- [MobiCom'24] Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices
- [EuroSys'24] DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines
- [EuroSys'24] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
- [EuroMLSys@EuroSys'24] ML Training with Cloud GPU Shortages: Is Cross-Region the Answer?
- [ASPLOS'24] AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
- [ASPLOS'24] PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
- [EuroSys'24] Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation
- [NSDI'24] MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
- [NSDI'24] DISTMM: Accelerating Distributed Multi-modal Model Training
- [NSDI'24] Accelerating Neural Recommendation Training with Embedding Scheduling
- [NSDI'24] Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer
- [NSDI'24] QuickUpdate: a Real-Time Personalization System for Large-Scale Recommendation Models
- [NSDI'24] Scaling Large Language Model Training to More Than 10,000 GPUs
- [TKDE'24] Improving Automatic Parallel Training via Balanced Memory Workload Optimization
- extended version of Galvatron (VLDB'23)
- arxiv version (2023): link
- [ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [AAMAS'24] Holonic Learning: A Flexible Agent-based Distributed Machine Learning Framework
- [VLDB'24] Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
- [HPCA'24] Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search
- [NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
- [EuroSys'24] HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
- [arxiv'23] ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- [arxiv'23] FlexModel: A Framework for Interpretability of Distributed Large Language Models
- [arxiv'23] Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
- [arxiv'23] RTP: Rethinking Tensor Parallelism with Memory Deduplication
- [arxiv'23] FP8-LM: Training FP8 Large Language Models
- [arxiv'23] Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
- [arxiv'23] FLM-101B: An Open LLM and How to Train It with $100K Budget
- [arxiv'23] UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
- [arxiv'23] Modeling Parallel Programs using Large Language Models
- [arxiv'23] Proteus: Simulating the Performance of Distributed DNN Training
- [arxiv'23] Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
- [arxiv'23] Decoupled Model Schedule for Deep Learning Training
- [arxiv'23] RAF: Holistic Compilation for Deep Learning Model Training
- [arxiv'23] Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches
- [arxiv'23] Does compressing activations help model parallel training?
- [arxiv'23] Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
- [arxiv'23] Scaling Vision Transformers to 22 Billion Parameters
- [arxiv'23] Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform
- [arxiv'23] TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation
- [arxiv'23] SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction
- [arxiv'23] ATP: Adaptive Tensor Parallelism for Foundation Models
- [ICPP'23] Mercury: Fast and Optimal Device Placement for Large Deep Learning Models
- [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [CLUSTER'23] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
- [NeurIPS'23] ASPEN: Breaking Operator Barriers for Efficient Parallelization of Deep Neural Networks
- [NeurIPS'23] DeepPCR: Parallelizing Sequential Operations in Neural Networks
- [DAC'23] MixPipe: Efficient Bidirectional Pipeline Parallelism for Training Large-Scale Models
- [SC'23] Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency
- [SOSP'23] PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
- [SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
- [MICRO'23] Grape: Practical and Efficient Graphed Execution for Dynamic Deep Neural Networks on GPUs
- [HPCA'23] Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism
- [ACL'23] Sequence Parallelism: Long Sequence Training from System Perspective
- [CCGrid'23] A Deep Learning Pipeline Parallel Optimization Method
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [ATC'23] Accelerating Distributed MoE Training and Inference with Lina
- [ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
- [ATC'23] MSRL: Distributed Reinforcement Learning with Dataflow Fragments
- [Survey 🔍] [TPDS'23] A Survey on Auto-Parallelism of Large-Scale Deep Learning Training
- [ICML'23] SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
- [ICML'23] BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
- [ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- [NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- [NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- [NSDI'23] ARK: GPU-driven Code Execution for Distributed Deep Learning
- [SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [MLSys'23] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- [MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
- [TPDS'23] Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
- [PPoPP'23] Elastic Averaging for Efficient Pipelined DNN Training
- [PPoPP'23] Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems
- [VLDB'23] MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
- [VLDB'23] Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
- [ASPLOS'23] Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers
- [ASPLOS'23] Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression
- [arxiv'22] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- [arxiv'22] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- [ICPP'22] Tesseract: Parallelize the Tensor Parallelism Efficiently
- [NeurIPS'22] Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees
- [SoCC'22] Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
- [MLSys'22] Pathways: Asynchronous distributed dataflow for ML
- [MLSys'22] SRIFTY: Swift and Thrifty Distributed Neural Network Training on the Cloud
- [MLSys'22] Efficient Strong Scaling Through Burst Parallel Training
- [EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
- [ATC'22] Whale: Efficient Giant Model Training over Heterogeneous GPUs
- [NeurIPS'22] AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness
- [PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- [ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- [ICML'22] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- [HPDC'22] Hare: Exploiting Inter-job and Intra-job Parallelism of Distributed Machine Learning on Heterogeneous GPUs
- [OSDI'22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- [NSDI'22] Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks
- [arxiv'21] Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
- [arxiv'21] GSPMD: General and Scalable Parallelization for ML Computation Graphs
- [JMLR'21] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- [TPDS'21] TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism
- [ATC'21] Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.10]
- [MLSys'21] PipeMare: Asynchronous Pipeline Parallel DNN Training
- [ICLR'21] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- [NeurIPS'21] Piper: Multidimensional Planner for DNN Parallelization
- [ICML'21] Memory-Efficient Pipeline-Parallel DNN Training
- [ICML'21] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
- [ICML'21] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- [SC'21] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- [SC'21] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (PTD-P or Megatron-LM v2)
- [FAST'21] Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
- [PPoPP'21] DAPPLE: a pipelined data parallel approach for training large models
- [VLDB'21] Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
- [HPCA'20] AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
- [NeurIPS'20] Efficient Algorithms for Device Placement of DNN Graph Operators
- [arxiv'20] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- [KDD'20 Tutorial] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
- [VLDB'20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
- [SOSP'19] PipeDream: Generalized Pipeline Parallelism for DNN Training
- [NeurIPS'20] Language Models are Few-Shot Learners [From OpenAI]
- [arxiv'20] Scaling Laws for Neural Language Models [From OpenAI]
- [HPCA'19] HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
- [IEEE MICRO'19] Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
- [MLSys'19] Beyond data and model parallelism for deep neural networks (FlexFlow)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [EuroSys'19] Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks
- [EuroSys'19] Supporting Very Large Models using Automatic Dataflow Graph Partitioning (Tofu)
- [SOSP'19] A Generic Communication Scheduler for Distributed DNN Training Acceleration
- [NeurIPS'19] Mesh-TensorFlow: Deep Learning for Supercomputers
- [NeurIPS'19] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- [ICML'18] Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
- [Survey 🔍] [IJCAI'22] Survey on Efficient Training of Large Neural Networks
- [Survey 🔍] [ACM CSUR'19] Demystifying Parallel and Distributed Deep Learning
- [Survey 🔍] [ACM CSUR'19] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools
- [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
- [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
- [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework
For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.
- [SC'25] Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training
- [SIGMOD'25] NeutronHeter: Optimizing Distributed Graph Neural Network Training for Heterogeneous Clusters
- [ICDE'25] CaliEX: A Disk-Based Large-Scale GNN Training System with Joint Design of Caching and Execution
- [arxiv'25] Plexus: Taming Billion-edge Graphs with 3D Parallel GNN Training
- [HPCA'25] Mithril: A Scalable System for Deep GNN Training
- [arxiv'25] Armada: Memory-Efficient Distributed Training of Large-Scale Graph Neural Networks
- [VLDB'25] NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism
- [arxiv'24] FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
- [ICPP'24] GNNDrive: Reducing Memory Contention and I/O Congestion for Disk-based GNN Training
- [VLDB'24] NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams
- [arxiv'23] ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training
- [arxiv'23] Helios: An Efficient Out-of-core GNN Training System on Terabyte-scale Graphs with In-memory Performance
- [arxiv'23] GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism
- [MLSys'23] Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
- [SIGMOD'23] DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [EuroSys'23] MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks
- [KDD'22] Distributed Hybrid CPU and GPU training for Graph Neural Networks on Billion-Scale Heterogeneous Graphs
- [VLDB'22] TGL: a general framework for temporal GNN training on billion-scale graphs
- [OSDI'21] P3: Distributed Deep Graph Learning at Scale
-
[AAAI'26] Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving
-
[EuroSys'26] FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
-
[EuroSys'26] KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving
-
[EuroSys'26] TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling
-
[arxiv'25] TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
-
[arxiv'25] AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
-
[arxiv'25] Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
-
[arxiv'25] OmniInfer: System-Wide Acceleration Techniques for Optimizing LLM Serving Throughput and Latency
-
[arxiv'25] OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving
-
[arxiv'25] Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
-
[arxiv'25] CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
-
[arxiv'25] FengHuang: Next-Generation Memory Orchestration for AI Inferencing
-
[arxiv'25] Synera: Synergistic LLM Serving across Device and Cloud at Scale
-
[arxiv'25] DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
-
[Middleware'25] Argus: Quality-Aware High-Throughput Text-to-Image Inference Serving System
-
[arxiv'25] From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models
-
[arxiv'25] TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
-
[NeurIPS'25] SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
-
[arxiv'25] FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
-
[EMNLP'25] Distributed LLM Serving on Consumer-Grade GPUs by Reconciling Computation and Communication
-
[arxiv'25] Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
-
[MICRO'25] MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving
-
[MICRO'25] Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
-
[arxiv'25] SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
-
[arxiv'25] From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill
-
[CLUSTER'25] Scalable and Fast Inference Serving via Hybrid Communication Scheduling on Heterogeneous Networks
-
[arxiv'25] TridentServe: A Stage-level Serving System for Diffusion Pipelines
-
[arxiv'25] MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment
-
[Survey 🔍] [ACM CSUR'25] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
-
[SOSP'25] Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market
-
[SOSP'25] IC-Cache: Efficient Large Language Model Serving via In-context Caching
-
[SOSP'25] DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction
-
[arxiv'25] TetriServe: Efficient DiT Serving for Heterogeneous Image Generation
-
[arxiv'25] Parallax: Efficient LLM Inference Service over Decentralized Environment
-
[arxiv'25] RServe: Overlapping Encoding and Prefill for Efficient LMM Inference
-
[arxiv'25] Cronus: Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
-
[arxiv'25] Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
-
[COLM'25] OverFill: Two-Stage Models for Efficient Language Model Decoding
-
[ACM MM'25] TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
-
[SC'25] Hetis: Serving LLMs in Heterogeneous GPU Clusters with Fine-grained and Dynamic Parallelism
-
[arxiv'25] FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
-
[arxiv'25] AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving
-
[arxiv'25] Predictable LLM Serving on GPU Clusters
-
[SIGCOMM'25] SCX: Stateless KV-Cache Encoding for Cloud-Scale Confidential Transformer Serving
-
[arxiv'25] Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
-
[arxiv'25] Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics
-
[OSDI'25] BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching
-
[OSDI'25] WaferLLM: Large Language Model Inference at Wafer Scale
-
[OSDI'25] NanoFlow: Towards Optimal Large Language Model Serving Throughput
-
[arxiv'25] HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling
-
[arxiv'25] Equinox: Holistic Fair Scheduling in Serving Large Language Models
-
[arxiv'25] Efficient Mixed-Precision Large Language Model Inference with TurboMind
-
[ICML'25] Packrat: Automatic Reconfiguration for Latency Minimization in CPU-based DNN Serving
-
[arxiv'25] Kairos: Low-latency Multi-Agent Serving with Shared LLMs and Excessive Loads in the Public Cloud
-
[arxiv'25] Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
-
[arxiv'25] Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving
-
[arxiv'25] Unlock the Potential of Fine-grained LLM Serving via Dynamic Module Scaling
-
[ACL'25] SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation
-
[arxiv'25] Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding
-
[arxiv'25] Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
-
[arxiv'25] MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving
-
[CODEML @ ICML'25] TorchAO: PyTorch-Native Training-to-Serving Model Optimization
-
[arxiv'25] On Evaluating Performance of LLM Inference Serving Systems
-
[arxiv'25] PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications
-
[ICML'25] EPIC: Efficient Position-Independent Caching for Serving Large Language Models
-
[arxiv'25] SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference
-
[arxiv'25] Utility-Driven Speculative Decoding for Mixture-of-Experts
-
[ATC'25] DEEPSERVE: Serverless Large Language Model Serving at Scale
-
[ISCA'25] WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
-
[ISCA'25] Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
-
[ICLR'25] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
-
[arxiv'25] Cascadia: A Cascade Serving System for Large Language Models
-
[arxiv'25] Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing
-
[arxiv'25] SkyLB: A Locality-Aware Cross-Region Load Balancer for LLM Inference
-
[arxiv'25] EmbAdvisor: Adaptive Cache Management for Sustainable LLM Serving
-
[arxiv'25] SCORPIO: Serving the Right Requests at the Right Time for Heterogeneous SLOs in LLM Inference
-
[arxiv'25] HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing
-
[arxiv'25] ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs
-
[arxiv'25] TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
-
[arxiv'25] Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving
-
[OSDI'25] Clover: Exploiting Intra-device Parallelism for High Throughput Large Language Model Serving
-
[arxiv'25] ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
-
[arxiv'25] ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor
-
[arxiv'25] Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
-
[arxiv'25] Tempo: Application-aware LLM Serving with Mixed SLO Requirements
-
[arxiv'25] Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
-
[arxiv'25] Streaming, Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving
-
[arxiv'25] Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration
-
[Survey 🔍] [arxiv'25] Taming the Titans: A Survey of Efficient LLM Inference Serving
-
[MLSys'25] SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling
-
[MLSys'25] Marconi: Prefix Caching for the Era of Hybrid LLMs
-
[arxiv'25] PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
-
[arxiv'25] Circinus: Efficient Query Planner for Compound ML Serving
-
[arxiv'25] HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
-
[Mobicom'25] D2MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
-
[arxiv'25] SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference
-
[arxiv'25] gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
-
[arxiv'25] Optimizing SLO-oriented LLM Serving with PD-Multiplexing
-
[arxiv'25] SLO-Aware Scheduling for Large Language Model Inferences
-
[arxiv'25] Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
-
[ISPASS'25] Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
-
[arxiv'25] HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
-
[arxiv'25] DynaServe: Unified and Elastic Tandem-Style Execution for Dynamic Disaggregated LLM Serving
-
[arxiv'25] Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
-
[arxiv'25] Understanding and Optimizing Multi-Stage AI Inference Pipelines
-
[arxiv'24] Fast and Live Model Auto Scaling with O(1) Host Caching
-
[SIGMOD'25] Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
-
[EuroMLSys'25] Performance Aware LLM Load Balancer for Mixed Workloads
-
[MLSys'25] Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
-
[arxiv'25] WaferLLM: A Wafer-Scale LLM Inference System
-
[HPCA'25] PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
-
[HPCA'25] throttLL'eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
-
[arxiv'25] Niyama : Breaking the Silos of LLM Inference Serving
-
[ASPLOS'25] Aqua: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains
-
[ASPLOS'25] Past-Future Scheduler for LLM Serving under SLA Guarantees
-
[ASPLOS'25] Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management
-
[EuroSys'25] SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
-
[EuroSys'25] Multiplexing Dynamic Deep Learning Workloads with SLO-awareness in GPU Clusters
-
[EuroSys'25] NeuStream: Bridging Deep Learning Serving and Stream Processing
-
[arxiv'25] ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving
-
[arxiv'25] PipeBoost: Resilient Pipelined Architecture for Fast Serverless LLM Scaling
-
[ISCA'25] Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
-
[arxiv'25] Jenga: Effective Memory Management for Serving LLM with Heterogeneity
-
[arxiv'25] Collaborative Speculative Inference for Efficient LLM Inference Serving
-
[NSDI'25] SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
-
[arxiv'25] Seesaw: High-throughput LLM Inference via Model Re-sharding
-
[arxiv'25] SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding
-
[arxiv'25] ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput
-
[arxiv'25] Long-Context Inference with Retrieval-Augmented Speculative Decoding
-
[arxiv'25] Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
-
[arxiv'25] KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
-
[arxiv'25] Serving Models, Fast and Slow:Optimizing Heterogeneous LLM Inferencing Workloads at Scale
-
[arxiv'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
-
[arxiv'25] HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
-
[arxiv'25] Autellix: An Efficient Serving Engine for LLM Agents as General Programs
-
[MLSys'25] ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
-
[ICLR'25] HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment
-
[arxiv'25] Memory Offloading for Large Language Model Inference with Latency SLO Guarantees
-
[EuroSys'25] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
-
[ASPLOS'25] Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow
-
[ASPLOS'25] Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
-
[arxiv'25] MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving
-
[arxiv'25] Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
-
[arxiv'25] HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
-
[arxiv'25] DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs
-
[arxiv'25] DeepFlow: Serverless Large Language Model Serving at Scale
-
[arxiv'25] AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
-
[arxiv'25] EchoLM: Accelerating LLM Serving with Real-time Knowledge Distillation
-
[arxiv'25] OMEGA: A Low-Latency GNN Serving System for Large Graphs
-
[arxiv'25] PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
-
[arxiv'25] Hierarchical Autoscaling for Large Language Model Serving with Chiron
-
[arxiv'25] Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management
-
[arxiv'25] Accelerated Diffusion Models via Speculative Sampling
-
[MLSys'25] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
-
[EuroSys'25] A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro
-
[arxiv'24] LLM Inference Unveiled: Survey and Roofline Model Insights
-
[arxiv'24] Efficiently Serving LLM Reasoning Programs with Certaindex
-
[arxiv'24] LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System
-
[arxiv'24] TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications
-
[arxiv'24] Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
-
[arxiv'24] SYMPHONY: Improving Memory Management for LLM Inference Workloads
-
[arxiv'24] A System for Microserving of LLMs
-
[arxiv'24] HashAttention: Semantic Sparsity for Faster Inference
-
[arxiv'24] SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
-
[arxiv'24] Unifying KV Cache Compression for Large Language Models with LeanKV
-
[arxiv'24] PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
-
[Survey 🔍] [ACM CSUR'24] Resource-efficient Algorithms and Systems of Foundation Models: A Survey
-
[ICML'25] SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization [Code]
-
[ICLR'25] SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration [Code]
-
[ICML'25] SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference [Code]
-
[arxiv'24] Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
-
[ACL'24] LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
-
[ACL'24] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
-
[arxiv'24] EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
-
[IPDPS'24] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
-
[arxiv'24] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
-
[NeurIPS'24] Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting
-
[NeurIPS'24] Toward Efficient Inference for Mixture of Experts
-
[NeurIPS'24] Sequoia: Scalable and Robust Speculative Decoding
-
[arxiv'24] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
-
[SC'24] PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
-
[SC'24] SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing
-
[arxiv'24] SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference
-
[arxiv'24] V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM
-
[SenSys'24] LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning
-
[arxiv'24] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
-
[arxiv'24] NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
-
[MICRO'24] Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
-
[arxiv'24] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
-
[arxiv'24] Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs
-
[arxiv'24] POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
-
[PML4LRS @ ICLR2024] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
-
[arxiv'24] MagicPIG: LSH Sampling for Efficient LLM Generation
-
[arxiv'24] Revisiting SLO and Goodput Metrics in LLM Serving
-
[arxiv'24] EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
-
[arxiv'24] ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
-
[EuroSys'25] Fast State Restoration in LLM Serving with HCache
-
[arxiv'24] SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
-
[arxiv'24] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
-
[arxiv'24] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
-
[HPCA'24] KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers
-
[arxiv'24] Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference
-
[NeurIPS'24] Efficient LLM Scheduling by Learning to Rank
-
[arxiv'24] P/D-Serve: Serving Disaggregated Large Language Model at Scale
-
[arxiv'24] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
-
[SOSP'24] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
-
[SOSP'24] LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
-
[SOSP'24] Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
-
[SOSP'24] Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
-
[arxiv'24] LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
-
[ICPP'24] GMM: An Efficient GPU Memory Management-based Model Serving System for Multiple DNN Inference Models
-
[SIGCOMM'24] CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
-
[ES-FoMO @ ICML'24] CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Models
-
[OSDI'24] dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
-
[OSDI'24] Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
-
[OSDI'24] USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
-
[OSDI'24] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
-
[OSDI'24] ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
-
[OSDI'24] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
-
[OSDI'24] Llumnix: Dynamic Scheduling for Large Language Model Serving
-
[OSDI'24] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
-
[ATC'24] Power-aware Deep Learning Model Serving with μ-Serve
-
[ATC'24] Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
-
[ATC'24] PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch
-
[TPDS'24] ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG
-
[Survey 🔍] [arxiv'24] LLM Inference Serving: Survey of Recent Advances and Opportunities
-
[arxiv'24] Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
-
[arxiv'24] Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
-
[arxiv'24] One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
-
[OSDI'24] Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
- [arxiv'24] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
- [ISCA'24] Splitwise: Efficient generative LLM inference using phase splitting
- [ICML'24] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
- [ICML'24] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- [ICML'24] HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
- [ICML'24] EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- [ICML'24] MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
- [MobiSys'24] ARISE: High-Capacity AR Offloading Inference Serving via Proactive Scheduling
- [MobiSys'24] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs
- [arxiv'24] Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
- [MLSys'24] HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
- [MLSys'24] S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- [MLSys'24] Vidur: A Large-Scale Simulation Framework For LLM Inference
- [arxiv'24] The CAP Principle for LLM Serving
- [WWW'24] λGrapher: A Resource-Efficient Serverless System for GNN Serving through Graph Sharing
- [ICML'24] CLLMs: Consistency Large Language Models
- [arxiv'24] BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- [EuroSys'24] Model Selection for Latency-Critical Inference Serving
- [arxiv'24] Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
- [arxiv'24] Learn To be Efficient: Build Structured Sparsity in Large Language Models
- [arxiv'24] Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
- [ISCA'24] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
- [arxiv'24] ALTO: An Efficient Network Orchestrator for Compound AI Systems
- [ASPLOS'24] ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
- [ASPLOS'24] NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
- [arxiv'24] ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys
- [arxiv'24] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- [ICML'24] DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
- [ICLR'24] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
- [arxiv'24] FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
- [arxiv'24] Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
- [arxiv'24] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
- [arxiv'24] LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
- [NSDI'24] Approximate Caching for Efficiently Serving Diffusion Models
- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [arxiv'24] ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
- [arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
- [arxiv'24] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- [arxiv'24] Accelerating Retrieval-Augmented Language Model Serving with Speculation
- [arxiv'24] CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
- [arxiv'24] Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
- [arxiv'24] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- [Survey 🔍] [arxiv'24] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
- [arxiv'24] Learned Best-Effort LLM Serving
- [arxiv'24] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- [ASPLOS'24] SpotServe: Serving Generative Large Language Models on Preemptible Instances
- [arxiv'23] DeltaZip: Multi-Tenant Language Model Serving via Delta Compression
- [arxiv'23] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
- [arxiv'23] Fairness in Serving Large Language Models
- [arxiv'23] Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices
- [arxiv'23] Punica: Multi-Tenant LoRA Serving
- [arxiv'23] Pipeline Parallelism for DNN Inference with Practical Performance Guarantees
- [arxiv'23] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- [arxiv'23] High-throughput Generative Inference of Large Language Models with a Single GPU
- [NeurIPS'23] SpecTr: Fast Speculative Decoding via Optimal Transport
- [HPDC'23] Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
- [SOSP'23] Paella: Low-latency Model Serving with Virtualized GPU Scheduling
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [MLSys'23] Efficiently Scaling Transformer Inference
- [EuroSys'23] Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access
- [EuroSys'23] Tabi: An Efficient Multi-Level Inference System for Large Language Models
- [EuroSys'23] Pocket: ML Serving from the Edge
- [OSDI'23] AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- [NSDI'23] SHEPHERD: Serving DNNs in the Wild
- [VLDB'23] Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
- [ICML'23] Fast Inference from Transformers via Speculative Decoding
- [SIGMOD'22] Serverless Data Science - Are We There Yet? A Case Study of Model Serving
- [OSDI'22] Orca: A Distributed Serving System for Transformer-Based Generative Models
- [OSDI'22] Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences
- [ATC'22] SOTER: Guarding Black-box Inference for General Neural Networks at the Edge
- [ATC'22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
- [ATC'22] Tetris: Memory-efficient Serverless Inference through Tensor Sharing
- [ATC'22] PetS: A Unified Framework for Parameter-Efficient Transformers Serving
- [ATC'21] INFaaS: Automated Model-less Inference Serving
- [SoCC'21] Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving
- [arxiv'21] Supporting Massive DLRM Inference through Software Defined Memory
- [MobiCom'20] SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud
- [SC'25] UltraAttn: Efficiently Parallelizing Attention through Hierarchical Context-Tiling
- [SC'25] RingX: Scalable Parallel Attention for Long-Context Learning on HPC
- [NeurIPS'25 Spotlight] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training [Code]
- [arxiv'25] SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention [Code]
- [MLSys'25] FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference
- [MLSys'25] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- [NeurIPS'24] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- [ICLR'24] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- [NeurIPS'22] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- [EuroSys'26] Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading
- [EuroSys'26] MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
- [arxiv'25] Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems
- [arxiv'25] MicroMoE: Fine-Grained Load Balancing for Mixture-of-Experts with Token Scheduling
- [arxiv'25] MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
- [arxiv'25] Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models
- [arxiv'25] FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
- [arxiv'25] DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction
- [arxiv'25] BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference
- [SC'25] Diff-MoE: Efficient Batched MoE Inference with Priority-Driven Differential Expert Caching
- [SC workshop'25] Compression Error Sensitivity Analysis for Different Experts in MoE Model Inference
- [SC workshop'25] Batch Tiling on Attention: Efficient Mixture of Experts Training on Wafer-Scale Processors
- [arxiv'25] Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining
- [arxiv'25] MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
- [arxiv'25] ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts
- [arxiv'25] MergeMoE: Efficient Compression of MoE Models via Expert Output Merging
- [MICRO'25] Optimizing All-to-All Collective Communication with Fault Tolerance on Torus Networks
- [arxiv'25] GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models
- [arxiv'25] Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting
- [arxiv'25] ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models
- [SOSP'25] KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models
- [arxiv'25] MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE
- [arxiv'25] DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning
- [arxiv'25] Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts
- [NeurIPS'25] BrainMoE: Cognition Joint Embedding via Mixture-of-Expert Towards Robust Brain Foundation Model
- [NeurIPS'25] S’MoRE: Structural Mixture of Residual Experts for Parameter-Efficient LLM Fine-tuning
- [NeurIPS'25] The Omni-Expert: A Computationally Efficient Approach to Achieve a Mixture of Experts in a Single Expert Model
- [NeurIPS'25] MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE
- [NeurIPS'25] FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts
- [NeurIPS'25] FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed Mixture-of-Experts Training
- [NeurIPS'25] FlashMoE: Fast Distributed MoE in a Single Kernel [Code]
- [arxiv'25] Steering MoE LLMs via Expert (De)Activation
- [arxiv'25] HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
- [arxiv'25] LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference
- [SC'25] MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?
- [arxiv'25] LongCat-Flash Technical Report
- [arxiv'25] Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
- [arxiv'25] HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference
- [arxiv'25] MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models
- [SIGCOMM'25] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
- [ICLR'25] Ada-K Routing: Boosting the Efficiency of MoE-based LLMs
- [arxiv'25] Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
- [ICML'25] I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts
- [arxiv'25] Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
- [SC'25] X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms
- [SIGCOMM'25] MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training
- [arxiv'25] HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
- [arxiv'25] PiKV: KV Cache Management System for Mixture of Experts
- [arxiv'25] BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs
- [arxiv'25] Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
- [ACL'25] EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
- [ACL'25] FOLDMOE: Efficient Long Sequence MoE Training via Attention-MoE Pipelining
- [arxiv'25] The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts
- [arxiv'25] Muon is Scalable for LLM Training
- [arxiv'25] Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model
- [arxiv'25] Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
- [arxiv'25] HarMoEny: Efficient Multi-GPU Inference of MoE Models
- [arxiv'25] Load Balancing Mixture of Experts with Similarity Preserving Routers
- [arxiv'25] MoE-GPS: Guidlines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing
- [arxiv'25] EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models
- [arxiv'25] CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning
- [arxiv'25] PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval
- [arxiv'25] Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
- [arxiv'25] Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony
- [ICML'25] FloE: On-the-Fly MoE Inference on Memory-constrained GPU
- [arxiv'25] PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning
- [arxiv'25] Faster MoE LLM Inference for Extremely Large Models
- [arxiv'25] Accelerating Mixture-of-Experts Training with Adaptive Expert Replication
- [NAACL'25] Marrying LLMs with Dynamic Forecasting: A Graph Mixture-of-expert Perspective
- [NAACL'25] Sparser Mixture-of-Adapters with Cross-Layer Generalization
- [NAACL'25] SimSMoE: Toward Efficient Training Mixture of Experts via Solving Representational Collapse
- [Mobicom'25] D2MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
- [arxiv'25] MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
- [arxiv'25] Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
- [arxiv'25] Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models
- [arxiv'25] Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
- [arxiv'25] MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
- [arxiv'25] C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
- [arxiv'25] Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models
- [arxiv'25] S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning
- [DAC'25] HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
- [arxiv'25] Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations
- [arxiv'25] HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs
- [TKDE'25] A Survey on Mixture of Experts
- [ICLR'25] NetMoE: Accelerating MoE Training through Dynamic Sample Placement
- [arxiv'25] ProMoE: Fast MoE-based LLM Serving using Proactive Caching
- [arxiv'25] Mixture of Lookup Experts
- [EuroSys'25] Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores
- [EuroMLSys'25] Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
- [EuroMLSys'25] Accelerating MoE Model Inference with Expert Sharding
- [arxiv'25] eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
- [KDD'25] ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
- [arxiv'25] Continual Pre-training of MoEs: How robust is your router?
- [arxiv'25] Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
- [arxiv'25] Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
- [MLSys'25] Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- [arxiv'25] CoSMoEs: Compact Sparse Mixture of Experts
- [CVPR'25] DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models
- [ASPLOS'25] CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory
- [arxiv'25] Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- [arxiv'25] BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference
- [arxiv'25] DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs
- [arxiv'25] MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing
- [arxiv'25] Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
- [arxiv'25] Fair-MoE: Fairness-Oriented Mixture of Experts in Vision-Language Models
- [arxiv'25] fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving
- [TPDS'25] EfficientMoE: Optimizing Mixture-of-Experts Model Training with Adaptive Load Balance
- [arxiv'25] Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism
- [NAACL'25] MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs
- [arxiv'25] BTS: Harmonizing Specialized Experts into a Generalist LLM
- [ASPLOS'25] FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
- [arxiv'25] Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
- [MICRO'24] SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
- [TPDS'24] MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
  - Journal version of [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [arxiv'24] DeepSeek-V3 Technical Report
- [arxiv'24] HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy
- [arxiv'24] Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
- [arxiv'24] ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
- [Survey 🔍] [arxiv'24] A Survey on Inference Optimization Techniques for Mixture of Experts Models
- [arxiv'24] DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
- [arxiv'24] Llama 3 Meets MoE: Efficient Upcycling
- [arxiv'24] Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
- [arxiv'24] Mixture of A Million Experts
- [arxiv'24] MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems
- [arxiv'24] Toward Inference-optimal Mixture-of-Expert Large Language Models
- [arxiv'24] Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection
- [MLArchSys'24 @ ISCA'24] MoE-ERAS: Expert Residency Aware Selection
- [arxiv'24] MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
- [arxiv'24] Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing
- [arxiv'24] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- [COLM'24] Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
- [ME-FoMo @ ICLR'24] Scaling Laws for Fine-Grained Mixture of Experts
- [arxiv'24] UOE: Unlearning One Expert Is Enough For Mixture-of-experts LLMS
- [ML for Sys workshop @ NeurIPS'24] IFMoE: An Inference Framework Design for Fine-grained MoE
- [ML for Sys workshop @ NeurIPS'24] TurboMoE: Enhancing MoE Model Training with Smart Kernel-Fusion and Data Transformation
- [arxiv'24] Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts
- [arxiv'24] MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
- [EMNLP'24] MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning
- [EMNLP'24] Mixture of Diverse Size Experts
- [EMNLP'24] AdaMOE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
- [ACL'24] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
- [SoCC'24] MoEsaic: Shared Mixture of Experts
- [KDD'24] Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing
- [arxiv'24] Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
- [IPDPS'24] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- [arxiv'24] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
- [arxiv'24] Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
- [NeurIPS'24] Toward Efficient Inference for Mixture of Experts
- [arxiv'24] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
- [SC'24] APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes
- [NeurIPS'24] GraphMETRO: Mitigating Complex Graph Distribution Shifts via Mixture of Aligned Experts
- [arxiv'24] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
- [arxiv'24] Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
- [NeurIPS'24] LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
- [arxiv'24] Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
- [NeurIPS'24] Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
- [arxiv'24] Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
- [PML4LRS @ ICLR'24] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- [arxiv'24] Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
- [arxiv'24] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
- [arxiv'24] Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
- [arxiv'24] MoH: Multi-Head Attention as Mixture-of-Head Attention
- [arxiv'24] AT-MoE: Adaptive Task-planning Mixture of Experts via LoRA Approach
- [NeurIPS'24 (Spotlight)] Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
- [arxiv'24] Aria: An Open Multimodal Native Mixture-of-Experts Model
- [arxiv'24] MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More
- [arxiv'24] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
- [arxiv'24] Upcycling Large Language Models into Mixture of Experts
- [arxiv'24] No Need to Talk: Asynchronous Mixture of Language Models
- [arxiv'24] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
- [arxiv'24] HMoE: Heterogeneous Mixture of Experts for Language Modeling
- [arxiv'24] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
- [arxiv'24] AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies
- [arxiv'24] Layerwise Recurrent Router for Mixture-of-Experts
- [arxiv'24] Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
- [SRW @ ACL'24] MoExtend: Tuning New Experts for Modality and Task Extension
- [arxiv'24] MoDE: Effective Multi-task Parameter Efficient Fine-Tuning with a Mixture of Dyadic Experts
- [arxiv'24] Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
- [arxiv'24] Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
- [ICML'24] Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
- [MLSys'24] QMoE: Sub-1-Bit Compression of Trillion-Parameter Models
- [arxiv'24] CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
- [arxiv'24] AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts
- [SIGIR'24] M3oE: Multi-Domain Multi-Task Mixture-of-Experts Recommendation Framework
- [EuroSys'24] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
- [arxiv'24] MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts
- [ICLR'24] Mixture of LoRA Experts
- [arxiv'24] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- [arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
- [IJCAI'24] LocMoE: A Low-overhead MoE for Large Language Model Training
- [ISCA'24] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
- [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [EMNLP'23] Adaptive Gating in Mixture-of-Experts based Language Models
- [ICLR'23] Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
- [arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
- [arxiv'23] Fast Inference of Mixture-of-Experts Language Models with Offloading
- [ATC'23] Accelerating Distributed MoE Training and Inference with Lina
- [OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm
- [SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
- [ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- [MLSys'23] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- [MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
- [arxiv'22] ST-MoE: Designing Stable and Transferable Sparse Expert Models
- [PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- [SustaiNLP @ EMNLP'22] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production
- [NeurIPS'22] Mixture-of-Experts with Expert Choice Routing
- [ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- [ICML'22] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- [JMLR'22] Switch transformers: scaling to trillion parameter models with simple and efficient sparsity
- [EMNLP'21] Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
- [ICLR'17] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- [EuroSys'26] Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering
- [arxiv'25] Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
- [arxiv'25] FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
- [arxiv'25] GPU-Initiated Networking for NCCL
- [SC'25] CPU- and GPU-initiated Communication Strategies for Conjugate Gradient Methods on Large GPU Clusters
- [SC'25] SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication
- [HotNets'25] Photonic Rails in ML Datacenters
- [arxiv'25] DMA Collectives for Efficient ML Communication Offloads
- [arxiv'25] Collective Communication for 100k+ GPUs
- [arxiv'25] Uno: A One-Stop Solution for Inter- and Intra-Datacenter Congestion Control and Reliable Connectivity
- [SOSP'25] Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
- [MICRO'25] SuperMesh: Energy-Efficient Collective Communications for Accelerators
- [MICRO'25] SkipReduce: (Interconnection) Network Sparsity to Accelerate Distributed Machine Learning
- [MICRO'25] Optimizing All-to-All Collective Communication with Fault Tolerance on Torus Networks
- [arxiv'25] MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications
- [arxiv'25] Toward Co-adapting Machine Learning Job Shape and Cluster Topology
- [APNET'25] Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization
- [arxiv'25] Efficient AllReduce with Stragglers
- [arxiv'25] TASP: Topology-aware Sequence Parallelism
- [NAIC @ SIGCOMM'25] Chronos: Prescheduled circuit switching for LLM training
- [arxiv'25] Bine Trees: Enhancing Collective Operations by Optimizing Communication Locality
- [SIGCOMM'25] Falcon: A Reliable, Low Latency Hardware Transport
- [SIGCOMM'25] ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs
- [SIGCOMM'25] From ATOP to ZCube: Automated Topology Optimization Pipeline and A Highly Cost-Effective Network Topology for Large Model Training
- [SIGCOMM'25] Astral: A Datacenter Infrastructure for Large Language Model Training at Scale
- [SIGCOMM'25] ResCCL: Resource-Efficient Scheduling for Collective Communication
- [OSDI'25] ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization
- [OSDI'25] Enabling Efficient GPU Communication over Multiple NICs with FuseLink
- [arxiv'25] RoCE BALBOA: Service-enhanced Data Center RDMA for SmartNICs
- [arxiv'25] RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems
- [arxiv'25] Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
- [APNET'25] Congestion Control for AI Workloads with Message-Level Signaling
- [ASPLOS'25] Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning
- [ISCA'25] Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models
- [arxiv'25] NoLoCo: No-all-reduce Low Communication Training Method for Large Models
- [arxiv'25] TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
- [arxiv'25] FLASH: Fast All-to-All Communication in GPU Clusters
- [arxiv'25] MCMComm: Hardware-Software Co-Optimization for End-to-End Communication in Multi-Chip-Modules
- [arxiv'25] GenTorrent: Scaling Large Language Model Serving with An Overlay Network
- [arxiv'25] Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
- [arxiv'25] FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation
- [arxiv'25] An Extensible Software Transport Layer for GPU Networking (UCCL) [Code]
- [HPCA'25] Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization
- [arxiv'25] HeteroPod: XPU-Accelerated Infrastructure Offloading for Commodity Cloud-Native Applications
- [Survey 🔍] [arxiv'25] GPU-centric Communication Schemes for HPC and ML Applications
- [EuroMLSys'25] TAGC: Optimizing Gradient Communication in Distributed Transformer Training
- [arxiv'25] UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture
- [MLSys'25] TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
- [arxiv'25] Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
- [NSDI'25] AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training
- [NSDI'25] Efficient Direct-Connect Topologies for Collective Communications
- [arxiv'25] InfinitePOD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers
- [IEEE MICRO'25] Understanding and Characterizing Communication Characteristics for Distributed Transformer Models
- [arxiv'25] In-Network Preprocessing of Recommender Systems on Multi-Tenant SmartNICs
- [arxiv'25] Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
- [arxiv'25] The Power of Negative Zero: Datatype Customization for Quantized Large Language Models
- [arxiv'25] mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training
- [NSDI'25] OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
- [APNET'24] Understanding Communication Characteristics of Distributed Training
- [arxiv'24] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication
- [arxiv'24] The Landscape of GPU-Centric Communication
- [arxiv'24] Revisiting the Time Cost Model of AllReduce
- [arxiv'24] LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
- [HotInfra'24] Immediate Communication for Distributed AI Tasks
- [NeurIPS'24] SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training
- [SC'24] Optimizing Distributed ML Communication with Fused Computation-Collective Operations
- [SC'24] Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
- [NeurIPS'24] LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
- [arxiv'24] LumosCore: Highly Scalable LLM Clusters with Optical Interconnect
- [TPDS'24] AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost
- [HOTI'24] Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives
- [HOTI'24] Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters
- [SC'24] Switch-Less Dragonfly on Wafers: A Scalable Interconnection Architecture based on Wafer-Scale Integration
- [HPDC'24] Near-Optimal Wafer-Scale Reduce
- [HPDC'24] Efficient all-to-all Collective Communication Schedules for Direct-connect Topologies
- [arxiv'24] HiCCL: A Hierarchical Collective Communication Library
- [ICS'24] gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
- [ICS'24] Snoopie: A Multi-GPU Communication Profiler and Visualizer
- [arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
- [arxiv'24] Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
- [arxiv'24] Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
- [arxiv'24] Demystifying the Communication Characteristics for Distributed Transformer Models
- [ICPP'24] Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning
- [NAIC @ SIGCOMM'24] Proof-of-Concept of a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
- [NAIC @ SIGCOMM'24] Eloquent: A More Robust Transmission Scheme for LLM Token Streaming
- [NAIC @ SIGCOMM'24] OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs
- [HotNets'24] I've Got 99 Problems But FLOPS Ain't One
- [HotNets'24] MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning
- [HotNets'22] Congestion Control in Machine Learning Clusters
- [SIGCOMM'24] Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
- [SIGCOMM'24] RDMA over Ethernet for Distributed Training at Meta Scale
- [SIGCOMM'24] Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
- [SIGCOMM'24] MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud
- [SIGCOMM'24] Crux: GPU-Efficient Communication Scheduling for Deep Learning Training
- [arxiv'24] MLTCP: Congestion Control for DNN Training
- [arxiv'24] ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics
- [ICLR'24] ZeRO++: Extremely Efficient Collective Communication for Large Model Training
- [ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [arxiv] [openreview]
- [MLSys'24] L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning
- [MLSys'24] Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
- [ASPLOS'24] T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
- [ASPLOS'24] TCCL: Discovering Better Communication Paths for PCIe GPU Clusters
- [ASPLOS'24] Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
- [ASPLOS'24] Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM
- [NSDI'24] THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
- [Survey 🔍] [arxiv'23] Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
- [arxiv'23] Optimized Network Architectures for Large Language Model Training with Billions of Parameters
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Zen: Near-Optimal Sparse Tensor Synchronization for Distributed DNN Training
- [arxiv'23] TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training
- [INFOCOM'23] Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks
- [ICDCS'23] bbTopk: Bandwidth-Aware Sparse Allreduce with Blocked Sparsification for Efficient Distributed Training
- [ICML'23] CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
- Related to DT-FM (NeurIPS'22)
- [IPDPS'23] MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
- [ASPLOS'23] MSCCLang: Microsoft Collective Communication Language
- [ASPLOS'23] Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
- [EuroSys'23] A2TP: Aggregator-aware In-network Aggregation for Multi-tenant Learning
- [MLSys'23] Cupcake: A Compression Optimizer for Scalable Communication-Efficient Distributed Training
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- [NSDI'23] Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
- [NSDI'23] TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
- [NSDI'23] ARK: GPU-driven Code Execution for Distributed Deep Learning
- [EuroSys'22] Out-of-order backprop: an effective scheduling technique for deep learning
- [ISCA'22] Themis: a network bandwidth-aware collective scheduling policy for distributed training of DL models
- [ISCA'22] Software-hardware co-design for fast and scalable training of deep learning recommendation models
- [SC'22] HammingMesh: A Network Topology for Large-Scale Deep Learning
- [PPoPP'22] Near-optimal sparse allreduce for distributed deep learning
- [MLSys'22] Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning (P^2)
- [ASPLOS'22] Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads (CoCoNET)
- [EuroSys'21] DGCL: an efficient communication library for distributed GNN training
- [ICLR'21] Multi-Level Local SGD for Heterogeneous Hierarchical Networks
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.5]
- [SC'21] Flare: flexible in-network allreduce
- [NSDI'21] Scaling Distributed Machine Learning with In-Network Aggregation
- [ISCA'21] Enabling compute-communication overlap in distributed deep learning training platforms
- [PPoPP'21] Synthesizing optimal collective algorithms (SCCL)
- [SIGCOMM'21] SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training
- [ISCA'20] An in-network architecture for accelerating shared-memory multiprocessor collectives
- [NeurIPS'20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- [PPoPP'20] Taming unbalanced training workloads in deep learning with partial collective operations
- [MLSys'20] Blink: Fast and Generic Collectives for Distributed ML
- [MLSys'20] PLink: Discovering and Exploiting Datacenter Network Locality for Efficient Cloud-based Distributed Training
- [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
- [MLSys'19] Priority-based Parameter Propagation for Distributed DNN Training (P3)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [SOSP'19] A generic communication scheduler for distributed DNN training acceleration (ByteScheduler)
- [ATC'17] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
- [arxiv'25] FFTrainer: Fast Failover in Large-Language Model Training with Almost-Free State Management
- [arxiv'25] FailSafe: High-performance Resilient Serving
- [arxiv'25] GoCkpt: Gradient-Assisted Multi-Step overlapped Checkpointing for Efficient LLM Training
- [MICRO'25] Optimizing All-to-All Collective Communication with Fault Tolerance on Torus Networks
- [APSys'25] Indispensable CPU-centric Checkpointing for GPUs
- [CLUSTER'25] Capricorn: Efficient In-Memory Checkpointing for MoE Model Training with Dynamicity Awareness
- [arxiv'25] MoE-PHDS: One MoE checkpoint for flexible runtime sparsity
- [arxiv'25] ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training
- [arxiv'25] Efficient AllReduce with Stragglers
- [SOSP'25] Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
- [SOSP'25] Robust LLM Training Infrastructure at ByteDance
- [SC'25] LowDiff: Efficient Frequent Checkpointing via Low-Cost Differential for High-Performance Distributed Training Systems
- [OSDI'25] Understanding Stragglers in Large Model Training Using What-if Analysis
- [SIGMOD'25] Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
- [arxiv'25] Checkmate: Zero-Overhead Model Checkpointing via Network Gradient Replication
- [ATC'25] SAVE: Software-Implemented Fault Tolerance for Model Inference against GPU Memory Bit Flips
- [ATC'25] Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism
- [arxiv'25] Adaptra: Straggler-Resilient Hybrid-Parallel Training with Pipeline Adaptation
- [arxiv'25] Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
- [arxiv'25] Characterizing GPU Resilience and Impact on AI/HPC Systems
- [NSDI'25] BCP: A Unified Checkpointing System for Large Foundation Model Development
- [NSDI'25] Minder: Faulty Machine Detection for Large-scale Distributed Model Training
- [EuroSys'25] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
- [ASPLOS'25] PCcheck: Persistent Concurrent Checkpointing for ML
- [arxiv'24] FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
- [arxiv'24] MoEtion: Efficient and Reliable Checkpointing for Mixture-of-Experts Models at Scale
- [arxiv'24] MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
- [arxiv'24] TrainMover: Efficient ML Training Live Migration with No Memory Overhead
- [arxiv'24] Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
- [arxiv'24] ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
- [arxiv'24] Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
- [arxiv'24] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
- [arxiv'24] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
- [SOSP'24] ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
- [HPDC'24] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
- [EuroSys'24] Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures
- [NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
- [arxiv'23] Unicron: Economizing Self-Healing LLM Training at Scale
- [VLDB'23] Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
- [SOSP'23] GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
- [SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
- [NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- [EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
- [ATC'22] Sibylla: To Retry or Not To Retry on Deep Learning Job Failure
- [MLSys'21] Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
- [FAST'21] CheckFreq: Frequent, Fine-Grained DNN Checkpointing
- [ICSE'20] An Empirical Study on Program Failures of Deep Learning Jobs
- [SC'25] HELM: Characterizing Unified Memory Accesses to Improve GPU Performance under Memory Oversubscription
- [SC'25] MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
- [arxiv'25] CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator
- [arxiv'25] Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training
- [ISCA'25] Forest: Access-aware GPU UVM Management
- [EuroSys'25] MEPipe: Democratizing LLM Training with Memory-Efficient Slice-Level Pipeline Scheduling on Cost-Effective Accelerators
- [EuroSys'25] Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
- [FAST'25 WiP] Baton: Orchestrating GPU Memory for LLM Training on Heterogeneous Cluster
- [CGO'25] IntelliGen: Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization
- [arxiv'25] Memory Analysis on the Training Course of DeepSeek Models
- [IJCAI'24] LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs
- [MICRO'24] SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
- [arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
- [TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
- [ICML'24] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- [ASPLOS'24] GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Quantized Distributed Training of Large Models with Convergence Guarantees (QSDP)
- [arxiv'23] Does compressing activations help model parallel training?
- [SoCC'23] Towards GPU Memory Efficiency for Distributed Training at Scale
- [VLDB'23] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [HPCA'23] MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism
- [HPCA'23] Tensor Movement Orchestration in Multi-GPU Training Systems
- [IJCAI'23] OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
- [ICLR'22] LoRA: Low-Rank Adaptation of Large Language Models
- algorithmic method for memory efficiency
- [VLDB'22] Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
- [ATC'21] ZeRO-Offload: Democratizing Billion-Scale Model Training
- [ICLR'21] ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
- [ICLR'21] Dynamic Tensor Rematerialization
- [SC'21] ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning
- [HPCA'21] Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning
- [MLSys'20] Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
- [ASPLOS'20] Capuchin: Tensor-based GPU Memory Management for Deep Learning
- [ASPLOS'20] SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
- [ESEC/FSE'20] Estimating GPU memory consumption of deep learning models
- [SC'20] ZeRO: memory optimizations toward training trillion parameter models
- [ISCA'18] Gist: Efficient Data Encoding for Deep Neural Network Training
- [PPoPP'18] Superneurons: dynamic GPU memory management for training deep neural networks
- [MICRO'16] vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
- [arxiv'16] Training Deep Nets with Sublinear Memory Cost
- [SC workshop'25] WAGES: Workload-Aware GPU Sharing System for Energy-Efficient Serverless LLM Serving
- [SOSP'25] LithOS: An Operating System for Efficient Machine Learning on GPUs
- [arxiv'25] Towards Efficient and Practical GPU Multitasking in the Era of LLM
- [arxiv'25] Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
- [OSDI'25] XSched: Preemptive Scheduling for Diverse XPUs
- [EuroSys'25] Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing
- [PPOPP'25] SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs
- [arxiv'24] PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
- [SC'24] ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments
- [arxiv'24] Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads
- [ICPP'24] MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters
- [ASPLOS'24] RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing
- [EuroSys'24] Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
- [ATC'23] Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent
- [NSDI'23] Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
- [ICPP'23] FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference
- [arxiv'23] GACER: Granularity-Aware ConcurrEncy Regulation for Multi-Tenant Deep Learning
- [arxiv'23] MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
- [SoCC'22] MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
- [PACT'22] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
- [ATC'21] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
- [MLSys'20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
- [OSDI'20] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
- [OSDI'20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
- [RTAS'19] Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs
- [arxiv'25] Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References
- [arxiv'25] Dato: A Task-Based Programming Model for Dataflow Accelerators
- [arxiv'25] Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
- [NeurIPS'25] REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
- [SOSP'25] Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling
- [MICRO'25] StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs
- [OSDI'25] PipeThreader: Software-Defined Pipelining for Efficient DNN Execution
- [OSDI'25] QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach
- [OSDI'25] Mirage: A Multi-Level Superoptimizer for Tensor Programs
- [OSDI'25] KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads
- [arxiv'25] TileLang: A Composable Tiled Programming Model for AI Systems
- [arxiv'25] Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis
- [arxiv'25] DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training
- [ASPLOS'25] Mosaic: Exploiting Instruction-Level Parallelism on Deep Learning Accelerators with iTex Tessellation
- [ASPLOS'25] Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning
- [arxiv'25] Hercules: A Compiler for Productive Programming of Heterogeneous Systems
- [CC'25] LLM Compiler: Foundation Language Models for Compiler Optimization
- [CGO'25] IntelliGen: Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization
- [SOSP'24] Scaling Deep Learning Computation over the Inter-core Connected Intelligence Processor with T10
- [OSDI'23] Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning
- [OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
- [OSDI'23] Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
- [OSDI'23] EINNET: Optimizing Tensor Programs with Derivation-Based Transformations
- [OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm
- [OSDI'22] ROLLER: Fast and Efficient Tensor Compilation for Deep Learning
- [OSDI'20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
- [OSDI'20] Ansor: Generating High-Performance Tensor Programs for Deep Learning
- [ASPLOS'20] FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
- [OSDI'18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
- [EuroSys'26] Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering
- [arxiv'25] Flash Multi-Head Feed-Forward Network
- [arxiv'25] Iris: First-Class Multi-GPU Programming Experience in Triton
- [arxiv'25] AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
- [arxiv'25] ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
- [SC'25] HyTiS: Hybrid Tile Scheduling for GPU GEMM with Enhanced Wave Utilization and Cache Locality
- [SC'25] UltraAttn: Efficiently Parallelizing Attention through Hierarchical Context-Tiling
- [arxiv'25] HipKittens: Fast and Furious AMD Kernels
- [TACO'25] HuntKTm: Hybrid Scheduling and Automatic Management for Efficient Kernel Execution on Modern GPUs
- [NeurIPS'25] FlashMoE: Fast Distributed MoE in a Single Kernel
- [MLSys'25] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- [arxiv'25] LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving
- [arxiv'25] TileLang: A Composable Tiled Programming Model for AI Systems
- [PLDI'25] Task-Based Tensor Computations on Modern GPUs
- [TACO'25] Kitsune: Enabling Dataflow Execution on GPUs
- [ICLR'25] ThunderKittens: Simple, Fast, and Adorable Kernels
- [ASPLOS'25] Composing Distributed Computations Through Task and Kernel Fusion
- [MLSys'25] FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference
- [arxiv'24] ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs
- [arxiv'24] Flex Attention: A Programming Model for Generating Optimized Attention Kernels
- [NeurIPS'24] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- [ICLR'24] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- [CGO'24] A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
- [RTAS'24] Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management
- slides: link
- [arxiv'23] Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
- [OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
- [arxiv'21] Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads
- [SIGMETRICS'21] Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels
- [NeurIPS'20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- [NeurIPS'22] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- [RTSS'17] GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed
- [SC'25] UltraAttn: Efficiently Parallelizing Attention through Hierarchical Context-Tiling
- [SC'25] RingX: Scalable Parallel Attention for Long-Context Learning on HPC
- [arxiv'25] Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism
- [NeurIPS'25] StarTrail: Concentric Ring Sequence Parallelism for Efficient Near-Infinite-Context Transformer Model Training
- [arxiv'25] Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
- [arxiv'25] Efficient Long-context Language Model Training by Core Attention Disaggregation
- [SOSP'25] DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism
- [arxiv'25] Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
- [arxiv'25] Strata: Hierarchical Context Caching for Long Context Language Model Serving
- [arxiv'25] TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving
- [ACL'25] MiniKV: Pushing the Limits of 2-Bit KV Cache via Compression and System Co-Design for Efficient Long Context Inference
- [arxiv'25] HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
- [arxiv'25] SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
- [arxiv'25] Training Long-Context LLMs Efficiently via Chunk-wise Optimization
- [arxiv'25] SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training
- [ASPLOS'25] FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
- [arxiv'25] XAttention: Block Sparse Attention with Antidiagonal Scoring
- [arxiv'25] SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading
- [arxiv'25] ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
- [arxiv'25] Long-Context Inference with Retrieval-Augmented Speculative Decoding
- [PODC'25] System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
- [arxiv'25] ParallelComp: Parallel Long-Context Compressor for Length Extrapolation
- [arxiv'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
- [arxiv'25] MoBA: Mixture of Block Attention for Long-Context LLMs
- [arxiv'25] Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
- [arxiv'25] APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
- [SIGMOD'25] MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
- [arxiv'25] Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning
- [arxiv'25] Adjoint sharding for very long context training of state space models
- [arxiv'24] LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System
- [arxiv'24] Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training
- [ICLR'24] Efficient Streaming Language Models with Attention Sinks [Code]
- [SOSP'24] LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
- [arxiv'24] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
- [arxiv'24] Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
- [NeurIPS'24 Workshop] Long Context RAG Performance of Large Language Models
- [arxiv'24] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
- [arxiv'24] Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
- [arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
- [COLM'24] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
- [arxiv'24] FocusLLM: Scaling LLM's Context by Parallel Decoding
- [Survey 🔍] [IJCAI'24] X-former Elucidator: Reviving Efficient Attention for Long Context Language Modeling
For a comprehensive list of quantization papers, refer to https://github.com/Efficient-ML/Awesome-Model-Quantization.
- [arxiv'25] Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
- [EMNLP'25] Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems
- [NeurIPS'25] 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
- [arxiv'25] MergeMoE: Efficient Compression of MoE Models via Expert Output Merging
- [CLUSTER'25] SplitQuant: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and Adaptive Quantization
- [JMLR'25] BitNet: 1-bit Pre-training for Large Language Models
- [OSDI'25] DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization
- [arxiv'25] TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network
- [arxiv'25] DECA: A Near-Core LLM Decompression Accelerator Supporting Out-of-Order Invocation
- [arxiv'25] ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition
- [ISCA'25] Transitive Array: An Efficient GEMM Accelerator with Result Reuse
- [arxiv'24] Accelerating Distributed Deep Learning using Lossless Homomorphic Compression
- [ICML'24] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
- [ACL'23] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
- [ICLR'23] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- [OSDI'23] AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models
- [EuroSys'23] Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies
- [ICML'22] TSPipe: Learn from Teacher Faster with Pipelines
- [VLDB'25] PS-MI: Accurate, Efficient, and Private Data Valuation in Vertical Federated Learning
- [arxiv'24] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
- [MLSys'24] LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning
- [arxiv'24] FedEx: Expediting Federated Learning over Heterogeneous Mobile Devices by Overlapping and Participant Selection
- [KDD'24] FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model
- [CCGrid'24] Apodotiko: Enabling Efficient Serverless Federated Learning in Heterogeneous Environments
- [EuroSys'24] Dordis: Efficient Federated Learning with Dropout-Resilient Differential Privacy
- [arxiv'24] Decoupled Vertical Federated Learning for Practical Training on Vertically Partitioned Data
- [SAC'24] Training Heterogeneous Client Models using Knowledge Distillation in Serverless Federated Learning
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [arxiv'23] Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization
- [IMWUT'23] AttFL: A Personalized Federated Learning Framework for Time-series Mobile and Embedded Sensor Data Processing
- [Survey 🔍] [FGCS'23] Model aggregation techniques in federated learning: A comprehensive survey
- [SoCC'23] Auxo: Heterogeneity-Mitigating Federated Learning via Scalable Client Clustering
- [MLSys'23] GlueFL: Reconciling Client Sampling and Model Masking for Bandwidth Efficient Federated Learning
- [WWW'23] To Store or Not? Online Data Selection for Federated Learning with Limited Storage
- [EuroSys'23] REFL: Resource-Efficient Federated Learning
- [VLDB'23] FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
- [RecSys'22] Towards Fair Federated Recommendation Learning: Characterizing the Inter-Dependence of System and Data Heterogeneity
- [TMLR'22] Optimal Client Sampling for Federated Learning
- [ICML'22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale
- [MobiSys'22] FedBalancer: data and pace control for efficient federated learning on heterogeneous clients
- [MobiCom'22] PyramidFL: A Fine-grained Client Selection Framework for Efficient Federated Learning
- [MLSys'22] PAPAYA: Practical, Private, and Scalable Federated Learning
- [AISTATS'22] Federated Learning with Buffered Asynchronous Aggregation
- [NeurIPS'21] Federated Reconstruction: Partially Local Federated Learning
- [NeurIPS'21] FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout
- [OSDI'21] Oort: Efficient Federated Learning via Guided Participant Selection
- [MICRO'21] AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning
- [MLSys'19] Towards Federated Learning at Scale: System Design
- [Survey 🔍] [ACM CSUR'22] Federated Learning for Smart Healthcare: A Survey
- [CCS'25] MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs
- [USENIX Security'25] Phantom: Privacy-Preserving Deep Neural Network Model Obfuscation in Heterogeneous TEE and GPU System
- [ASPLOS'24] LazyDP: Co-Designing Algorithm-Software for Scalable Training of Differentially Private Recommendation Models
- [NeurIPS'24] Nimbus: Secure and Efficient Two-Party Inference for Transformers
- [ACL'24] SecFormer: Fast and Accurate Privacy-Preserving Inference for Transformer Models via SMPC
- [S&P'24] BOLT: Privacy-Preserving, Accurate and Efficient Inference for Transformers
- [DAC'23] Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators
- [ICLR'23] MPCFormer: fast, performant and private Transformer inference with MPC
- [NeurIPS'22] Iron: Private Inference on Transformers
- [ASPLOS'25] Towards End-to-End Optimization of LLM-based Applications with Ayo
- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [OSDI'24] ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications
- [ICML'22] Efficient Online ML API Selection for Multi-Label Classification Tasks (FrugalMCT)
- [NeurIPS'20] FrugalML: How to use ML Prediction APIs more accurately and cheaply
- [arxiv'25] AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
- [arxiv'25] ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training
- [NeurIPS'25] REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
- [arxiv'25] Barbarians at the Gate: How AI is Upending Systems Research [Code]
- [arxiv'25] SuperCoder: Assembly Program Superoptimization with Large Language Models
- [HotOS'25] How I learned to stop worrying and love learned OS policies
- [VLDB'25] E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model
- [SenSys'25] CheckMate: LLM-Powered Approximate Intermittent Computing
- [ICSE'25] Large Language Models as Configuration Validators
- [NeurIPS'24] IaC-Eval: A code generation benchmark for Infrastructure-as-Code programs
- [arxiv'24] Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
- [arxiv'24] LLMTune: Accelerate Database Knob Tuning with Large Language Models
- [SIGCOMM'24] NetLLM: Adapting Large Language Models for Networking
- [arxiv'24] LLM-Enhanced Data Management
- [arxiv'24] MPIrigen: MPI Code Generation through Domain-Specific Language Models
- [arxiv'24] Can Large Language Models Write Parallel Code?
- [arxiv'23] LLM-Assisted Code Cleaning For Training Accurate Code Generators
- [arxiv'23] Large Language Models for Compiler Optimization
- [VLDB'23] How Large Language Models Will Disrupt Data Management
- [NeurIPS'25] CATransformers: Carbon Aware Transformers Through Joint Model-Hardware Optimization
- [MICRO'25] SuperMesh: Energy-Efficient Collective Communications for Accelerators
- [MICRO'25] Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
- [arxiv'25] VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
- [arxiv'25] GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving
- [arxiv'25] Power Stabilization for AI Training Datacenters
- [arxiv'25] The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization
- [arxiv'25] EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
- [NSDI'25] GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters
- [HPCA'25] throttLL'eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
- [arxiv'25] EcoServe: Designing Carbon-Aware AI Inference Systems
- [arxiv'25] Life-Cycle Emissions of AI Hardware: A Cradle-To-Grave Approach and Generational Trends
- [arxiv'24] GreenLLM: Disaggregating Large Language Model Serving on Heterogeneous GPUs for Lower Carbon Emissions
- [arxiv'24] EaCO: Resource Sharing Dynamics and Its Impact on Energy Efficiency for DNN Training
- [arxiv'24] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- [SOSP'24] Perseus: Removing Energy Bloat from Large Model Training
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [ATC'23] EnvPipe: Performance-preserving DNN Training Framework for Saving Energy
- [NSDI'23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
- [ICDE'25] SAGE: A Framework of Precise Retrieval for RAG
- [SOSP'25] HedraRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows
- [ISCA'25] HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
- [arxiv'25] Patchwork: A Unified Framework for RAG Serving
- [arxiv'25] RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
- [arxiv'25] Long-Context Inference with Retrieval-Augmented Speculative Decoding
- [VLDB'25] Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
- [arxiv'24] Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference
- [arxiv'24] RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation
- [arxiv'24] Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation
- [arxiv'24] Accelerating Retrieval-Augmented Language Model Serving with Speculation
- [NeurIPS'24 Workshop] Long Context RAG Performance of Large Language Models
- [arxiv'25] Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs
- [MICRO'25] PyTorchSim: A Comprehensive, Fast, and Accurate NPU Simulation Framework
- [MICRO'25] Swift and Trustworthy Large-Scale GPU Simulation with Fine-Grained Error Modeling and Hierarchical Clustering
- [arxiv'25] Frontier: Simulating the Next Generation of LLM Inference Systems
- [NAIC @ SIGCOMM'25] MLSynth: Towards Synthetic ML Traces
- [NAIC @ SIGCOMM'25] Simulating LLM training workloads for heterogeneous compute and network infrastructure
- [arxiv'25] Maya: Optimizing Deep Learning Training Workloads using Emulated Virtual Accelerators
- [NSDI'25] Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation
- [ASPLOS'25] Forecasting GPU Performance for Deep Learning Training and Inference
- [MLSys'24] Vidur: A Large-Scale Simulation Framework For LLM Inference
- [arxiv'25] Measuring Agents in Production
- [arxiv'25] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
- [arxiv'25] Aragog: Just-in-Time Model Routing for Scalable Serving of Agentic Workflows
- [arxiv'25] AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
- [ML for Systems @ NeurIPS'25] Agentic Bridge Framework: Closing the Gap Between Agentic Capability and Performance Benchmarks
- [arxiv'25] Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
- [arxiv'25] Sherlock: Reliable and Efficient Agentic Workflow Execution
- [arxiv'25] A CPU-Centric Perspective on Agentic AI
- [SAA'25] Useful Agentic AI: A Systems Outlook
- [SAA'25] Toward Systems Foundations for Agentic Exploration
- [SAA'25] Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First
- [SAA'25] Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving
- [SAA'25] Tetris: Efficient and Predictive KV Cache Offloading for Agentic and Reasoning Workloads
- [SAA'25] GPU Memory Prediction for Multimodal Model Training
- [SAA'25] DMAS-Forge: A Framework for Transparent Deployment of AI Applications as Distributed Systems
- [SAA'25] Automated Annotation Inference for MCP-based Agents
- [SAA'25] EARL: Efficient Agentic Reinforcement Learning Systems for Large Language Models
- [SAA'25] Unified Agentic Interfaces is All You Need for AI Agent Observability
- [arxiv'25] Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution
- [arxiv'25] MobiAgent: A Systematic Framework for Customizable Mobile Agents
- [ICML'25] The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models
- [SIGCOMM'25] Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework
- [arxiv'25] rStar2-Agent: Agentic Reasoning Technical Report
- [COLM'25] R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents
- [arxiv'25] Efficient and Scalable Agentic AI with Heterogeneous Systems
- [arxiv'25] Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC
- [arxiv'25] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
- [ASPLOS'25] ReCA: Integrated Acceleration for Real-Time and Efficient Cooperative Embodied Autonomous Agents
- [arxiv'25] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
- [arxiv'24] AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
- [ICML'24] AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls
- [arxiv'25] ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models
- [arxiv'25] RLHFSpec: Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting
- [arxiv'25] Fast LLM Post-training via Decoupled and Best-of-N Speculation
- [arxiv'25] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
- [arxiv'25] Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
- [arxiv'25] WeChat-YATT: A Scalable, Simple, Efficient, and Production Ready Training Library
- [arxiv'25] The Path Not Taken: RLVR Provably Learns Off the Principals
- [arxiv'25] AReaL-Hex: Accommodating Asynchronous RL Training over Heterogeneous GPUs
- [NeurIPS'25] Greedy Sampling Is Provably Efficient for RLHF
- [arxiv'25] Ask a Strong LLM Judge when Your Reward Model is Uncertain
- [arxiv'25] RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
- [arxiv'25] Laminar: A Scalable Asynchronous RL Post-Training Framework
- [arxiv'25] The Art of Scaling Reinforcement Learning Compute for LLMs
- [arxiv'25] xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning
- [arxiv'25] Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
- [arxiv'25] Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
- [arxiv'25] Spurious Rewards: Rethinking Training Signals in RLVR
- [arxiv'25] Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
- [arxiv'25] RL in the Wild: Characterizing RLVR Training in LLM Deployment
- [arxiv'25] APRIL: Active Partial Rollouts in Reinforcement Learning to tame long-tail generation
- [NeurIPS'25] AReaL: Asynchronous Reinforcement Learning for Efficient and Scalable Language Reasoning
- [arxiv'25] ToRL: Scaling Tool-Integrated RL
- [arxiv'25] VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
- [arxiv'25] Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
- [Survey 🔍] [arxiv'25] A Survey of Reinforcement Learning for Large Reasoning Models
- [arxiv'25] RewardDance: Reward Scaling in Visual Generation
- [arxiv'25] floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL
- [arxiv'25] ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
- [arxiv'25] History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL
- [COLM'25] Sample Efficient Preference Alignment in LLMs via Active Exploration
- [COLM'25] Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
- [arxiv'25] SeamlessFlow: A Trainer Agent Isolation RL Framework Achieving Bubble-Free Pipelines via Tag Scheduling
- [arxiv'25] SPECS: Faster Test-Time Scaling through Speculative Drafts
- [arxiv'25] Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models
- [COLM'25] Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback
- [arxiv'25] ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
- [IPDPS'25] FlexRLHF: A Flexible Placement and Parallelism Framework for Efficient RLHF Training
- [arxiv'25] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
- [ACL'25] RLKGF: Reinforcement Learning from Knowledge Graph Feedback Without Human Annotations
- [arxiv'25] Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
- [arxiv'25] Scaling RL to Long Videos
- [arxiv'25] Test-Time Training Done Right
- [arxiv'25] LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training
- [arxiv'25] On-Policy RL with Optimal Reward Baseline
- [arxiv'25] StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
- [arxiv'25] DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- [MLSys'25] ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
- [arxiv'25] Reward Reasoning Model
- [arxiv'24] Optimizing RLHF Training for Large Language Models with Stage Fusion
For a comprehensive list of multimodal papers, refer to https://github.com/friedrichor/Awesome-Multimodal-Papers.
- [arxiv'25] MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
- [SoCC'25] ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
- [arxiv'25] FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
- [arxiv'25] OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
- [arxiv'25] Fast-dLLM v2: Efficient Block-Diffusion LLM
- [arxiv'25] Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
- [arxiv'25] Mordal: Automated Pretrained Model Selection for Vision Language Models
- [arxiv'25] Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
- [arxiv'24] LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
- [Survey 🔍] [arxiv'24] A Survey of Resource-efficient LLM and Multimodal Foundation Models
- [MICRO'25] HLX: A Unified Pipelined Architecture for Optimized Performance of Hybrid Transformer-Mamba Language Models
- [MLSys'25] Marconi: Prefix Caching for the Era of Hybrid LLMs
- [arxiv'25] Cyclotron: Compilation of Recurrences to Distributed and Systolic Architectures
- [arxiv'25] Streaming Tensor Program: A streaming abstraction for dynamic parallelism
- [arxiv'25] OckBench: Measuring the Efficiency of LLM Reasoning
- [SC workshop'25] Roofline Analysis of Tightly-Coupled CPU-GPU Superchips: A Study on MI300A and GH200
- [NeurIPS'25] Spark Transformer: Reactivating Sparsity in FFN and Attention
- [MICRO'25] ORCHES: Orchestrated Test-Time-Compute-based LLM Reasoning on Collaborative GPU-PIM HEterogeneous System
- [arxiv'25] vAttention: Verified Sparse Attention
- [USENIX ;login:] Wafer-Scale AI Compute: A System Software Perspective
- [arxiv'25] Training Large Language Models To Reason In Parallel With Global Forking Tokens
- [arxiv'25] How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models
- [arxiv'25] Slm-mux: Orchestrating small language models for reasoning
- [arxiv'25] Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
- [arxiv'25] Less is More: Recursive Reasoning with Tiny Networks
- [arxiv'25] ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
- [arxiv'25] Rethinking Thinking Tokens: LLMs as Improvement Operators
- [arxiv'25] Generalized Parallel Scaling with Interdependent Generations
- [arxiv'25] Composer: A Search Framework for Hybrid Neural Architecture Design
- [arxiv'25] dParallel: Learnable Parallel Decoding for dLLMs
- [NeurIPS'25] Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
- [arxiv'25] AI Factories: It's time to rethink the Cloud-HPC divide
- [arxiv'25] Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving
- [arxiv'25] SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences
- [arxiv'25] Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs
- [arxiv'25] LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
- [arxiv'25] DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
- [VLDB'25] Powerful GPUs or Fast Interconnects: Analyzing Relational Workloads on Modern GPUs
- [arxiv'25] Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning
- [arxiv'25] Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
- [arxiv'25] LobRA: Multi-tenant Fine-tuning over Heterogeneous Data
- [arxiv'25] Copilot Arena: A Platform for Code LLM Evaluation in the Wild
- [arxiv'25] ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
- [MICRO'25] Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving
- [CFAgentic @ ICML'25] LLMSELECTOR: Learning to Select Models in Compound AI Systems
- [arxiv'25] Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication
- [arxiv'25] Prompt-to-Leaderboard: Prompt-Adaptive LLM Evaluations [Code]
- [ISCA'25] Meta’s Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
- [ISCA'25] Debunking the CUDA Myth Towards GPU-based AI Systems
- [ISCA'25] UGPU: Dynamically Constructing Unbalanced GPUs for Enhanced Resource Efficiency
- [arxiv'25] SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
- [arxiv'25] Reinforcement Pre-Training
- [arxiv'25] MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models
- [NSDI'25] Optimizing RLHF Training for Large Language Models with Stage Fusion
- [arxiv'25] Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately
- [arxiv'25] Faster Video Diffusion with Trainable Sparse Attention
- [arxiv'25] SSR: Speculative Parallel Scaling Reasoning in Test-time
- [arxiv'25] Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
- [arxiv'25] Think Only When You Need with Large Hybrid-Reasoning Models
- [MLSys'25] Optimizing LLM Queries in Relational Data Analytics Workloads
- [arxiv'25] Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
- [arxiv'25] Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads
- [arxiv'25] Process Reward Models That Think
- [arxiv'25] Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning
- [arxiv'25] Sleep-time Compute: Beyond Inference Scaling at Test-time
- [arxiv'25] SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
- [arxiv'25] Scaling Laws for Native Multimodal Models
- [arxiv'25] OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
- [arxiv'25] NotebookOS: A Notebook Operating System for Interactive Training with On-Demand GPUs
- [arxiv'25] Alchemist: Towards the Design of Efficient Online Continual Learning System
- [arxiv'25] Linear Attention for Efficient Bidirectional Sequence Modeling
- [arxiv'25] S*: Test Time Scaling for Code Generation
- [arxiv'25] Optimizing Model Selection for Compound AI Systems
- [arxiv'25] Efficient-vDiT: Efficient Video Diffusion Transformers with Attention Tile
- [arxiv'25] BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation
- [arxiv'25] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
- [arxiv'25] Adaptive Semantic Prompt Caching with VectorQ
- [EuroSys'25] HybridFlow: A Flexible and Efficient RLHF Framework
- [arxiv'25] Measuring GPU utilization one level deeper
- [ASPLOS'25] PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption
- [arxiv'24] Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
- [arxiv'24] Debunking the CUDA Myth Towards GPU-based AI Systems
- [arxiv'24] XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
- [CPAL'24 (PMLR)] Jaxpruner: A Concise Library for Sparsity Research
- [arxiv'24] Scorch: A Library for Sparse Deep Learning
- [arxiv'24] Drowning in Documents: Consequences of Scaling Reranker Inference
- [arxiv'24] Crafting Interpretable Embeddings for Language Neuroscience by Asking LLMs Questions
- [arxiv'24] Computational Bottlenecks of Training Small-scale Large Language Models
- [Survey 🔍] [arxiv'24] A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness
- [NeurIPS'24] Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
- [arxiv'24] Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
- [arxiv'24] DroidSpeak: Enhancing Cross-LLM Communication
- [arxiv'24] Disaggregating Embedding Recommendation Systems with FlexEMR
- [arxiv'24] JudgeBench: A Benchmark for Evaluating LLM-based Judges
- [arxiv'24] You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- [arxiv'24] Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native
- [ATC'24] Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor
- [arxiv'23] Efficiently Programming Large Language Models using SGLang
- [MICRO'23] Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads
- [arxiv'23] Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- [arxiv'22] Training language models to follow instructions with human feedback
This repository is motivated by:
- https://github.com/HuaizhengZhang/Awesome-System-for-Machine-Learning
- https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
- https://github.com/ganler/ResearchReading
- https://jeongseob.github.io/readings_mlsys.html
- https://github.com/chwan1016/awesome-gnn-systems
- https://github.com/ConnollyLeon/awesome-Auto-Parallelism