byungsoo-oh/ml-systems-papers

Paper List for Machine Learning Systems

Paper list for broad topics in machine learning systems

NOTE: Survey papers are annotated with a [Survey 🔍] prefix.

Table of Contents

Data Processing

Data pipeline optimization

General

Preprocessing stalls

Fetch stalls (I/O)

Specific workloads (GNN, DLRM)

Caching and distributed storage for ML training

LLM data plane

Others

Data formats

  • [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
  • [VLDB'21] Progressive compressed records: Taking a byte out of deep learning data

Data pipeline fairness and correctness

  • [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines

Data labeling automation

  • [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision

Training System

ML job analysis on GPU clusters

  • [ICSE'24] An Empirical Study on Low GPU Utilization of Deep Learning Jobs
  • [NSDI'24] Characterization of Large Language Model Development in the Datacenter
  • [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
  • [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)

Resource scheduling

Distributed training

AutoML

  • [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
  • [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
  • [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework

GNN training system

For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.

Inference System

Attention Optimization

Mixture of Experts (MoE)

Communication Optimization & Network Infrastructure for Distributed ML

Fault tolerance & Straggler mitigation

GPU Memory Management & Optimization

GPU Sharing

Compiler

GPU Kernel Optimization

LLM Long Context

Model Compression

For a comprehensive list of quantization papers, refer to https://github.com/Efficient-ML/Awesome-Model-Quantization.

Federated Learning

Privacy-Preserving ML

ML APIs & Application-Side Optimization

ML for Systems

Energy Efficiency

Retrieval-Augmented Generation (RAG)

Simulation

Systems for Agentic AI

RL Post-Training

Multimodal

For a comprehensive list of multimodal papers, refer to https://github.com/friedrichor/Awesome-Multimodal-Papers.

Hybrid LLMs

Others

References

This repository is motivated by:
