Hiding host-device synchronization via CUDA stream interleaving
17 min read -

A deep dive on data transfer bottlenecks, their identification, and their resolution with the help…
15 min read -

A deep dive on data transfer bottlenecks, their identification, and their resolution with the help…
13 min read -

A deep dive on data transfer bottlenecks, their identification, and their resolution with the help…
16 min read -

Tips for accelerating AI/ML on CPU — Part 2
11 min read -

Flyin’ Like a Lion on Intel Xeon
20 min read -

How to upgrade and optimize legacy AI/ML models
19 min read -

Overcoming the Hidden Performance Traps of Variable-Shaped Tensors: Efficient Data Sampling in PyTorch
Deep Learning
PyTorch Model Performance Analysis and Optimization — Part 11
10 min read -

A demonstration of PyTorch’s exciting new export feature on a HuggingFace model
18 min read -

Since its inception in PyTorch 2.0 in March 2023, the evolution of torch.compile has been one of…
31 min read