Transformer architectures are the backbone of many modern large language and generative AI models. As these models grow in size, training runs consume more GPU hours and more engineering iteration time. Accelerating transformers is therefore more than a performance optimization: it directly affects how quickly teams can experiment and how large a model they can afford to train. NVIDIA Hopper and NVIDIA Blackwell GPUs help solve this problem by introducing support for low-precision formats, including FP8 and NVFP4.
Transformers spend much of their training time in GEMMs, and low-precision formats speed up training mainly by making those matrix multiplications faster and cheaper. However, your transformer config does not tell you which GEMMs are actually running in your model. If you want to understand where training time goes, you need to turn your transformer config and batch size into the exact M×K×N matrix shapes your model executes, then benchmark those shapes across precisions. This will help you determine the optimal precision for your architecture before committing to a more expensive training run.
NVIDIA Transformer Engine (TE) handles quantization and kernel dispatch, which unlocks these low-precision formats. This post shows you how to move from high-level model settings to concrete GEMM workloads, profile them with a microbenchmark, and estimate where lower precision will actually translate into speedups for your transformer-based models. The use case features ESM2-15B, a protein language model used for biological sequence understanding and drug discovery.
Model configuration and training inputs
Suppose you’re working with a 15B-parameter model such as ESM2-15B. It will have a config such as:
hidden_size: 5120
intermediate_size: 20480
num_attention_heads: 40
num_hidden_layers: 48
Your training configuration is:
micro_batch_size: 32
sequence_length: 1024
The benchmark tool takes these hyperparameters directly. A single command derives the GEMM shapes, benchmarks them across precisions, and computes the full speedup analysis:
python benchmarks/gemm/benchmark_gemm.py \
--hidden_size 5120 \
--intermediate_size 20480 \
--num_attention_heads 40 \
--num_hidden_layers 48 \
--micro_batch_size 32 \
--sequence_length 1024 \
-o ./images/esm2_model_config_speedup.png
Note: To disable the Blackwell-specific formats, add --no-fp8 --no-fp4. This leaves BF16 plus the three FP8 recipes that work on Hopper.
- --no-fp8 disables MXFP8
- --no-fp4 disables NVFP4
Using autocast mode versus prequantizing
By default, the tool runs in autocast mode, which is what TE does during training: inputs are dynamically quantized to the target precision before each GEMM, so the measured time includes both the quantization cost and the GEMM kernel itself. This gives you a realistic per-GEMM picture of a training step.
The tool computes M = 32 × 1024 = 32,768 tokens, derives all 12 GEMM shapes, benchmarks each across the enabled precisions, and prints the full results. Fprop, Dgrad, and Wgrad shapes are all benchmarked separately to capture the impact of different matrix aspect ratios on kernel selection.
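The shape derivation can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark tool's own code: the function name `derive_gemm_shapes` is hypothetical, the Fprop and Dgrad shapes follow the dimensions quoted in this post (a fused QKV projection with N = 3 × hidden_size, and Dgrad swapping K and N), and the (K, M, N) ordering used for Wgrad is an assumed convention in which the token dimension becomes the reduction dimension.

```python
from typing import Dict, Tuple

# (M, K, N) describes an (M x K) @ (K x N) matrix multiplication.
Shape = Tuple[int, int, int]


def derive_gemm_shapes(
    hidden_size: int,
    intermediate_size: int,
    micro_batch_size: int,
    sequence_length: int,
) -> Dict[str, Shape]:
    """Hypothetical helper: map a transformer config to per-layer GEMM shapes."""
    m = micro_batch_size * sequence_length  # tokens per micro-batch

    # Forward-pass (Fprop) shapes for the four projections in a layer.
    fprop = {
        "QKV proj": (m, hidden_size, 3 * hidden_size),
        "Attn out": (m, hidden_size, hidden_size),
        "MLP up": (m, hidden_size, intermediate_size),
        "MLP down": (m, intermediate_size, hidden_size),
    }

    shapes: Dict[str, Shape] = {}
    for name, (tokens, k, n) in fprop.items():
        shapes[f"{name} / Fprop"] = (tokens, k, n)
        # Dgrad (input gradient) swaps K and N relative to Fprop.
        shapes[f"{name} / Dgrad"] = (tokens, n, k)
        # Wgrad (weight gradient) reduces over the token dimension.
        # The (K, M, N) ordering here is an assumption for illustration.
        shapes[f"{name} / Wgrad"] = (k, tokens, n)
    return shapes


if __name__ == "__main__":
    for label, shape in derive_gemm_shapes(5120, 20480, 32, 1024).items():
        print(f"{label:20s} M x K x N = {shape}")
```

For the ESM2-15B settings above, this yields 12 entries with M = 32,768 on every Fprop and Dgrad shape, which matches the count the tool reports.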

To isolate raw GEMM kernel performance, add --pre-quantize. This prequantizes all inputs once before the timed loop, so the measured time reflects only GEMM kernel execution: no dynamic quantization, no block-scaling computation, and no format conversion during the timed region.
Note that FP8 DelayedScaling always runs in autocast mode, even with --pre-quantize, because it relies on an amax history that requires dynamic quantization. It therefore has no prequantized variant and is excluded from the prequantized chart.
python benchmarks/gemm/benchmark_gemm.py \
--hidden_size 5120 \
--intermediate_size 20480 \
--num_attention_heads 40 \
--num_hidden_layers 48 \
--micro_batch_size 32 \
--sequence_length 1024 \
--pre-quantize \
-o ./images/esm2_model_config_speedup_prequant.png

Comparing the autocast and prequantized speedups tells you exactly how much quantization overhead costs: NVFP4 versus BF16 goes from 2.69x (autocast) to 4.01x (kernel-only). The gap between these two numbers is the overhead from dynamic quantization, Hadamard transforms, and block scaling that occurs in each training step.
Use autocast results for predicting real training speedups. This is what TE actually does during training. Use prequantized results to understand whether quantization overhead is the bottleneck, or to compare raw tensor core throughput across precisions independent of the quantization implementation.
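The gap between the two speedups can be turned into a single number: the share of autocast time spent on quantization work rather than in the GEMM kernel. The sketch below assumes that both speedups are measured against the same BF16 baseline time; the function name is hypothetical and is not part of the benchmark tool.

```python
def quantization_overhead_fraction(
    autocast_speedup: float, kernel_only_speedup: float
) -> float:
    """Fraction of autocast time that is not GEMM kernel time.

    Assumes both speedups share one BF16 baseline time T, so that
    t_autocast = T / autocast_speedup and t_kernel = T / kernel_only_speedup.
    Then (t_autocast - t_kernel) / t_autocast = 1 - autocast / kernel_only.
    """
    if autocast_speedup <= 0 or kernel_only_speedup <= 0:
        raise ValueError("speedups must be positive")
    return 1.0 - autocast_speedup / kernel_only_speedup


if __name__ == "__main__":
    # NVFP4 versus BF16 figures quoted above: 2.69x autocast, 4.01x kernel-only.
    fraction = quantization_overhead_fraction(2.69, 4.01)
    print(f"Quantization overhead share of autocast time: {fraction:.1%}")
```

With the figures above, roughly a third of the NVFP4 autocast time goes to quantization-related work under this assumption.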
Interpreting the results for a real model
This section walks through how to interpret these results for a real model. Using the same ESM2-15B config, we ran the full model config benchmark on NVIDIA B300. The per-shape NVFP4 versus MXFP8 speedups from the Fprop results, computed as MXFP8 time divided by NVFP4 time (times in ms), are as follows:
QKV proj: 1.922 / 1.064 = 1.81x
Attn out: 0.814 / 0.566 = 1.44x
MLP up: 2.699 / 1.415 = 1.91x
MLP down: 3.121 / 1.728 = 1.81x
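The per-shape ratios, and the blended Fprop speedup across all four projections, can be reproduced from the timings listed above. This is a small sketch rather than the tool's own reporting code: the variable names are illustrative, and the blended figure is taken to be total MXFP8 time divided by total NVFP4 time.

```python
# Fprop timings from the list above, as (MXFP8 time, NVFP4 time).
# Units are assumed to be milliseconds; the ratios do not depend on the unit.
fprop_times = {
    "QKV proj": (1.922, 1.064),
    "Attn out": (0.814, 0.566),
    "MLP up": (2.699, 1.415),
    "MLP down": (3.121, 1.728),
}

# Per-shape speedup of NVFP4 over MXFP8.
per_shape = {name: mxfp8 / nvfp4 for name, (mxfp8, nvfp4) in fprop_times.items()}

# Blended speedup: total time in one format over total time in the other,
# which weights each projection by how long it actually runs.
total_mxfp8 = sum(mxfp8 for mxfp8, _ in fprop_times.values())
total_nvfp4 = sum(nvfp4 for _, nvfp4 in fprop_times.values())
blended = total_mxfp8 / total_nvfp4

if __name__ == "__main__":
    for name, speedup in per_shape.items():
        print(f"{name:10s} {speedup:.2f}x")
    print(f"{'Blended':10s} {blended:.2f}x")
```

The blended value is a time-weighted ratio, not the mean of the four per-shape speedups, so the large MLP GEMMs dominate it.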
Take note of the following points:
- Every projection benefits at this scale. At ESM2-15B dimensions and M = 32,768 tokens, even the attention output GEMM (5120×5120)—the smallest weight matrix in the layer—gets a real 1.44x NVFP4-over-MXFP8 speedup. This is a notable contrast with smaller configs, where the attention output GEMM is often too small for lower precision to overcome the quantization overhead and barely moves. The largest GEMM, MLP Up (5120×20480), leads at 1.91x. The lesson holds in both directions: shape still dictates the gain, but a 15B model’s GEMMs are large enough that every projection clears the overhead.
- Big GEMMs: real but sub-theoretical gains. The FP4 tensor cores deliver 1.81x to 1.91x over MXFP8 on the three large GEMMs (QKV, MLP Up, MLP Down)—still short of the theoretical 2x to 3x from the hardware spec. Including the smaller attention-output GEMM, the blended Fprop NVFP4-over-MXFP8 speedup is 1.79x. After adding Wgrad times, non-GEMM overhead, and NVFP4-specific quantization costs, the end-to-end gap between NVFP4 and MXFP8 in training is consistent with these kernel-level numbers.
- FP8 DelayedScaling is the fastest FP8 recipe on Blackwell. At 23.76 ms/layer in autocast mode (1.74x over BF16), it outperforms both FP8 CurrentScaling (26.02 ms, 1.59x) and MXFP8 (26.78 ms, 1.54x). DelayedScaling’s amax-history approach avoids the per-step amax pass, lowering its quantization overhead. Comparing FP8 CurrentScaling’s autocast result (26.02 ms) against its prequantized result (21.86 ms) shows that roughly 16% of its autocast time is quantization overhead. (DelayedScaling has no prequantized variant because it relies on an amax history, so it is excluded from the prequantized comparison.)
- The prequantized results reveal the true kernel potential. Running with --pre-quantize removes quantization overhead entirely, and NVFP4 versus BF16 jumps from 2.69x (autocast) to 4.01x (kernel-only). This shows the FP4 tensor cores are delivering real speedups. It’s the quantization overhead in autocast mode that narrows the gap.
- The Fprop versus Dgrad comparison reveals that the 2x approximation is imprecise for quantized formats. While BF16 Dgrad is within ~1% of Fprop, quantized formats show 4–5% slower Dgrad sums. The QKV Proj Dgrad is especially asymmetric, at 15–22% slower than Fprop for FP8/FP4, because swapping K (5120) and N (15360) dramatically changes the matrix aspect ratio and kernel selection. This is exactly why the tool benchmarks Fprop and Dgrad separately rather than counting Fprop time twice.
Once you have the estimated GEMM-only speedup, compare it against your observed end-to-end training speedup:
- GEMM speedup ≈ training speedup: GEMMs dominate the step, and everything is working as expected.
- GEMM speedup >> training speedup: Overhead outside of GEMMs is eating the gains. For NVFP4 in particular, this overhead includes Random Hadamard transforms on Wgrad inputs, stochastic rounding on gradients, 2D block scaling for weights, and the extra memory pass for per-tensor amax computation. These are all additional ops that MXFP8 doesn’t need, and they can significantly narrow the gap even if the raw FP4 GEMMs are much faster.
- GEMM speedup ≈ 1.0 even in the microbenchmark: The FP4 kernels aren’t actually faster at these shapes, or they’re silently falling back to FP8.
The last case is especially worth checking. Set NVTE_LOG_LEVEL=1 or inspect with NVIDIA Nsight Systems to confirm that TE is actually dispatching FP4 kernels. TE can silently fall back to FP8 or BF16 for layers or ops that don’t support FP4 yet, which would explain identical performance with no other symptoms. You can also compare GPU memory usage between MXFP8 and NVFP4 runs. If memory is nearly identical, that’s a strong signal that FP4 weights aren’t actually being stored.
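The three cases above can be sketched as a small triage helper, together with an Amdahl's-law estimate of the end-to-end speedup to expect when only part of the step is GEMM time. Both function names and the 10% tolerance are hypothetical choices for illustration; neither comes from the benchmark tool or from TE.

```python
def expected_training_speedup(gemm_fraction: float, gemm_speedup: float) -> float:
    """Amdahl's-law estimate: only the GEMM share of the step gets faster."""
    if not 0.0 <= gemm_fraction <= 1.0:
        raise ValueError("gemm_fraction must be between 0 and 1")
    if gemm_speedup <= 0:
        raise ValueError("gemm_speedup must be positive")
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)


def triage(gemm_speedup: float, training_speedup: float, tol: float = 0.10) -> str:
    """Classify a run into the three cases. `tol` is an arbitrary tolerance."""
    if gemm_speedup < 1.0 + tol:
        # The microbenchmark itself shows no gain: check kernel dispatch.
        return "no GEMM speedup: check for slow shapes or silent fallback"
    if training_speedup >= gemm_speedup * (1.0 - tol):
        return "GEMM-bound: training speedup tracks GEMM speedup"
    return "overhead-bound: non-GEMM work is eating the gains"


if __name__ == "__main__":
    # Illustrative inputs only, not measured results.
    print(expected_training_speedup(gemm_fraction=0.5, gemm_speedup=2.0))
    print(triage(gemm_speedup=1.8, training_speedup=1.2))
```

If the observed training speedup falls well below the Amdahl estimate for your measured GEMM fraction, the shortfall points at quantization or other non-GEMM overhead rather than at the kernels.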
Get started benchmarking your model for low-precision training
Low-precision training speedups depend heavily on the actual GEMM shapes your model runs. Running in low precision does not automatically translate into end-to-end training gains, especially once quantization overhead, kernel selection, and non-GEMM operations are included. By turning a transformer config into concrete M×K×N workloads, you can benchmark BF16, MXFP8, and NVFP4 on the shapes that matter for your model before committing to a full training run.
Benchmark your GEMMs to see which precision is right for you. To get started, check out the benchmark script. For the full documentation and to understand how these shapes are derived, see the GEMM profiling tutorial in the Transformer Engine documentation.
Use this benchmark to:
- Set realistic training-speedup expectations from the autocast results
- Determine from the prequantized results whether you’re bottlenecked on kernels or on quantization
- Run candidate model configs through the tool before committing to a training run, which makes it a useful architecture co-design instrument