Overview

Master AI Token Economics With NVIDIA Full-Stack Inference

AI inference, the way we experience AI through chatbots, copilots, and creative tools, is scaling at a double-exponential pace. User adoption is accelerating while the number of AI tokens generated per interaction, driven by agentic workflows, long-thinking reasoning, and mixture-of-experts (MoE) models, soars in parallel.

To enable inference at this massive scale, NVIDIA delivers data-center-scale architecture on an annual rhythm. Our extreme hardware and software codesign delivers order-of-magnitude leaps in performance and the lowest token cost, making advanced AI experiences economically viable at scale.

NVIDIA GB300 NVL72 delivers 50x more tokens per watt and 35x lower token cost than Hopper™, maximizing revenue within the same power budget and driving higher profit margins. Continuous software optimizations extract maximum performance at chip, rack, and data center scale, further improving return on investment over time.

Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters

Cost per token is the metric that defines inference total cost of ownership (TCO), and NVIDIA Blackwell delivers the lowest token cost in the industry.

Leading Inference Providers Achieve Lowest Token Cost on NVIDIA Blackwell

Baseten, Deep Infra, Fireworks AI, and Together AI are reducing cost per token across industries with optimized inference stacks running on the NVIDIA Blackwell platform.

Inference Performance Drives Down Token Cost

DeepSeek-R1 results at 8K input and 1K output tokens (8K/1K) show a 15x performance benefit and revenue opportunity for NVIDIA Blackwell GB200 NVL72 over Hopper H200.

What Are the Factors That Lower Token Cost?

Cost per token is a ratio: the cost of a GPU-hour divided by the tokens that GPU delivers in that hour. Many enterprises evaluating AI infrastructure focus on the numerator: the cost per GPU per hour. For cloud deployments, this is the hourly rate paid to a cloud provider; for on-premises deployments, it’s the effective hourly cost derived from amortizing owned infrastructure. The real key to reducing token cost, however, lies in the denominator: maximizing the delivered token output.

That denominator carries two business implications.

1. Minimize token cost: Higher token output, carried through the cost equation, drives down cost per token, which grows the profit margin on every interaction served.

2. Maximize revenue: More tokens delivered per second also translate to more tokens per megawatt, which means more intelligence to use in AI-powered products and services and more revenue from the same infrastructure investment.
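
The relationship described above can be written as a small calculation. This is a minimal sketch, not an NVIDIA tool: the function name and the example inputs ($2.00 per GPU-hour, 1,000 tokens per second) are illustrative assumptions chosen only to show how the numerator and denominator interact.

```python
def cost_per_million_tokens(gpu_hour_cost: float, tokens_per_sec_per_gpu: float) -> float:
    """Hourly GPU cost divided by tokens delivered per hour, scaled to one million tokens."""
    if tokens_per_sec_per_gpu <= 0:
        raise ValueError("token throughput must be positive")
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hour_cost / tokens_per_hour * 1_000_000


# Hypothetical baseline, for illustration only.
baseline = cost_per_million_tokens(gpu_hour_cost=2.00, tokens_per_sec_per_gpu=1_000)

# Numerator lever: a GPU-hour that is 10% cheaper.
cheaper_gpu = cost_per_million_tokens(gpu_hour_cost=1.80, tokens_per_sec_per_gpu=1_000)

# Denominator lever: 10x the tokens on a GPU-hour that costs twice as much.
more_tokens = cost_per_million_tokens(gpu_hour_cost=4.00, tokens_per_sec_per_gpu=10_000)

print(f"baseline:    ${baseline:.3f} per million tokens")
print(f"cheaper GPU: ${cheaper_gpu:.3f} per million tokens")
print(f"more tokens: ${more_tokens:.3f} per million tokens")
```

In this sketch, the GPU-hour that costs twice as much still yields one-fifth of the baseline cost per token, because throughput grew tenfold.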

Cost per Token Is the Key Metric for AI Infrastructure TCO

Looking at compute cost alone, the NVIDIA Blackwell platform appears to cost roughly 2x more than NVIDIA Hopper™, but compute cost says nothing about the output that investment buys. An analysis of FLOPS per dollar alone suggests only a 2x advantage for NVIDIA Blackwell over the NVIDIA Hopper architecture.

However, the actual outcome differs by orders of magnitude: NVIDIA Blackwell delivers more than 50x greater token output per megawatt than Hopper, resulting in nearly 35x lower cost per million tokens.

| Metric | NVIDIA Hopper (HGX H200) | NVIDIA Blackwell (GB300 NVL72) | NVIDIA Blackwell Relative to Hopper |
| --- | --- | --- | --- |
| Cost per GPU per Hour ($) | $1.41 | $2.65 | 2x |
| FLOPS per Dollar (PFLOPS) | 2.8 | 5.6 | 2x |
| Tokens per Second per GPU | 90 | 6,000 | 65x |
| Tokens per Second per MW | 54K | 2.8M | 50x |
| Cost per Million Tokens ($) | $4.20 | $0.12 | 35x lower |
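
The table's bottom line follows from its first and third rows. The sketch below uses only the rounded figures published in the table and divides the throughput ratio by the hourly cost ratio to recover the cost-per-token advantage. Variable names are illustrative, and because the inputs are rounded, the results land near the published values rather than exactly on them.

```python
# Rounded figures copied from the table above.
hopper = {"gpu_hour_cost": 1.41, "tokens_per_sec_per_gpu": 90}
blackwell = {"gpu_hour_cost": 2.65, "tokens_per_sec_per_gpu": 6_000}

cost_ratio = blackwell["gpu_hour_cost"] / hopper["gpu_hour_cost"]
throughput_ratio = blackwell["tokens_per_sec_per_gpu"] / hopper["tokens_per_sec_per_gpu"]

# Cost per token scales with hourly cost and inversely with throughput.
cost_per_token_advantage = throughput_ratio / cost_ratio

print(f"hourly cost ratio:        {cost_ratio:.1f}x")
print(f"throughput ratio:         {throughput_ratio:.0f}x")
print(f"cost-per-token advantage: {cost_per_token_advantage:.0f}x lower")
```

With these inputs the hourly cost ratio is about 1.9x and the throughput ratio about 67x, which works out to roughly 35x lower cost per token, in line with the table's rounded 2x, 65x, and 35x.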

Benefits

Highest Performance Maximizes Revenue

With extreme hardware and software codesign, NVIDIA GB300 NVL72 delivers 50x more tokens per watt than Hopper, maximizing AI factory revenue within the same power budget. Continuous software optimizations extract maximum performance at chip, rack, and data center scale, further improving return on investment over time.

Lowest Token Cost Expands Profit Margins

The NVIDIA GB300 NVL72 system delivers 35x lower cost per token than the NVIDIA Hopper platform, driving higher profit margins for AI factories. With each generation, performance improvements far outpace infrastructure costs, creating better economics to enable advanced AI experiences at massive scale.

Full Stack Optimizes Every Model and Use Case

NVIDIA supports every model across generative AI, traditional ML, scientific computing, biology, and physical AI. From latency-sensitive real-time applications to high-throughput batch processing, NVIDIA delivers the best performance for every use case. The platform provides maximum flexibility and programmability to choose the optimal configuration for evolving workload and business requirements.

Native Integration Accelerates Deployment

NVIDIA’s production-ready software, including Dynamo and TensorRT™ LLM, and native integration with leading frameworks such as PyTorch, vLLM, SGLang, and llm-d, deliver the most robust AI inference stack. As model architectures and inference techniques rapidly evolve, NVIDIA’s stack ensures the fastest path from innovation to production.

Platform

Extreme Hardware–Software Codesign

Powerful hardware without smart orchestration wastes potential; great software without fast hardware means sluggish inference performance. NVIDIA’s inference platform delivers a continuously optimized full-stack solution with codesigned compute, networking, storage, and software to enable the highest performance across diverse workloads. 

Explore some of the key NVIDIA hardware and software innovations.

NVIDIA Vera Rubin NVL72

The NVIDIA Vera Rubin platform delivers 10x better performance per watt and 10x lower cost per token than Blackwell. Through extreme codesign, the platform pairs Rubin GPUs for massive context prefill with LPX for fast decode, eliminating the trade-off between speed and scale.

NVIDIA Grace Blackwell Ultra NVL72

GB300 NVL72 features 72 B300 GPUs connected by 130 TB/s of NVLink™ bandwidth, so they can communicate seamlessly with one another and run massive mixture-of-experts models at scale.

NVIDIA Dynamo

NVIDIA Dynamo is an open source distributed inference-serving framework for deploying models in multi-node environments at AI factory scale. It streamlines distributed serving by disaggregating inference, optimizing routing, and extending memory through data caching to cost-effective storage tiers.
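
To make the routing idea concrete, here is a toy sketch of one common form of routing optimization in LLM serving: send each request to the worker that already holds the longest matching prompt prefix, so less prefill work is repeated. This is not Dynamo's API. The `Worker` class, the `route` function, and the token lists are hypothetical, and a production router would also weigh load and latency.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical inference worker with a record of cached prompt prefixes."""
    name: str
    cached_prefixes: list[list[int]] = field(default_factory=list)


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt: list[int], workers: list[Worker]) -> Worker:
    """Pick the worker whose cache overlaps most with the prompt (ties go to the first)."""
    def best_overlap(w: Worker) -> int:
        return max((shared_prefix_len(prompt, p) for p in w.cached_prefixes), default=0)
    return max(workers, key=best_overlap)


# Illustrative token IDs only.
workers = [
    Worker("worker-a", cached_prefixes=[[1, 2, 3, 4]]),
    Worker("worker-b", cached_prefixes=[[1, 2, 9]]),
]
chosen = route([1, 2, 3, 7], workers)
print(chosen.name)
```

Here `worker-a` is chosen because three of the prompt's four tokens match its cached prefix, versus two for `worker-b`.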

TensorRT LLM

TensorRT LLM is an open source library for continuously optimized high-performance, real-time LLM inference on NVIDIA GPUs. With a modular Python runtime, PyTorch-native authoring, and a stable production API, it’s optimized to maximize throughput, minimize costs, and deliver fast user experiences.

Decoding the Performance Paretos

Ever wonder how complex AI trade-offs translate into real-world outcomes? Explore different points across the performance curves below to see firsthand how extreme hardware and software codesign make NVIDIA Blackwell Ultra the most performant, efficient, and profitable choice.

[Interactive chart: tokens per second per user (TPS/user) versus tokens per second per megawatt (TPS/MW), with a simulated chat experience]

DeepSeek R1 ISL = 32K, OSL = 8K, GB300 NVL72 with FP4 Dynamo disaggregation. H100 with FP8 in-flight batching. Projected performance subject to change.
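
A Pareto curve of this kind keeps only the configurations that no other configuration beats on both axes at once. The sketch below filters a list of (TPS/user, TPS/MW) points down to that frontier. The sample points are made-up placeholders, not measured or projected NVIDIA results, and the function name is an assumption.

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the points not dominated on both axes (higher is better for each)."""
    frontier = []
    best_y = float("-inf")
    # Sweep from highest x to lowest; a point survives only if it improves on y.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            frontier.append((x, y))
            best_y = y
    return sorted(frontier)


# Placeholder (tps_per_user, tps_per_mw) pairs, for illustration only.
configs = [(20, 900), (50, 700), (50, 400), (100, 300), (80, 250), (150, 100)]
frontier = pareto_frontier(configs)
print(frontier)
```

Each surviving point is a different trade-off between per-user interactivity and total throughput per megawatt; the dropped points are worse than some other configuration on both measures or tied on one and worse on the other.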

Customer Stories

How Industry Leaders Are Driving Innovation With AI Inference

Amdocs

Accelerate Generative AI Performance and Lower Costs

Read how Amdocs built amAIz, a domain-specific generative AI platform for telcos, using NVIDIA DGX™ Cloud and NVIDIA NIM inference microservices to improve latency, boost accuracy, and reduce costs.

Snapchat

Enhancing Apparel Shopping With AI

Learn how Snapchat enhanced the clothes shopping experience and emoji-aware optical character recognition using Triton Inference Server to scale, reduce costs, and accelerate time to production.

Amazon

Accelerate Customer Satisfaction

Discover how Amazon improved customer satisfaction by accelerating inference 5x with TensorRT.

Resources

The Latest in AI Inference Resources

Training for AI Infrastructure Professionals

Learn to deploy, run, and optimize AI infrastructure.

Learn About AI Factory Deployment

Whether your team is responsible for configuring switches and validating cabling, or installing cluster management software and orchestrating GPU workloads, this training provides the structured guidance to get it done right.

Intro to Inference: How to Run AI Models on a GPU

Learn how to set up and run AI inference on GPUs in Google Cloud. This pathway gets you started with the inference pipeline, model formats, and performance metrics through hands-on examples.

Extreme Codesign for Efficient Tokenomics and AI at Scale

As AI shifts to real-time reasoning, the key challenge is lowering cost per token (the cost of generating intelligence) while handling massive workloads from architectures such as mixture-of-experts (MoE) models. Achieving this requires tightly optimizing the entire stack, making end-to-end system design the most effective way to scale efficient, high-ROI AI.

Why Cost Per Token Is the Only Metric You Need for AI TCO

Today, AI data centers are token factories. Cost per token captures end-to-end performance across GPUs, CPUs, networking, software, and ecosystems—making it the key driver of real profitability and scalability in AI. NVIDIA delivers the lowest cost per token and highest performance per watt, maximizing AI factory revenue.

How DeepL Built an AI Infrastructure for Real-Time Language AI

DeepL uses NVIDIA GB200 NVL72 systems, together with NVIDIA TensorRT LLM and NVFP4 inference, to train and serve mixture-of-experts (MoE) models. The work advances its model architecture to improve efficiency during training and inference and sets new benchmarks for performance in AI.

FAQs About the Total Cost of Ownership (TCO) of the NVIDIA Inference Platform

GB300 NVL72 delivers AI inference at $0.123 per million tokens at 116 TPS/user interactivity using NVIDIA Dynamo and TensorRT™ LLM, the lowest cost per token among major platforms according to SemiAnalysis InferenceX benchmarks as of April 2026.

NVIDIA Blackwell Ultra (GB300 NVL72) delivers up to 50x higher throughput per megawatt and up to 35x lower cost per token than NVIDIA Hopper™ for low-latency agentic workloads, through hardware–software codesign, according to SemiAnalysis InferenceX benchmarks (Q1 2026). The GB300 NVL72 combines 72 Blackwell Ultra GPUs with 288 GB HBM3e per GPU in a single rack-scale system, all interconnected through NVIDIA NVLink™ Switch into a unified NVLink fabric delivering 130 TB/s of bandwidth. This architecture minimizes all-to-all communication latency, enabling large-scale Mixture-of-Experts (MoE) models like DeepSeek-R1 to scale expert parallelism efficiently across up to 72 GPUs simultaneously.

Looking only at compute pricing or FLOPS per dollar gives an incomplete view of inference TCO. The most important metric for AI inference TCO is cost per token, the price-performance actually delivered. On that measure, GB300 NVL72 delivers AI inference at $0.123 per million tokens at 116 TPS/user interactivity using NVIDIA Dynamo and TensorRT LLM, the lowest cost per token among major platforms according to SemiAnalysis InferenceX benchmarks as of April 2026.

When evaluating inference TCO, it’s important to look at large-scale Mixture-of-Experts (MoE) and reasoning models such as DeepSeek-R1. Nearly all of the latest closed and open source LLMs have adopted MoE and reasoning architectures, due to their superior intelligence and efficiency. By evaluating these models for inference TCO, you ensure your analysis is representative of what will likely be deployed.

NVIDIA's TensorRT LLM and Dynamo software stack delivers continuous inference cost improvements without hardware changes. On GPT-OSS-120B, NVIDIA Blackwell B200 cost per million tokens dropped from $0.11 at launch to $0.02 within two months, a roughly 5x improvement from software alone, according to SemiAnalysis InferenceX benchmarks as of April 2026. Each TensorRT LLM release typically delivers throughput gains through kernel fusion, quantization improvements, and scheduling optimizations.

Next Steps: Learn More About AI Inference TCO

Ready to Get Started?

Explore everything you need to start developing your AI application, including the latest documentation, tutorials, technical blogs, and more.

Find the Right Hardware for Your Inference Workloads

NVIDIA data center solutions are available through select NVIDIA Partner Network (NPN) partners. Explore flexible and affordable options for accessing the latest NVIDIA data center technologies through our network of partners.

Get the Latest on NVIDIA AI Inference

Sign up for the latest AI inference news, updates, and more from NVIDIA.
