Amazon SageMaker HyperPod features
Scale and accelerate generative AI model development across thousands of AI accelerators
Task governance
Flexible training plans
Optimized recipes to customize models
SageMaker HyperPod recipes help data scientists and developers of all skill levels get state-of-the-art performance while quickly getting started with training and fine-tuning publicly available generative AI models, including Llama, Mixtral, Mistral, and DeepSeek models. You can also customize Amazon Nova foundation models (FMs), including Nova Micro, Nova Lite, and Nova Pro, using a suite of techniques: Supervised Fine-Tuning (SFT), Knowledge Distillation, Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Continued Pre-Training. Both parameter-efficient and full-model training options are supported for SFT, Distillation, and DPO. Each recipe includes a training stack that has been tested by AWS, removing weeks of tedious work testing different model configurations. You can switch between GPU-based and AWS Trainium–based instances with a one-line recipe change, enable automated model checkpointing for improved training resiliency, and run workloads in production on SageMaker HyperPod.
High-performing distributed training
Advanced observability and experimentation tools
SageMaker HyperPod observability provides a unified dashboard preconfigured in Amazon Managed Grafana, with the monitoring data automatically published to an Amazon Managed Service for Prometheus workspace. You can see real-time performance metrics, resource utilization, and cluster health in a single view, allowing teams to quickly spot bottlenecks, prevent costly delays, and optimize compute resources. SageMaker HyperPod is also integrated with Amazon CloudWatch Container Insights, providing deeper insights into cluster performance, health, and utilization. Managed TensorBoard in SageMaker helps you save development time by visualizing the model architecture to identify and remediate convergence issues. Managed MLflow in SageMaker helps you efficiently manage experiments at scale.

Workload scheduling and orchestration
Automatic cluster health check and repair
Accelerate open-weights model deployments from SageMaker JumpStart
SageMaker HyperPod streamlines the deployment of open-weights FMs from SageMaker JumpStart and fine-tuned models from Amazon S3 and Amazon FSx. It automatically provisions the required infrastructure and configures endpoints, eliminating manual provisioning. With SageMaker HyperPod task governance, endpoint traffic is continuously monitored and compute resources are dynamically adjusted, while performance metrics are published to the observability dashboard for real-time monitoring and optimization.
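To make the idea of traffic-driven scaling concrete, here is a minimal sketch of a proportional scaling rule, the same shape as the one used by the Kubernetes Horizontal Pod Autoscaler. It is an illustration only: the page does not describe the algorithm SageMaker HyperPod uses, and the function name, parameters, and limits below are all hypothetical.

```python
import math


def desired_replicas(current_replicas, observed_per_replica, target_per_replica,
                     min_replicas=1, max_replicas=8):
    """Hypothetical proportional scaling rule (not a HyperPod API).

    Scales the replica count by the ratio of observed load to target load,
    rounds up, and clamps the result to [min_replicas, max_replicas].
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")

    # Proportional rule: more load per replica than the target means scale out.
    raw = math.ceil(current_replicas * observed_per_replica / target_per_replica)

    # Clamp so the endpoint never scales to zero or beyond its capacity limit.
    return max(min_replicas, min(max_replicas, raw))
```

For example, with 2 replicas each serving 90 requests per second against a target of 60, the rule asks for 3 replicas. A sharp traffic spike is capped at `max_replicas`, and a quiet period never drops below `min_replicas`.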

Managed tiered checkpointing
SageMaker HyperPod managed tiered checkpointing uses CPU memory to store frequent checkpoints for rapid recovery, while periodically persisting data to Amazon Simple Storage Service (Amazon S3) for long-term durability. This hybrid approach minimizes lost training progress and significantly reduces the time to resume training after a failure. Customers can configure checkpoint frequency and retention policies across both in-memory and persistent storage tiers. By storing frequent checkpoints in memory and writing to persistent storage less often, customers can recover quickly while minimizing storage costs. Because the feature is integrated with PyTorch's Distributed Checkpoint (DCP), customers can implement checkpointing with only a few lines of code while gaining the performance benefits of in-memory storage.
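The two-tier pattern described above can be sketched in plain Python. This is a toy model under stated assumptions, not the HyperPod or PyTorch DCP interface: the class `TieredCheckpointer`, its parameters, and the `demo` function are hypothetical, a local directory stands in for Amazon S3, and the state must be JSON-serializable.

```python
import copy
import json
import os
import tempfile
from collections import deque


class TieredCheckpointer:
    """Hypothetical two-tier checkpointer: frequent in-memory saves,
    periodic saves to a durable directory (a stand-in for Amazon S3)."""

    def __init__(self, durable_dir, memory_keep=2, persist_every=4):
        self.durable_dir = durable_dir
        self.persist_every = persist_every
        # Fast tier: retention policy keeps only the newest checkpoints in RAM.
        self.memory = deque(maxlen=memory_keep)

    def save(self, step, state):
        snapshot = copy.deepcopy(state)  # isolate from later mutation
        self.memory.append((step, snapshot))
        # Slow tier: persist only every `persist_every` steps.
        if step % self.persist_every == 0:
            self._persist(step, snapshot)

    def _persist(self, step, snapshot):
        path = os.path.join(self.durable_dir, f"step_{step:08d}.json")
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp, path)  # atomic rename avoids half-written checkpoints

    def load_latest(self, memory_available=True):
        """Return (step, state) from the fastest tier that still has data."""
        if memory_available and self.memory:
            return self.memory[-1]
        # Zero-padded names sort in step order.
        files = sorted(f for f in os.listdir(self.durable_dir)
                       if f.endswith(".json"))
        if not files:
            return None
        newest = files[-1]
        step = int(newest[len("step_"):-len(".json")])
        with open(os.path.join(self.durable_dir, newest)) as f:
            return step, json.load(f)


def demo():
    """Save 10 steps, then recover with and without the memory tier."""
    with tempfile.TemporaryDirectory() as durable_dir:
        ckpt = TieredCheckpointer(durable_dir, memory_keep=2, persist_every=4)
        for step in range(1, 11):
            ckpt.save(step, {"step": step, "weights": [step, step * 2]})
        fast = ckpt.load_latest()                        # memory tier intact
        slow = ckpt.load_latest(memory_available=False)  # memory tier lost
        return fast, slow
```

In `demo`, recovery from memory resumes at step 10, while recovery from the durable tier alone falls back to step 8, the last persisted checkpoint. That gap is the trade-off the checkpoint frequency and retention settings control: persisting more often narrows it at the cost of more writes to durable storage.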