AWS News Blog
Machine Learning

Introducing checkpointless and elastic training on Amazon SageMaker HyperPod

2025-12-03
5 min read
by Channy Yun (윤석찬)

Endigest AI Core Summary

Amazon SageMaker HyperPod introduces two new AI model training features: checkpointless training and elastic training.

  • Checkpointless training eliminates checkpoint-restart cycles via peer-to-peer state recovery, reducing fault recovery time by over 80% compared to traditional methods
  • It is built on four components, all orchestrated by the HyperPod training operator: collective communications initialization, memory-mapped data loading, in-process recovery, and peer-to-peer state replication
  • Amazon Nova models were trained using this technology on tens of thousands of accelerators
  • Elastic training automatically scales training jobs up or down by adding or removing data-parallel replicas based on cluster resource availability
  • Scaling is orchestrated by the HyperPod training operator integrated with Kubernetes, which monitors pod lifecycle events, node availability, and priority signals from the resource scheduler
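The peer-to-peer state recovery idea behind checkpointless training can be illustrated with a minimal sketch: each rank mirrors its training state to a ring peer after every step, so a failed rank is rebuilt from its peer's in-memory replica instead of reloading a checkpoint from storage. All class and function names here are illustrative, not part of the HyperPod API.

```python
import copy

class Rank:
    """Hypothetical model of one training process (names illustrative)."""
    def __init__(self, rank_id, world_size):
        self.rank_id = rank_id
        # ring-style peer assignment: each rank's state is mirrored on the next rank
        self.peer_id = (rank_id + 1) % world_size
        self.state = {"step": 0}
        self.replica_of_prev = None  # in-memory copy of the previous rank's state

def train_step(ranks):
    """One synchronous step, followed by peer-to-peer state replication."""
    for r in ranks:
        r.state["step"] += 1
    for r in ranks:
        # in reality this is a network transfer between accelerators
        ranks[r.peer_id].replica_of_prev = copy.deepcopy(r.state)

def recover(ranks, failed_id):
    """Rebuild a failed rank from its peer's replica, with no checkpoint read."""
    holder = ranks[ranks[failed_id].peer_id]
    ranks[failed_id].state = copy.deepcopy(holder.replica_of_prev)

ranks = [Rank(i, world_size=4) for i in range(4)]
for _ in range(3):
    train_step(ranks)

ranks[2].state = None          # simulate a hardware fault on rank 2
recover(ranks, failed_id=2)
print(ranks[2].state["step"])  # recovered at the latest step: 3
```

Because the replica already lives in a healthy peer's memory, recovery cost is a single state copy rather than a full checkpoint restore, which is the intuition behind the >80% reduction in fault recovery time cited above.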
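For elastic training, one way to picture the scale-up/scale-down behavior is rebalancing a fixed global batch across however many data-parallel replicas the cluster can currently run. The helper below is a hypothetical sketch of that bookkeeping, not HyperPod's actual scheduler logic.

```python
def rebalance(global_batch: int, replicas: int) -> list[int]:
    """Split a fixed global batch evenly across the current data-parallel
    replicas, distributing any remainder one sample at a time."""
    base, rem = divmod(global_batch, replicas)
    return [base + (1 if i < rem else 0) for i in range(replicas)]

# A node is reclaimed by a higher-priority job: the operator shrinks the
# job from 8 to 6 replicas and training continues without a restart.
print(rebalance(1024, 8))  # [128, 128, 128, 128, 128, 128, 128, 128]
print(rebalance(1024, 6))  # [171, 171, 171, 171, 170, 170]
```

Keeping the global batch constant as replicas come and go is one common convention for preserving training dynamics across resizes; real elastic systems may instead hold per-replica batch size fixed and rescale the learning rate.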
Tags:
#Amazon SageMaker HyperPod
#Artificial Intelligence
#AWS re:Invent
#Launch
#News