Introducing checkpointless and elastic training on Amazon SageMaker HyperPod
2025-12-03
5 min read
by Channy Yun (윤석찬)
Endigest AI Core Summary
Amazon SageMaker HyperPod introduces two new AI model training features: checkpointless training and elastic training.
- Checkpointless training eliminates checkpoint-restart cycles via peer-to-peer state recovery, reducing fault recovery time by more than 80% compared to traditional checkpoint-based methods
- It is built on four components: collective communications initialization, memory-mapped data loading, in-process recovery, and peer-to-peer state replication, orchestrated by the HyperPod training operator
- Amazon Nova models were trained with this technology on tens of thousands of accelerators
- Elastic training automatically scales training jobs up or down by adding or removing data-parallel replicas based on cluster resource availability
- Scaling is orchestrated by the HyperPod training operator integrated with Kubernetes, which monitors pod lifecycle events, node availability, and resource-scheduler priority signals
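The core idea behind checkpointless recovery can be illustrated with a minimal sketch: because training state is replicated across data-parallel peers, a failed worker can restore its state from a healthy peer's memory instead of reloading a disk checkpoint. All class and function names below are illustrative assumptions, not the HyperPod API.

```python
# Hypothetical sketch of peer-to-peer state recovery. Each data-parallel
# replica holds an identical copy of the training state, so a failed
# worker copies a live peer's in-memory state (no checkpoint reload).
import copy

class Replica:
    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0, "weights": [0.0] * 4}
        self.healthy = True

    def train_step(self):
        # Stand-in for one optimizer step on this replica.
        self.state["step"] += 1
        self.state["weights"] = [w + 0.1 for w in self.state["weights"]]

def recover_from_peer(failed, peers):
    # Pick any healthy peer and deep-copy its state: this replaces the
    # traditional "restart from last checkpoint" path.
    donor = next(p for p in peers if p.healthy and p is not failed)
    failed.state = copy.deepcopy(donor.state)
    failed.healthy = True
    return donor.rank

replicas = [Replica(r) for r in range(4)]
for _ in range(10):
    for r in replicas:
        if r.healthy:
            r.train_step()

replicas[2].healthy = False   # simulate a hardware fault on rank 2
replicas[2].state = {}        # its local state is lost
recover_from_peer(replicas[2], replicas)
print(replicas[2].state["step"])  # resumes at the peers' current step, 10
```

In a real system the copied state would include model weights, optimizer state, and data-loader position, and the transfer would ride on the collective-communications layer rather than a local deep copy.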
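The elastic-scaling behavior can be sketched as a simple sizing decision: given the nodes currently available, pick a data-parallel replica count within configured bounds. The function name and parameters below are assumptions for illustration, not the operator's actual configuration schema.

```python
# Hypothetical sketch of the elastic-scaling decision: fit as many
# data-parallel replicas as available nodes allow, clamped to a
# configured [min, max] range so the job neither starves nor over-claims.
def target_replicas(available_nodes, nodes_per_replica, min_replicas, max_replicas):
    fit = available_nodes // nodes_per_replica
    return max(min_replicas, min(fit, max_replicas))

# Cluster events: capacity frees up, then a higher-priority job reclaims nodes.
for nodes in (8, 16, 6):
    print(target_replicas(nodes, nodes_per_replica=2,
                          min_replicas=2, max_replicas=6))
# scales 4 -> 6 -> 3
```

In practice the operator would react to pod lifecycle events and scheduler priority signals to trigger this recalculation, then add or remove replicas without restarting the whole job.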
Tags:
#Amazon SageMaker HyperPod
#Artificial Intelligence
#AWS re:Invent
#Launch
#News
