PRX Part 3 — Training a Text-to-Image Model in 24h! | Endigest
Hugging Face | Machine Learning
This post describes a 24-hour speedrun for training a text-to-image diffusion model using 32 H200 GPUs and a ~$1500 compute budget.
- X-prediction formulation enables training directly in pixel space with patch size 32 and a 256-dim bottleneck, eliminating the need for a VAE
- LPIPS- and DINOv2-based perceptual losses are added on top of the flow matching objective to speed convergence and improve visual quality
- TREAD token routing sends 50% of tokens around a contiguous chunk of transformer blocks, so those blocks process only half the tokens and per-step compute drops
- REPA representation alignment uses DINOv3 as a teacher at the 8th transformer block with loss weight 0.5, applied only to non-routed tokens
- The Muon optimizer handles 2D parameters while Adam handles non-2D parameters, a split that shows clear improvement over Adam-only training
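The VAE-free setup in the first bullet can be sketched as plain pixel patchification followed by a linear bottleneck. This is a minimal illustration, not the post's actual code; the helper names are hypothetical, and only the numbers stated above (patch size 32, 256-dim bottleneck) are taken from the summary.

```python
import torch
import torch.nn as nn

def patchify(img, patch=32):
    """Split an image into non-overlapping pixel patches (hypothetical helper).

    (B, C, H, W) -> (B, N, C*patch*patch), where N = (H/patch)*(W/patch).
    Operating on raw pixels like this is what removes the need for a VAE.
    """
    B, C, H, W = img.shape
    p = patch
    x = img.unfold(2, p, p).unfold(3, p, p)      # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // p) * (W // p), C * p * p)
    return x

# 256-dim bottleneck: project each 32x32x3 = 3072-dim pixel patch down to 256
bottleneck = nn.Linear(3 * 32 * 32, 256)

img = torch.randn(2, 3, 256, 256)          # batch of 256x256 RGB images
tokens = bottleneck(patchify(img))         # (2, 64, 256): an 8x8 grid of patch tokens
```

With x-prediction, the transformer operating on these tokens regresses the clean image directly rather than noise or velocity.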
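The second bullet's loss combines a flow matching objective with perceptual terms. A hedged sketch of that combination follows; the real LPIPS and DINOv2 terms require frozen pretrained networks, so `feat_fn` is a stand-in for any such feature extractor, and the weight `w_perc` is an illustrative assumption (the summary does not state the weights).

```python
import torch
import torch.nn.functional as F

def combined_loss(x_pred, x_clean, feat_fn, w_perc=1.0):
    """Flow-matching x-prediction loss plus a perceptual term (sketch only).

    `feat_fn` stands in for a frozen perceptual network (LPIPS / DINOv2
    in the post); here it can be any callable mapping images to features.
    `w_perc` is an assumed weight, not a value from the post.
    """
    # x-prediction objective: regress the clean image directly
    fm = F.mse_loss(x_pred, x_clean)
    # perceptual term: match extracted features of prediction and target
    perc = F.mse_loss(feat_fn(x_pred), feat_fn(x_clean))
    return fm + w_perc * perc
```

Because the model predicts the clean image directly, perceptual losses can be applied to its output without an extra decoding step.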
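The TREAD routing in the third bullet can be sketched as follows: a random half of the tokens is processed by a contiguous chunk of blocks while the other half bypasses the chunk unchanged. This is a toy illustration under stated assumptions; simple residual MLPs stand in for the real transformer blocks, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class TreadChunk(nn.Module):
    """Sketch of TREAD-style routing around a contiguous chunk of blocks.

    Only a `keep_ratio` fraction of tokens (0.5 in the post) is processed
    by the chunk; the rest bypass it unchanged, cutting per-step compute.
    Residual MLPs stand in for the real transformer blocks.
    """
    def __init__(self, dim=256, n_blocks=4, keep_ratio=0.5):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_blocks)
        )
        self.keep_ratio = keep_ratio

    def forward(self, x):
        B, N, D = x.shape
        n_keep = int(N * self.keep_ratio)
        order = torch.rand(B, N, device=x.device).argsort(dim=1)
        keep = order[:, :n_keep]                    # tokens that enter the chunk
        idx = keep.unsqueeze(-1).expand(-1, -1, D)
        kept = torch.gather(x, 1, idx)
        for blk in self.blocks:
            kept = kept + blk(kept)                 # residual block stand-in
        out = x.clone()                             # bypassed tokens pass through unchanged
        out = out.scatter(1, idx, kept)
        return out, keep
```

Returning the `keep` indices lets downstream losses (such as the REPA term below in the post's setup) distinguish routed from non-routed tokens.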
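The REPA term in the fourth bullet aligns intermediate activations with a frozen teacher. A common formulation is a negative cosine similarity through a learned projection head; the sketch below assumes that formulation (the summary only states teacher, block, weight, and the non-routed restriction), and all function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def repa_loss(hidden, teacher_feats, proj, routed_mask, weight=0.5):
    """REPA-style alignment loss (sketch; cosine form is an assumption).

    hidden:        activations at the alignment block (block 8 in the post)
    teacher_feats: frozen teacher features (DINOv3 in the post)
    proj:          learned head mapping model width to teacher width
    routed_mask:   True where a token bypassed the blocks; the post applies
                   the loss only to the non-routed tokens
    """
    keep = ~routed_mask                              # non-routed tokens only
    cos = F.cosine_similarity(proj(hidden), teacher_feats, dim=-1)
    return -weight * (cos * keep).sum() / keep.sum().clamp(min=1)
```

Restricting the loss to non-routed tokens matters because routed tokens never pass through the aligned block, so they carry no representation for the teacher to supervise there.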
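The optimizer split in the last bullet reduces to partitioning parameters by tensor rank. Muon itself is not in core PyTorch, so in this sketch Adam stands in for both groups and the learning rates are illustrative assumptions, not values from the post.

```python
import torch
import torch.nn as nn

def split_param_groups(model):
    """Partition parameters as the post describes: 2-D weight matrices
    for Muon, everything else (biases, norm scales, etc.) for Adam."""
    two_d = [p for p in model.parameters() if p.ndim == 2]
    other = [p for p in model.parameters() if p.ndim != 2]
    return two_d, other

model = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32), nn.Linear(32, 8))
muon_group, adam_group = split_param_groups(model)

# Muon is not in torch core; Adam is a placeholder here, and both
# learning rates are assumed for illustration only.
opt_2d = torch.optim.Adam(muon_group, lr=0.02)
opt_rest = torch.optim.Adam(adam_group, lr=3e-4)
```

At each step both optimizers are stepped together; the split exists because Muon's orthogonalized update is defined for matrices, while vectors and scalars are better served by Adam.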
This summary was automatically generated by AI based on the original article and may not be fully accurate.