PRX Part 3 — Training a Text-to-Image Model in 24h! | Endigest
Hugging Face | Machine Learning
This post describes a 24-hour speedrun for training a text-to-image diffusion model using 32 H200 GPUs and a ~$1500 compute budget.
- X-prediction formulation enables training directly in pixel space with patch size 32 and a 256-dim bottleneck, eliminating the need for a VAE
- LPIPS- and DINOv2-based perceptual losses are added on top of the flow matching objective to speed convergence and improve visual quality
- TREAD token routing sends 50% of tokens around a contiguous chunk of transformer blocks, so those blocks process only half the tokens and per-step compute drops
- REPA representation alignment uses DINOv3 as a teacher at the 8th transformer block with loss weight 0.5, applied only to non-routed tokens
- The Muon optimizer handles 2D parameters while Adam handles non-2D parameters, a split that shows clear improvement over Adam-only training
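The VAE-free setup in the first bullet can be sketched as plain pixel patchification followed by a linear bottleneck. This is a minimal illustration, not the post's actual code; the helper names are hypothetical, and only the numbers stated above (patch size 32, 256-dim bottleneck) are taken from the summary.

```python
import torch
import torch.nn as nn

def patchify(img, patch=32):
    """Split an image into non-overlapping pixel patches (hypothetical helper).

    (B, C, H, W) -> (B, N, C*patch*patch), where N = (H/patch)*(W/patch).
    Operating on raw pixels like this is what removes the need for a VAE.
    """
    B, C, H, W = img.shape
    p = patch
    x = img.unfold(2, p, p).unfold(3, p, p)      # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // p) * (W // p), C * p * p)
    return x

# 256-dim bottleneck: project each 32x32x3 = 3072-dim pixel patch down to 256
bottleneck = nn.Linear(3 * 32 * 32, 256)

img = torch.randn(2, 3, 256, 256)          # batch of 256x256 RGB images
tokens = bottleneck(patchify(img))         # (2, 64, 256): an 8x8 grid of patch tokens
```

With x-prediction, the transformer operating on these tokens regresses the clean image directly rather than noise or velocity.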
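The second bullet's loss combines a flow matching objective with perceptual terms. A hedged sketch of that combination follows; the real LPIPS and DINOv2 terms require frozen pretrained networks, so `feat_fn` is a stand-in for any such feature extractor, and the weight `w_perc` is an illustrative assumption (the summary does not state the weights).

```python
import torch
import torch.nn.functional as F

def combined_loss(x_pred, x_clean, feat_fn, w_perc=1.0):
    """Flow-matching x-prediction loss plus a perceptual term (sketch only).

    `feat_fn` stands in for a frozen perceptual network (LPIPS / DINOv2
    in the post); here it can be any callable mapping images to features.
    `w_perc` is an assumed weight, not a value from the post.
    """
    # x-prediction objective: regress the clean image directly
    fm = F.mse_loss(x_pred, x_clean)
    # perceptual term: match extracted features of prediction and target
    perc = F.mse_loss(feat_fn(x_pred), feat_fn(x_clean))
    return fm + w_perc * perc
```

Because the model predicts the clean image directly, perceptual losses can be applied to its output without an extra decoding step.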
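The TREAD routing in the third bullet can be sketched as follows: a random half of the tokens is processed by a contiguous chunk of blocks while the other half bypasses the chunk unchanged. This is a toy illustration under stated assumptions; simple residual MLPs stand in for the real transformer blocks, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class TreadChunk(nn.Module):
    """Sketch of TREAD-style routing around a contiguous chunk of blocks.

    Only a `keep_ratio` fraction of tokens (0.5 in the post) is processed
    by the chunk; the rest bypass it unchanged, cutting per-step compute.
    Residual MLPs stand in for the real transformer blocks.
    """
    def __init__(self, dim=256, n_blocks=4, keep_ratio=0.5):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_blocks)
        )
        self.keep_ratio = keep_ratio

    def forward(self, x):
        B, N, D = x.shape
        n_keep = int(N * self.keep_ratio)
        order = torch.rand(B, N, device=x.device).argsort(dim=1)
        keep = order[:, :n_keep]                    # tokens that enter the chunk
        idx = keep.unsqueeze(-1).expand(-1, -1, D)
        kept = torch.gather(x, 1, idx)
        for blk in self.blocks:
            kept = kept + blk(kept)                 # residual block stand-in
        out = x.clone()                             # bypassed tokens pass through unchanged
        out = out.scatter(1, idx, kept)
        return out, keep
```

Returning the `keep` indices lets downstream losses (such as the REPA term below in the post's setup) distinguish routed from non-routed tokens.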
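The REPA term in the fourth bullet aligns intermediate activations with a frozen teacher. A common formulation is a negative cosine similarity through a learned projection head; the sketch below assumes that formulation (the summary only states teacher, block, weight, and the non-routed restriction), and all function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def repa_loss(hidden, teacher_feats, proj, routed_mask, weight=0.5):
    """REPA-style alignment loss (sketch; cosine form is an assumption).

    hidden:        activations at the alignment block (block 8 in the post)
    teacher_feats: frozen teacher features (DINOv3 in the post)
    proj:          learned head mapping model width to teacher width
    routed_mask:   True where a token bypassed the blocks; the post applies
                   the loss only to the non-routed tokens
    """
    keep = ~routed_mask                              # non-routed tokens only
    cos = F.cosine_similarity(proj(hidden), teacher_feats, dim=-1)
    return -weight * (cos * keep).sum() / keep.sum().clamp(min=1)
```

Restricting the loss to non-routed tokens matters because routed tokens never pass through the aligned block, so they carry no representation for the teacher to supervise there.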
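The optimizer split in the last bullet reduces to partitioning parameters by tensor rank. Muon itself is not in core PyTorch, so in this sketch Adam stands in for both groups and the learning rates are illustrative assumptions, not values from the post.

```python
import torch
import torch.nn as nn

def split_param_groups(model):
    """Partition parameters as the post describes: 2-D weight matrices
    for Muon, everything else (biases, norm scales, etc.) for Adam."""
    two_d = [p for p in model.parameters() if p.ndim == 2]
    other = [p for p in model.parameters() if p.ndim != 2]
    return two_d, other

model = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32), nn.Linear(32, 8))
muon_group, adam_group = split_param_groups(model)

# Muon is not in torch core; Adam is a placeholder here, and both
# learning rates are assumed for illustration only.
opt_2d = torch.optim.Adam(muon_group, lr=0.02)
opt_rest = torch.optim.Adam(adam_group, lr=3e-4)
```

At each step both optimizers are stepped together; the split exists because Muon's orthogonalized update is defined for matrices, while vectors and scalars are better served by Adam.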
This summary was automatically generated by AI based on the original article and may not be fully accurate.