Receive daily AI-curated summaries of engineering articles from top tech companies worldwide.
Endigest AI Core Summary
TorchTPU is Google's native PyTorch integration for TPUs, enabling high-performance ML workloads on custom ASIC hardware with minimal code changes.
• Implements three eager execution modes: Debug Eager (synchronous, per-op), Strict Eager (asynchronous), and Fused Eager (automated op fusion with 50–100%+ performance gains over Strict Eager).
• Uses torch.compile with XLA as the backend compiler, translating PyTorch operators to StableHLO IR to generate optimized TPU binaries.
• Supports DDP, FSDPv2, and DTensor natively, and handles MPMD (divergent per-rank execution), unlike its predecessor PyTorch/XLA, which required pure SPMD.
• Custom kernels written in Pallas or JAX can be integrated via the @torch_tpu.pallas.custom_jax_kernel decorator.
• The 2026 roadmap includes bounded dynamism for dynamic shapes, precompiled kernel libraries, Helion DSL support, and integrations with vLLM and TorchTitan.
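To make the Fused Eager idea above concrete, here is a toy, pure-Python sketch of why fusing elementwise ops pays off. This is not the TorchTPU API; the function names are hypothetical illustrations of the general technique: an unfused pipeline materializes an intermediate buffer between ops, while a fused version does the same work in a single pass.

```python
# Hypothetical illustration of op fusion (the idea behind Fused Eager).
# Not TorchTPU code; all names below are made up for this sketch.

def scale(xs, a):
    return [a * x for x in xs]      # first pass over memory

def shift(xs, b):
    return [x + b for x in xs]      # second pass over memory

def unfused(xs, a, b):
    # Two separate ops: the intermediate list from scale() is materialized.
    return shift(scale(xs, a), b)

def fused(xs, a, b):
    # One fused op: a single pass, no intermediate buffer.
    return [a * x + b for x in xs]

data = [1.0, 2.0, 3.0]
assert unfused(data, 2.0, 1.0) == fused(data, 2.0, 1.0)  # → [3.0, 5.0, 7.0]
```

On an accelerator the win comes from avoiding the round trip to memory for the intermediate result, which is why the article reports substantial gains for Fused Eager over the unfused Strict Eager mode.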
This summary was automatically generated by AI based on the original article and may not be fully accurate.