Google Developers Blog | Machine Learning

A Developer's Guide to Debugging JAX on Cloud TPUs: Essential Tools and Techniques

2026-01-05

Endigest AI Core Summary

This post is a practical guide to debugging JAX workloads on Cloud TPUs. It covers the essential tools, how they relate to one another, and how to use them across the workers of a distributed TPU environment.

  • libtpu (containing the XLA compiler and TPU driver) and JAX/jaxlib are the two core components that nearly all debugging tools depend on
  • Verbose logging is enabled with environment flags (TPU_VMODULE, TPU_MIN_LOG_LEVEL, TF_CPP_MIN_LOG_LEVEL), which can be set on every TPU worker at once by launching the job through gcloud compute tpus tpu-vm ssh with --worker=all
  • libtpu logs are written automatically to /tmp/tpu_logs/tpu_driver.INFO on each TPU VM and can be collected from all workers with a bash script built on gcloud compute tpus tpu-vm scp
  • The TPU Monitoring Library (bundled with jax[tpu]) provides programmatic access to hardware metrics like duty_cycle_pct via the tpumonitoring API
  • tpu-info is a CLI tool similar to nvidia-smi that displays real-time TPU chip memory usage and duty cycle metrics
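
The logging and log-retrieval points above can be sketched as a small Python helper that only builds the gcloud command lines and does not run them. This is a rough illustration, not the post's own script. The TPU name, zone, training script, and TPU_VMODULE value used in the usage note are hypothetical placeholders, and setting the log-level flags to 0 (most verbose) is an assumption.

```python
import shlex

# Log path named in the summary above.
LOG_PATH = "/tmp/tpu_logs/tpu_driver.INFO"


def verbose_env(vmodule):
    """Environment flags for verbose libtpu logging.

    The value 0 for the log-level flags is an assumption (most verbose).
    """
    return {
        "TPU_VMODULE": vmodule,
        "TPU_MIN_LOG_LEVEL": "0",
        "TF_CPP_MIN_LOG_LEVEL": "0",
    }


def ssh_all_workers(tpu_name, zone, script, vmodule):
    """Build a gcloud command that runs `script` on every TPU worker."""
    env = " ".join(
        f"{key}={shlex.quote(value)}"
        for key, value in verbose_env(vmodule).items()
    )
    return [
        "gcloud", "compute", "tpus", "tpu-vm", "ssh", tpu_name,
        f"--zone={zone}",
        "--worker=all",
        f"--command={env} python3 {shlex.quote(script)}",
    ]


def scp_logs(tpu_name, zone, num_workers, dest_dir="."):
    """Build one gcloud scp command per worker so the logs do not overwrite each other."""
    return [
        [
            "gcloud", "compute", "tpus", "tpu-vm", "scp",
            f"{tpu_name}:{LOG_PATH}",
            f"{dest_dir}/tpu_driver.worker{worker}.INFO",
            f"--zone={zone}",
            f"--worker={worker}",
        ]
        for worker in range(num_workers)
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) on a machine with gcloud configured, for example ssh_all_workers("my-tpu", "us-central2-b", "train.py", "my_module=1").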
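
The two monitoring points above might look like the following in practice. This is a hedged sketch: the import path libtpu.sdk.tpumonitoring and the get_metric, description, and data calls follow the Cloud TPU documentation as I understand it and are not spelled out in this summary. The idea that data() yields one numeric string per chip is also an assumption, and mean_pct is a hypothetical helper added for illustration.

```python
import shutil
import subprocess


def read_tpu_metric(name="duty_cycle_pct"):
    """Return (description, data) for a metric, or None if it cannot be read.

    Any failure (library missing, no TPU attached) is reported as None.
    """
    try:
        # Assumed import path; the library ships with jax[tpu].
        from libtpu.sdk import tpumonitoring

        metric = tpumonitoring.get_metric(name)
        return metric.description(), metric.data()
    except Exception:
        return None


def mean_pct(values):
    """Average a list of numeric strings (assumed shape of metric data)."""
    numbers = [float(v) for v in values]
    return sum(numbers) / len(numbers) if numbers else 0.0


def tpu_info_snapshot(timeout_s=30):
    """Run the tpu-info CLI once and return its text output, or None."""
    exe = shutil.which("tpu-info")
    if exe is None:
        return None  # CLI not installed on this machine
    try:
        done = subprocess.run(
            [exe], capture_output=True, text=True, check=False, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout
```

On a TPU VM, read_tpu_metric() would be called from inside or alongside the running JAX job, while tpu_info_snapshot() gives the same kind of at-a-glance view one would get from nvidia-smi on a GPU host.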