Receive daily AI-curated summaries of engineering articles from top tech companies worldwide.
Endigest AI Core Summary
Pinterest engineers debugged why Ray-based ML training jobs were crashing with intermittent network connectivity issues on Kubernetes clusters backed by AWS EC2.
•Ray generates high-volume inter-pod gRPC traffic across control and data planes, making network stability critical for distributed ML training at scale
•ENA network driver resets were triggered by CPU starvation when TX threads paused for over 5 seconds, causing automatic device resets and packet loss
•Mitigation attempts including TransparentHugePages, jemalloc memory allocator, CPU affinity with taskset, and interrupt pinning showed minimal improvement
•Network resets correlated with high system CPU usage and page faults, with temporary relief from machine reboots but issues returning after approximately one week
•
Metrics revealed the problem was specific to a single AWS Availability Zone, while identical jobs ran without issues in other zones
This summary was automatically generated by AI based on the original article and may not be fully accurate.