DevOps

Finding zombies in our systems: A real-world story of CPU bottlenecks

2026-04-15

16 min read

by Pinterest Engineering

Tags: performance, kubernetes, machine-learning, engineering

Endigest AI Core Summary

Pinterest engineers debugged why Ray-based ML training jobs were crashing with intermittent network connectivity issues on Kubernetes clusters backed by AWS EC2.

• Ray generates high-volume inter-pod gRPC traffic across its control and data planes, so network stability is critical for distributed ML training at scale
• CPU starvation stalled the ENA network driver's TX threads for more than 5 seconds, which triggered automatic device resets and packet loss
• Mitigation attempts, including Transparent Huge Pages, the jemalloc memory allocator, CPU affinity with taskset, and interrupt pinning, showed minimal improvement
• Network resets correlated with high system CPU usage and page faults; machine reboots gave temporary relief, but the issues returned after approximately one week
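
The correlation between resets and high system CPU usage noted above can be checked on a Linux host by sampling `/proc/stat`. The following is a minimal sketch, not Pinterest's tooling: the function names `parse_cpu_line` and `system_cpu_share` are hypothetical, and counting `system`, `irq`, and `softirq` together as "kernel time" is an assumption made for illustration.

```python
from typing import Dict

# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
# Guest time is already included in "user", so these eight fields sum to the total.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text: str) -> Dict[str, int]:
    """Return the jiffy counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(x) for x in parts[1:1 + len(FIELDS)]]
            # Older kernels report fewer fields; treat the missing ones as zero.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def system_cpu_share(before: str, after: str) -> float:
    """Fraction of CPU time spent in the kernel between two /proc/stat samples.

    Assumption: kernel time = system + irq + softirq.
    """
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: second[name] - first[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no time elapsed between samples (or counters wrapped)
    kernel = deltas["system"] + deltas["irq"] + deltas["softirq"]
    return kernel / total
```

Two reads of `/proc/stat` taken a few seconds apart can be passed in as `before` and `after`. Logging the result alongside the timestamps of driver resets would show whether the two move together on a given node.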
