11 articles
Pinterest engineers debugged why Ray-based ML training jobs were crashing with intermittent network connectivity issues on Kubernetes clusters backed by AWS EC2.
Pinterest describes its approach to automatically measuring user-perceived latency (Visually Complete) on Android surfaces by embedding measurement logic into base UI classes.
This post details Netflix's multi-step CPU optimization journey for video serendipity scoring in the Ranker recommendation service using JDK's Vector API.
Netflix explains how migrating to kubelet + containerd with per-container user namespaces triggered severe mount lock contention on r5.metal nodes.
This post describes five engineering and algorithmic interventions developed at Microsoft to stabilize reinforcement learning post-training of multimodal agents for Copilot at production scale.
This article compares SVG and raster image loaders (GIF/PNG) for web loading indicators, explaining when and why SVG is generally the better choice.
Grab celebrates the 10th anniversary of its bug bounty program in partnership with HackerOne, reflecting on a decade of collaborative security research.
This article describes how Grab's internal AI platform SpellVault evolved from a no-code LLM app builder into an agentic AI platform capable of reasoning and acting dynamically.
Grab migrated its macOS CI/CD infrastructure from a US cloud vendor to a self-owned colocation cluster in Malaysia, achieving major cost and performance gains.
Grab built a custom ~1B-parameter Vision LLM to improve eKYC document processing for Southeast Asian languages and documents.
BBC Online shares how it redesigned its Webpack code-splitting strategy after its core bundles grew beyond 1 MB in combined size.