21 articles
Airbnb migrated its identity graph from a third-party PaaS to an internally managed knowledge graph infrastructure built on JanusGraph and DynamoDB.
This post explains how Airbnb eliminated circular dependencies in its observability stack to ensure reliable monitoring at scale.
Netflix's Switchboard processes 1 million requests per second, providing a centralized ML abstraction for clients.
Pinterest improved the network efficiency of ML serving by implementing Feature Trimmer to relieve a bandwidth bottleneck.
Skipper is Airbnb's embedded workflow engine designed to enable durable execution of multi-step business processes without requiring external orchestration infrastructure.
Airbnb built a metrics storage system ingesting 50 million samples per second and storing 1.3 billion active time series.
Pinterest's MIQPS algorithm automatically learns which URL parameters affect content identity, enabling efficient deduplication across millions of merchant URLs.
Pinterest shares its request-level deduplication technique for managing infrastructure costs when scaling recommendation systems to 100x more model parameters.
This post details a production migration of a large-scale metrics pipeline from StatsD to OpenTelemetry (OTLP) with Prometheus-based storage and vmagent for streaming aggregation.
Slack addresses HTTP/3 observability challenges through QUIC support in Prometheus Blackbox Exporter.
Pinterest describes how they built a production MCP (Model Context Protocol) ecosystem to enable AI agents to safely automate engineering tasks.
Airbnb shares hard-won lessons from migrating its observability platform from third-party vendors to a custom in-house solution built on Prometheus across 1,000 services.
Airbnb explains how they rebuilt their Observability as Code (OaC) alert development workflow to eliminate weeks-long validation cycles.
Pinterest's Piqama is a generic quota management ecosystem that handles the full lifecycle of resource quotas across Big Data Processing and Online Services.
This post describes how Airbnb built "Sitar," their internal dynamic configuration platform for shipping runtime config changes safely at scale.
Airbnb evolved Mussel, its multi-tenant key-value store, from simple QPS rate limiting to an adaptive traffic management system to maximize goodput during traffic spikes.
Slack's Deploy Safety Program reduced customer impact hours by 90% over 18 months by overhauling deployment practices and safety culture.
Airbnb shares how they completely rearchitected Mussel, their internal key-value store for derived data, migrating from v1 to a NewSQL-based v2 that has been running in production for over a year.
Airbnb introduces Viaduct, a data-oriented service mesh built on GraphQL that addresses the complexity of large-scale microservices dependency graphs.
Airbnb completed a 4.5-year migration of their JVM monorepo (tens of millions of lines of Java, Kotlin, and Scala) from Gradle to Bazel.