Engineering at Slack | DevOps

Advancing Our Chef Infrastructure: Safety Without Disruption

2025-10-23
16 min read
by Archie Gunasekara

Endigest AI Core Summary

Slack describes how it improved the safety of its Chef infrastructure by splitting its production environment and adopting a release train model, without disrupting existing cookbooks or roles.

  • The single production Chef environment was split into six environments, prod-1 through prod-6, mapped by Availability Zone to reduce the blast radius during large scale-out events
  • The Poptart Bootstrap tool (baked into base AMIs) was extended to inspect each node's AZ ID and assign the node to the matching numbered Chef environment at boot time
  • prod-1 serves as a canary environment: it receives the latest cookbook changes every hour so that issues are caught early
  • prod-2 through prod-6 follow a release train model, in which each environment must successfully receive a version before that version propagates to the next
  • Changes flow through the sandbox and dev environments at the top of the hour, then into the production environments starting at 30 minutes past the hour via a Kubernetes cron job
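
The two mechanisms above, AZ-based environment assignment and the release train, can be illustrated with a minimal Python sketch. This is not Slack's implementation: the function names, the `deploy` callback, and the rule that maps an AZ ID's trailing number onto prod-1 through prod-6 are all assumptions made for illustration, since the summary does not give the actual mapping or tooling.

```python
import re
from typing import Callable, List

PROD_ENV_COUNT = 6  # prod-1 through prod-6, as described in the summary


def chef_environment_for_az(az_id: str) -> str:
    """Map an AZ ID such as 'use1-az3' to a numbered Chef environment.

    Hypothetical rule: take the AZ ID's trailing number and wrap it
    into the range 1..6. The real mapping is not given in the summary.
    """
    match = re.search(r"-az(\d+)$", az_id)
    if match is None:
        raise ValueError(f"unrecognized AZ ID: {az_id!r}")
    index = (int(match.group(1)) - 1) % PROD_ENV_COUNT + 1
    return f"prod-{index}"


# The release train covers prod-2 through prod-6; prod-1 is the canary.
TRAIN: List[str] = [f"prod-{i}" for i in range(2, PROD_ENV_COUNT + 1)]


def run_release_train(
    version: str,
    deploy: Callable[[str, str], bool],
    train: List[str] = TRAIN,
) -> List[str]:
    """Promote `version` through `train` in order.

    `deploy(env, version)` is a hypothetical callback that returns True
    when the environment successfully received the version. Promotion
    stops at the first failure, so later environments never see a
    version that an earlier one rejected.
    """
    received: List[str] = []
    for env in train:
        if not deploy(env, version):
            break
        received.append(env)
    return received
```

For example, `chef_environment_for_az("use1-az3")` returns `"prod-3"` under the assumed rule, and a `deploy` callback that fails on prod-4 leaves the version on prod-2 and prod-3 only. This sketch collapses the train into a single call; per the summary, the real promotion is scheduled and runs across hourly cycles.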
Tags:
#aws