Advancing Our Chef Infrastructure: Safety Without Disruption
2025-10-23
16 min read
by Archie Gunasekara
Endigest AI Core Summary
Slack describes how it made its Chef infrastructure safer by splitting the production environment and adopting a release train model, without disrupting existing cookbooks or roles.
- Single production Chef environment was split into prod-1 through prod-6, mapped by Availability Zone to reduce blast radius during large scale-out events
- Poptart Bootstrap tool (baked into base AMIs) was extended to inspect AZ ID and assign nodes to the appropriate numbered Chef environment at boot time
- prod-1 serves as a canary environment, receiving updates every hour with the latest cookbook changes to catch issues early
- prod-2 through prod-6 follow a release train model where each environment must successfully receive a version before it propagates to the next
- Changes flow through sandbox and dev environments at the top of the hour, then into production environments starting at 30 minutes past the hour via a Kubernetes cron job
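
The two mechanisms in the summary above, AZ-based environment assignment and the release train, can be illustrated with a minimal Python sketch. It is not Slack's implementation. The function names `env_for_az` and `advance_train` are hypothetical, the rule that the numeric suffix of an AZ ID (for example `use1-az3`) selects the environment number is an assumption, and so is the idea that prod-2 takes its next version from prod-1. The summary states only that each environment must successfully receive a version before it propagates to the next.

```python
import re

# The six production environments named in the summary.
PROD_ENVS = [f"prod-{n}" for n in range(1, 7)]


def env_for_az(az_id: str) -> str:
    """Map an AZ ID such as 'use1-az3' to a numbered Chef environment.

    ASSUMPTION: the numeric suffix of the AZ ID picks the environment.
    The real mapping used by Poptart Bootstrap is not given in the source.
    """
    match = re.fullmatch(r"[a-z0-9]+-az(\d+)", az_id)
    if match is None:
        raise ValueError(f"unrecognized AZ ID: {az_id!r}")
    number = int(match.group(1))
    if not 1 <= number <= len(PROD_ENVS):
        raise ValueError(f"no environment defined for AZ ID: {az_id!r}")
    return PROD_ENVS[number - 1]


def advance_train(deployed: dict[str, str], latest: str) -> dict[str, str]:
    """Return the version held by each environment after one scheduled run.

    ASSUMPTION: every run succeeds, so each environment takes the version
    its predecessor held before this run. prod-1 (the canary) always takes
    the latest version.
    """
    after = dict(deployed)
    for previous, current in zip(PROD_ENVS, PROD_ENVS[1:]):
        # Read from the pre-run state so a version moves one hop per run.
        after[current] = deployed[previous]
    after[PROD_ENVS[0]] = latest
    return after


if __name__ == "__main__":
    state = {env: "v1" for env in PROD_ENVS}
    state = advance_train(state, "v2")
    state = advance_train(state, "v3")
    print(env_for_az("use1-az3"), state)
```

Under these assumptions a bad version reaches one environment per run, so a failure caught in prod-1 or prod-2 never touches the nodes in the remaining Availability Zones. A real promotion step would also need to check the outcome of each run before advancing, which this sketch leaves out.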
Tags:
#Uncategorized
#aws
