October 20: Amazon’s Correction of Error Mechanism (Part 2)
Source: Dave Anderson, Scarlet Ink
Published: June 2021
Let’s continue with the example from yesterday where a large server failure takes down the Amazon.com store. Once the 5 Whys identify the root cause of the issue, action items are assigned.
5 Whys
Why did the site go down? All the servers were turned off.
Why were the servers turned off? A script was run which shut them off.
Why was a script able to turn off so many servers? Turning off many servers requires no more permission than turning off one.
Why did the team not notice immediately? There were no alarms/notifications in place.
Why was someone shutting down servers? Because hardware was being replaced.
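The third why points at a missing guardrail: a request to shut down many servers is treated the same as a request to shut down one. As a minimal sketch of what a permission difference could look like, the Python below refuses a bulk shutdown unless someone has approved it. The names `shut_down`, `max_unapproved`, `approved_by`, and `BulkShutdownError` are hypothetical and not from the original post, and the function only returns the list it would act on rather than touching real servers.

```python
class BulkShutdownError(Exception):
    """Raised when a shutdown request exceeds the unapproved batch limit."""


def shut_down(servers, max_unapproved=1, approved_by=None):
    """Return the servers that would be shut down, or raise if approval is missing.

    max_unapproved and approved_by are hypothetical parameters for this sketch.
    """
    servers = list(servers)
    # Small requests pass through; larger ones need a named approver.
    if len(servers) > max_unapproved and approved_by is None:
        raise BulkShutdownError(
            f"{len(servers)} servers requested; "
            f"more than {max_unapproved} needs approval"
        )
    return servers
```

With these defaults, shutting down a single server works as before, while a script that targets a whole fleet fails immediately unless an approver is named.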
Action items
Write a script for restarting servers ASAP, owned by __ employee.
Improve existing alarms by the end of the week, owned by __ employee.
Design automated recovery system for future outages by the end of the quarter, owned by __ employee.
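Each action item above pairs a task with an owner and a deadline. One way to keep that honest is to record the items in a structure that can flag anything still missing an owner or a due date. The sketch below is only an illustration: the class names, the `unassigned` helper, the owner "alice", and the date are all assumptions, not part of Amazon's actual tooling.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # left blank in the post ("__ employee")
    due: Optional[date] = None   # e.g. end of the week, end of the quarter


@dataclass
class CorrectionOfError:
    whys: List[Tuple[str, str]]  # (question, answer) pairs from the 5 Whys
    action_items: List[ActionItem] = field(default_factory=list)

    def unassigned(self) -> List[ActionItem]:
        """Return action items that lack an owner or a due date."""
        return [a for a in self.action_items if a.owner is None or a.due is None]


coe = CorrectionOfError(
    whys=[("Why did the site go down?", "All the servers were turned off.")],
    action_items=[
        # Owner and date here are hypothetical placeholders.
        ActionItem("Write a script for restarting servers",
                   owner="alice", due=date(2021, 6, 30)),
        ActionItem("Improve existing alarms"),  # no owner or deadline yet
    ],
)

for item in coe.unassigned():
    print("Needs owner or deadline:", item.description)
```

Running this lists only the alarm item, since it is the one without an owner and a deadline.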