October 20: Amazon’s Correction of Error Mechanism (Part 2)

Source: Dave Anderson, Scarlet Ink
Published: June 2021

Let’s continue with yesterday’s example, in which a large server failure takes down the Amazon.com store. Once the 5 Whys have identified the root cause of the issue, action items are assigned.

5 Whys

  1. Why did the site go down? All the servers were turned off.

  2. Why were the servers turned off? A script was run which shut them off.

  3. Why was a script able to turn off so many servers? The permissions for turning off one server are the same as for turning off many.

  4. Why did the team not notice immediately? There were no alarms/notifications in place.

  5. Why was someone shutting down servers? Because hardware was being replaced.
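
The third why points at a missing guard rail: the same permission covered one server and the whole fleet. As a purely illustrative sketch (the function name, the exception, and the threshold are all hypothetical, not Amazon's actual tooling), a check that refuses large batches without explicit approval might look like this:

```python
# Hypothetical sketch: every name and the threshold below are assumptions
# made for illustration, not part of any real Amazon system.
MAX_UNAPPROVED_SHUTDOWNS = 1  # assumed limit; a real fleet would tune this


class BulkShutdownError(Exception):
    """Raised when a shutdown request exceeds the unapproved limit."""


def plan_shutdown(server_ids, bulk_approved=False):
    """Return the servers to shut down, refusing large batches without approval."""
    # De-duplicate while preserving the requested order.
    targets = list(dict.fromkeys(server_ids))
    if len(targets) > MAX_UNAPPROVED_SHUTDOWNS and not bulk_approved:
        raise BulkShutdownError(
            f"{len(targets)} servers requested; bulk approval is required "
            f"above {MAX_UNAPPROVED_SHUTDOWNS}"
        )
    return targets
```

With a check like this, replacing hardware on a single server goes through as before, while a script targeting many servers fails fast unless someone has deliberately approved the bulk operation.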

Action items

  1. Write a script for restarting servers ASAP, owned by __ employee.

  2. Improve existing alarms by the end of the week, owned by __ employee.

  3. Design automated recovery system for future outages by the end of the quarter, owned by __ employee.
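
Each action item above pairs a task with a deadline and an owner. One way to keep that structure explicit is a small record type plus a check for items nobody owns yet. This is only a sketch: the `ActionItem` class and `unassigned` helper are hypothetical names, and the owners are left empty because the example leaves them blank.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    """One follow-up task from a review (hypothetical structure)."""
    description: str
    deadline: str            # free text, as in the example ("ASAP", ...)
    owner: Optional[str]     # None while the owner is still unassigned


def unassigned(items):
    """Return the action items that still lack a named owner."""
    return [item for item in items if not item.owner]


# The three items from the example; owners are blank in the original.
items = [
    ActionItem("Write a script for restarting servers", "ASAP", None),
    ActionItem("Improve existing alarms", "end of the week", None),
    ActionItem("Design automated recovery system for future outages",
               "end of the quarter", None),
]

for item in unassigned(items):
    print(f"Needs an owner: {item.description} (due: {item.deadline})")
```

The point of the check is that an action item without an owner or a deadline is easy to write down and easy to forget.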
