Between 2005 and 2008, I worked as a principal engineer for Amazon, where I had technical oversight responsibilities for a significant chunk of the Amazon.com retail website. Amazon is one of the most operationally competent companies on the planet, but such competence doesn’t happen by accident.
The level of operational availability that Amazon achieves on its website is a consequence of intentional planning and foresight, and it comes at a cost. To maintain availability in the face of unexpected events, Amazon continuously maintains substantial excess capacity. At the time I was there, our operational doctrine required us to provision 150% of our expected peak load and to spread that total capacity across three separate geographies. This allowed us to lose an entire geography and still serve 100% of peak website requests. At one point while I was there, we were using fully 10% of our entire available capacity merely to probe the system for availability problems so that we would discover them before our customers did. A customer-visible problem caused by an engineer could be a career-ending event at Amazon during those years.
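For readers who want to check the arithmetic behind the 150% figure, here is a minimal sketch. It assumes the capacity is split evenly across the geographies, which the doctrine described above does not state explicitly, and the function names are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction


def surviving_capacity(provisioned, regions, lost=1):
    """Capacity left, as a fraction of expected peak, after losing `lost` regions.

    Assumption: total capacity is divided evenly among the regions.
    """
    if not 0 <= lost <= regions:
        raise ValueError("lost must be between 0 and regions")
    per_region = Fraction(provisioned) / regions
    return per_region * (regions - lost)


def required_provisioning(regions, lost=1):
    """Smallest total capacity (fraction of peak) that still covers
    100% of peak after losing `lost` evenly sized regions."""
    if not 0 <= lost < regions:
        raise ValueError("at least one region must survive")
    return Fraction(regions, regions - lost)


# 150% of peak across three geographies: each holds 50% of peak,
# so losing one leaves exactly 100% of peak.
remaining = surviving_capacity(Fraction(3, 2), regions=3, lost=1)
print(f"Remaining after losing one of three: {float(remaining):.0%}")

# The same rule applied to two geographies would demand 200% of peak.
print(f"Needed with two geographies: {float(required_provisioning(2)):.0%}")
```

Under the even-split assumption, 150% is the minimum that survives the loss of one geography out of three. With only two geographies the same guarantee would require provisioning 200% of peak.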