As software engineers, we tend to be overly optimistic about software. In particular, we often underestimate the probability of system and component failures and the impact such events can have on our applications.
We usually dismiss failure events as random, unlikely and sporadic. And, often, we are proven wrong.
Systems do fail. Moreover, when something goes wrong, either it’s barely noticeable or it leads to extreme consequences. Take the recent AWS outage: everything was caused by a mistake during a routine network change.
Now, some days after the event, post-mortem analyses and survival stories count in the dozens, and one recurring lesson emerges from what happened.
Not only should we try to prevent failure, but we should also design our systems to deal with failure as if it were systematic. This is the only way to build robust and resilient software, and even then it may not be enough.
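In practice, designing for failure often means degrading gracefully rather than propagating an error to the user. Here is a minimal sketch of that idea in Python; all the names (`with_fallback`, `fetch_recommendations`, `cached_defaults`) are hypothetical, invented for illustration:

```python
def with_fallback(primary, fallback):
    """Wrap `primary` so that any failure degrades gracefully to `fallback`.

    This is a deliberately naive sketch: real systems would also add
    timeouts, retries with backoff, and circuit breakers.
    """
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Treat failure as an expected, systematic event, not an anomaly.
            return fallback(*args, **kwargs)
    return wrapped

# Hypothetical service call that is currently failing.
def fetch_recommendations(user_id):
    raise ConnectionError("recommendation service unreachable")

# Hypothetical degraded response: a precomputed, cached default list.
def cached_defaults(user_id):
    return ["popular-title-1", "popular-title-2"]

get_recommendations = with_fallback(fetch_recommendations, cached_defaults)
result = get_recommendations("alice")  # falls back instead of crashing
```

The point is not the wrapper itself but the mindset: the fallback path is a first-class part of the design, exercised as routinely as the happy path.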
Most of the sites and applications that were able to stay up and running while others were taken offline were designed with that principle in mind.
Netflix’s example is probably the most significant: while they were affected by the outage, things looked quite calm from a customer’s perspective. In retrospect, it’s not surprising at all, given what they wrote on their Tech Blog:
We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
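The core of the Chaos Monkey idea fits in a few lines. The sketch below is an assumption-laden toy, not Netflix’s actual tool (which operates on real AWS instances); the names `pick_victims` and `kill_fraction` are invented here:

```python
import random

def pick_victims(instances, kill_fraction=0.1, rng=None):
    """Randomly select a fraction of running instances to terminate.

    Toy version of the Chaos Monkey idea: in a real deployment the
    selected instances would actually be killed, forcing every service
    to prove it tolerates the loss of its dependencies.
    """
    rng = rng or random.Random()
    count = max(1, int(len(instances) * kill_fraction))
    return rng.sample(instances, count)

fleet = ["i-001", "i-002", "i-003", "i-004"]
victims = pick_victims(fleet, kill_fraction=0.25)  # one random instance
```

Running something like this continuously, in production, is what turns “we should tolerate failure” from a slogan into a constantly verified property.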
Even though Amazon explicitly encourages developers to build their applications to deal with random failures or service degradations, the folks at Netflix went out of their way to make sure they were not dismissing those issues as unlikely. It’s good to know that this approach eventually paid off, and even if we’re often not in a position to adopt such radical strategies, we should certainly keep in mind the lessons they have learned.
It’s time to stop making fun of Murphy’s law, and start designing our stuff according to it. Stuff breaks, and when it happens, it breaks badly.