
Sponsor Post – It’s getting cloudy out there

I was approached by David Crow with an appealing proposition: we would migrate StartupNorth.ca to our infrastructure and manage it, and in return VM Farms would become a sponsor and have its logo placed on the site. You can read all about it in David’s post.

I was also asked to be a guest author on the site and share my thoughts on web operations and the cloud, a subject I am quite familiar with. I spent some time thinking about what to write for my first post, but that decision was made for me late last week.

Photo by Elite PhotoArt - William and Lisa Roberts (CC BY-ND 2.0, some rights reserved)

On April 21st, Amazon’s primary storage service (Elastic Block Store, or EBS) in their Northern Virginia (US East) region decided to kick the bucket. This failure is the equivalent of pulling out your computer’s hard drive while it’s running. The implications were likely dramatic for many businesses running on their cloud. As I watched many of my favorite sites (HootSuite and Reddit, to name a couple) showcase their 500 error pages, I couldn’t help but think about how helpless the owners must have felt as Amazon worked hard to get their service back up and running.

I won’t get into the details of what happened with Amazon, as you can get a good hour-by-hour breakdown of events from Amazon directly (http://status.aws.amazon.com/). Instead, I’d like to outline some of my thoughts and take-away lessons.

This failure rekindled some old fears I’ve had since the sky started clouding over. Considering the amount of resources and engineering required to build massive multi-tenant infrastructures, equally massive failures seem inevitable. Don’t get me wrong, Amazon has done a phenomenal job building arguably the world’s largest and most successful cloud deployment. They have some of the best and brightest engineers this industry has to offer, but it’s still worth thinking twice about putting your business up in the cloud.

I’m reminded of a series of meetings I attended with my previous employer in which NetApp (one of the leaders in online storage) pitched their latest and greatest cloud storage solution. We were looking to build our own cloud, so selecting the right storage solution was absolutely critical. As the well-informed NetApp architect diagrammed the proposed architecture on a whiteboard, I was struck by how many “dependency lines” were connected to the NetApp unit. Inevitably the question came up of what would happen if the unit itself failed. The answer was simple: buy two of them. A simple answer, but one that effectively doubles the price tag. And buying two doesn’t necessarily prevent all types of failure.

Amazon was likely bitten by one of these failures (the post-mortem has yet to be released). Surprisingly, this nearly 5-day outage did not breach their 99.95% uptime SLA (99.95% is equivalent to roughly 4.38 hours of downtime per year). This is because the SLA applies only to their EC2 service, and not to EBS (http://aws.amazon.com/ec2-sla/). Amazon is not legally obligated to reimburse any of its customers. Eventually, though, most will likely forgive them, since their service is so revolutionary compared to the old rack-and-stack hosting model.
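To make the SLA arithmetic concrete, here is a minimal Python sketch of how an uptime percentage translates into allowed downtime, and what uptime a multi-day outage would leave you with. The function names are my own, the year is assumed to be 365 days, and the 120-hour figure is only an upper bound for a "nearly 5-day" outage, not a number published by Amazon.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted per period by an uptime guarantee."""
    return period_hours * (1 - uptime_percent / 100)


def achieved_uptime_percent(outage_hours, period_hours=HOURS_PER_YEAR):
    """Uptime percentage over the period after a given amount of downtime."""
    return 100 * (1 - outage_hours / period_hours)


# A 99.95% guarantee allows roughly 4.38 hours of downtime per year.
print(round(allowed_downtime_hours(99.95), 2))

# Hypothetical upper bound: a full 5 days (120 hours) of downtime in a year.
print(round(achieved_uptime_percent(120), 2))
```

Under those assumptions, a service that was down for the full 120 hours would finish the year at about 98.63% uptime, well short of 99.95%. That gap only matters contractually if the failed service is actually covered by the SLA.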

Photo by svofski (CC BY-SA 2.0, some rights reserved)

Still, some harsh realities hit home this past week. So what can be learned from all this? Below I’ve identified four key points, which I bring up time and time again.

  1. Build for Failure.
    This is by far the most important point I emphasize. Whenever I advise my clients on a proposed architecture, I always ask the question “can you tolerate component X being down for a day?” Framing the question in that manner highlights the reality of complicated systems: failure will happen, and you need to plan for it. As a side note, I find it ironic that Amazon’s tightly coupled infrastructure requires application developers to do the opposite and build their applications around loosely coupled design. It is unfortunate that we need to change the way we deploy our applications to fit the provider, but it is a necessary evil.
  2. Read the fine print carefully.
    Make sure you read through the SLA before you agree to the terms. This may seem obvious, but it’s surprising how many people gloss over it. Understand that uptime guarantees are usually not all-encompassing, as is the case with Amazon.
  3. Explore alternatives.
    Not all cloud infrastructures are alike. The allure of the cloud is that it abstracts away the underlying implementation, but you should still do your homework and investigate the design decisions made by the vendor. The level of complexity should be factored into your decision-making process. Like others who have worked out ways to avoid such problems, we at VM Farms take a different approach to building redundancy into our cloud.
  4. Identify your Achilles’ heel.
    It helps to diagram your setup and draw dependency lines. If you notice any “hot spots” or “single points of failure” (SPOF), focus your engineering team on them and do what you can to mitigate those risks. Sometimes budgets or design decisions will prevent you from eliminating them. Even so, make sure you know what they are and incorporate this into your decision-making process.
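The dependency-line exercise in the last point can also be done in a few lines of code. The following is a rough Python sketch, not a tool we ship: it treats the setup as a directed graph, removes one component at a time, and reports every component whose removal cuts users off from storage. The topology, the component names, and the function names are all hypothetical.

```python
from collections import deque

# Hypothetical setup: each component lists what it sends requests to.
GRAPH = {
    "users": ["lb"],
    "lb": ["web1", "web2"],
    "web1": ["db"],
    "web2": ["db"],
    "db": ["storage"],
}


def reachable(graph, start, goal, removed):
    """Breadth-first search from start to goal, skipping the removed node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Components whose failure alone disconnects start from goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates if not reachable(graph, start, goal, n))


print(single_points_of_failure(GRAPH, "users", "storage"))
```

In this made-up topology the two web servers cover for each other, while the load balancer and the database do not, so those two are where the engineering effort (or at least the awareness) should go. A real setup would also need to account for shared dependencies such as power, network, and the storage unit itself.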

Instant resource availability will make any CTO’s mouth water. Just make sure you know what you’re getting into. When it rains, it pours.

  1. Good post, Hany. I might add that one of the reasons to “build for failure” is that even fault-tolerant systems depend on their pieces’ being put together correctly. A few years ago I worked at a company that used a NetApp storage system that was supposedly fault-tolerant, including having two controllers sharing the load with each one able to take the entire load itself if the other failed. Eventually one controller did fail, but the system had been miswired by the vendor and the failover did not happen. The result was that some of our users were without service for several days.

  2. Hey Rohan, I completely agree. Humans are not infallible, and this should always be taken into consideration when building for failure. I’ve worked with NetApp units for many years, and regardless of the amount of time and money spent on making them fail-safe, something inevitably went wrong.
