Should We Drink the Local Kool-Aid?

Editor’s note: This is a cross post from Mark Evans Tech written by Mark Evans of ME Consulting. Follow him on Twitter @markevans or at MarkEvansTech.com. This post was originally published on December 15, 2011 on MarkEvansTech.com.

Photo: CC BY-NC, some rights reserved by Eric Constantineau – www.ericconstantineau.com

In the post I wrote earlier this week about the demise of Thoora, there was a comment suggesting that “Toronto failed Thoora” due to a lack of community support to make it a “winning formula”.

It was a puzzling comment because it suggests a community has an obligation to support a startup so it can thrive. This strikes me as an absurd idea because startups should succeed or fail on their own merits, and on their ability to attract an audience near and far.

Sure, it’s good to drink the local flavour of “Kool-Aid” but only if a startup is offering a product or service that meets a need or interest. There are lots of local startups, including some that pitch me directly, that don’t resonate because nothing about them interests me or the product/service doesn’t resonate enough to warrant further exploration.

It doesn’t mean I’m not supporting the local community; it just means a startup has a service that didn’t pass the sniff test.

At the same time, I do think Toronto’s startup community is extremely supportive. There’s no lack of enthusiasm, energy and a willingness to share ideas, feedback, resources, real estate and time to provide startups with a boost.

This has been a fact of life for the past five years, even before we started to see a flurry of startups appear on the scene. There has always been a strong, supportive community that has pulled together in different ways. A great example is tonight’s HoHoTo party, which has become a major fund-raising machine due to tremendous support from the community.

The bottom line is that if a startup needs to rely on the community to make it, what it’s offering probably can’t survive without artificial support.

For startups, the market has to be bigger than its own backyard. It needs people to support it or not based on what’s being sold as opposed to a sense of duty or obligation.


Sponsor Post – It’s getting cloudy out there

I was approached by David Crow with an appealing proposition: we would help StartupNorth.ca by migrating their site to our infrastructure and managing it, and in return we would become a sponsor and have our logo placed on their site. You can read David’s post all about it.

In addition, I was asked to be a guest author on the site and provide my thoughts on various topics on the subject of web operations and the cloud – a topic I am quite familiar with. I spent some time thinking about what I should write for my first post, but that decision was made for me late last week.

Photo: CC BY-ND 2.0, some rights reserved by Elite PhotoArt – William and Lisa Roberts

On April 21st, Amazon’s primary storage service (Elastic Block Store, or EBS) in their Northern Virginia region decided to kick the bucket. This failure is the equivalent of pulling out your computer’s hard drive while it’s running. The implications were likely dramatic for many businesses running on their cloud. As I watched many of my favorite sites (HootSuite and Reddit, to name a couple) showcase their 500 error pages, I couldn’t help but think about how helpless the owners felt as Amazon worked hard to get their service back up and running.

I won’t get into the details of what happened with Amazon, as you can get a good hour-by-hour breakdown of events from Amazon directly (http://status.aws.amazon.com/). Instead, I’d like to outline some of my thoughts and take-away lessons.

This failure rekindled some old fears I’ve had since the sky started clouding over. Considering the amount of resources and engineering required to build massive multi-tenant infrastructures, equally massive failures seem inevitable. Don’t get me wrong, Amazon has done a phenomenal job building arguably the world’s largest and most successful cloud deployment. They have some of the best and brightest engineers this industry has to offer, but it’s still worth thinking twice about putting your business up in the cloud.

I’m reminded of a series of meetings I attended with my previous employer in which NetApp (one of the leaders in online storage) pitched their latest and greatest cloud storage solution. We were looking to build our own cloud, so selecting the right storage solution was absolutely critical. As the well-informed NetApp architect diagrammed the proposed architecture on a whiteboard, I was struck by how many “dependency lines” were connected to the NetApp unit. Inevitably the question came up of what would happen if the unit itself failed. The answer was simple: buy two of them. A simple answer, but one that effectively doubles the price tag. And buying two doesn’t necessarily prevent every type of failure.

Amazon was likely bitten by one of these failures (the post-mortem has yet to be released). Surprisingly, this nearly 5-day outage did not breach their 99.95% uptime SLA (99.95% is equivalent to roughly 4.38 hours of downtime per year). This is because the SLA is only applicable to their EC2 service, and not EBS (http://aws.amazon.com/ec2-sla/). Amazon is not legally obligated to reimburse any of its customers. Eventually, though, most will likely forgive them since their service is so revolutionary compared to the old rack-and-stack hosting model.
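The arithmetic behind that 4.38-hour figure is worth making explicit. A minimal sketch (function name is mine, not from any SLA document):

```python
# Convert an uptime SLA percentage into the downtime it actually
# permits per year, as with the 99.95% figure cited above.

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def allowed_downtime_hours(sla_percent):
    """Hours of downtime per year permitted by an uptime SLA."""
    return HOURS_PER_YEAR * (100.0 - sla_percent) / 100.0

print(round(allowed_downtime_hours(99.95), 2))  # 4.38 hours per year
print(round(allowed_downtime_hours(99.9), 2))   # 8.76 hours per year
```

A nearly 5-day outage is roughly 120 hours – over 27 times the annual budget a 99.95% guarantee implies, which is why the EC2-only scope of the SLA matters so much here.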

Photo: CC BY-SA 2.0, some rights reserved by svofski

Still, some harsh realities hit home this past week. So what can be learned from all this? Below I’ve identified four key points, which I bring up time and time again.

  1. Build for Failure.
    This is by far the most important point I emphasize. Whenever I advise my clients on a proposed architecture, I always ask the question “Can you tolerate component X being down for a day?” Framing the question in that manner highlights the reality of complicated systems: failure will happen, and you need to plan for it. As a side note, I find it ironic that Amazon’s tightly coupled infrastructure requires application developers to do the opposite and build their applications around loosely coupled design. It is unfortunate that we need to change the way we deploy our applications to fit the provider, but it is a necessary evil.
  2. Read the fine print carefully.
    Make sure you read through the SLA before you agree to the terms. This may seem obvious, but it’s surprising how many people gloss over this. Understand that uptime guarantees are usually not all encompassing, such as the case with Amazon.
  3. Explore alternatives.
    Not all cloud infrastructures are alike. The allure of the cloud is that it abstracts away the underlying implementation, but you should still do your homework and investigate the design decisions made by the vendor. The level of complexity should factor into your decision-making process. Like others who have worked out solutions to avoid such problems, we at VM Farms take a different approach to building redundancy into our cloud.
  4. Identify your Achilles Heel.
    It helps to diagram your setup and draw dependency lines. If you notice any “hot spots” or “single points of failure” (SPOF), focus your engineering team on them and do what you can to mitigate those risks. Sometimes budgets or design decisions will prevent you from avoiding them. However, make sure you know what they are and incorporate this into your decision making process.

Instant resource availability will make any CTO salivate. Just make sure you know what you’re getting into. When it rains, it pours.