vmfarms – StartupNorth

I was approached by David Crow with an appealing proposition: we would migrate StartupNorth.ca to our infrastructure and manage it, and in return we would become a sponsor and have our logo placed on their site. You can read David’s post all about it.

In addition, I was asked to be a guest author on the site and share my thoughts on web operations and the cloud, topics I am quite familiar with. I spent some time thinking about what I should write for my first post, but that decision was made for me late last week.

Photo by Elite PhotoArt. Some rights reserved (No Derivative Works).

On April 21st, Amazon’s primary storage service (Elastic Block Store, or EBS) in an Availability Zone in their Northern Virginia region decided to kick the bucket. This failure is the equivalent of pulling out your computer’s hard drive while it’s running. The implications were likely dramatic for many businesses running on their cloud. As I watched many of my favorite sites (HootSuite and Reddit, to name a couple) showcase their 500 error pages, I couldn’t help but think about how helpless the owners must have felt as Amazon worked hard to get their service back up and running.

I won’t get into the details of what happened with Amazon, as you can get a good hour-by-hour breakdown of events from Amazon directly (http://status.aws.amazon.com/). Instead, I’d like to outline some of my thoughts and take-away lessons.

This failure rekindled some old fears I’ve had since the sky started clouding over. Considering the amount of resources and engineering required to build massive multi-tenant infrastructures, equally massive failures seem inevitable. Don’t get me wrong, Amazon has done a phenomenal job building arguably the world’s largest and most successful cloud deployment. They have some of the best and brightest engineers this industry has to offer, but it’s still worth thinking twice about putting your business up in the cloud.

I’m reminded of a series of meetings I attended with my previous employer in which NetApp (one of the leaders in online storage) pitched their latest and greatest cloud storage solution. We were looking to build our own cloud, so selecting the right storage solution was absolutely critical. As the well-informed NetApp architect diagrammed the proposed architecture on a whiteboard, I was struck by how many “dependency lines” were connected to the NetApp unit. Inevitably the question came up of what would happen if the unit itself failed. The answer was simple: buy 2 of them. It is a simple answer, but it effectively doubles the price tag. And buying 2 doesn’t necessarily prevent all types of failure.

Amazon was likely bitten by one of these failures (the post-mortem has yet to be released). Surprisingly, this nearly 5-day outage did not breach their 99.95% uptime SLA (99.95% is equivalent to roughly 4.38 hours of downtime per year). This is because the SLA is only applicable to their EC2 service, and not EBS (http://aws.amazon.com/ec2-sla/). Amazon is not legally obligated to reimburse any of its customers. Eventually though, most will likely forgive them, since their service is so revolutionary compared to the old rack-and-stack hosting model.
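The downtime figure quoted above is simple arithmetic. Here is a minimal sketch of it, assuming a 365-day year (8,760 hours); the function name is illustrative and is not part of any Amazon tooling:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Print the yearly budget for a few common uptime targets.
    for sla in (99.0, 99.9, 99.95, 99.999):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% uptime allows about {hours:.2f} hours of downtime per year")
```

Measured this way, an outage lasting several days exceeds a 99.95% budget many times over, which is why the scope of the SLA matters more than the headline number.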

Some rights reserved photo by svofski

Still, some harsh realities sank in this past week. So what can be learned from all this? Below I’ve identified 4 key points, which I bring up time and time again.

Build for Failure.
This is by far the most important point I emphasize. Whenever I advise my clients on a proposed architecture, I always ask the question “can you tolerate component X being down for a day?” Framing the question in that manner highlights the reality of complicated systems – failure will happen, and you need to plan for it. As a side note, I find it ironic that Amazon’s tightly coupled infrastructure requires application developers to do the opposite and build their applications around loosely coupled design. It is unfortunate that we need to change the way we deploy our applications to fit the provider, but it is a necessary evil.
Read the fine print carefully.
Make sure you read through the SLA before you agree to the terms. This may seem obvious, but it’s surprising how many people gloss over this. Understand that uptime guarantees are usually not all-encompassing, as is the case with Amazon.
Explore alternatives.
Not all cloud infrastructures are alike. The allure of the cloud is that it abstracts away the underlying implementation, but you should still do your homework and investigate the design decisions made by the vendor. The level of complexity should be factored into your decision-making process. Like others who have worked out ways to avoid such problems, we at VM Farms take a different approach to building redundancy into our cloud.
Identify your Achilles Heel.
It helps to diagram your setup and draw dependency lines. If you notice any “hot spots” or “single points of failure” (SPOF), focus your engineering team on them and do what you can to mitigate those risks. Sometimes budgets or design decisions will prevent you from eliminating them. Even so, make sure you know what they are and incorporate this into your decision-making process.
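The dependency-line exercise from the last point can also be done in a few lines of code. The sketch below is only an illustration: the component names, instance counts and the rule "fewer than two instances means a single point of failure" are all assumptions made up for the example, not a description of any real deployment.

```python
from collections import Counter

# Hypothetical setup: each component lists what it depends on and how many
# redundant instances of it exist. All names and numbers are invented.
SETUP = {
    "website": {"depends_on": ["load_balancer"], "instances": 1},
    "load_balancer": {"depends_on": ["app_server"], "instances": 2},
    "app_server": {"depends_on": ["database", "storage"], "instances": 3},
    "database": {"depends_on": ["storage"], "instances": 2},
    "storage": {"depends_on": [], "instances": 1},
}


def required_components(setup, entry_point):
    """Return every component the entry point depends on, directly or not."""
    seen = set()
    stack = list(setup[entry_point]["depends_on"])
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(setup[name]["depends_on"])
    return seen


def single_points_of_failure(setup, entry_point):
    """Return required components with fewer than two instances, sorted by name."""
    return sorted(
        name
        for name in required_components(setup, entry_point)
        if setup[name]["instances"] < 2
    )


def dependency_line_counts(setup):
    """Count the direct dependency lines pointing at each component."""
    return Counter(dep for spec in setup.values() for dep in spec["depends_on"])


if __name__ == "__main__":
    print("Single points of failure:", single_points_of_failure(SETUP, "website"))
    print("Hot spots by dependency lines:", dependency_line_counts(SETUP).most_common())
```

In this made-up setup the storage layer shows up both as the only single point of failure and as the component with the most dependency lines, which is the kind of hot spot worth the engineering team’s attention.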

Instant resource availability will make any CTO’s mouth water. Just make sure you know what you’re getting into. When it rains, it pours.

Some rights reserved by Arthur40A

It was at StartupDrinks or the StartupNorth Meetup that I was talking with Christopher and Hany at VMFarms about our hosting woes. StartupNorth.ca had been offline for about 24 hours, and we were in the dark as our previous hosting provider tried to recover from a disk failure. Now, we don’t have particularly complex hosting requirements, but one of the requirements is uptime. We’d like our services to be available to entrepreneurs 24x7x365. Mind you, we’re not willing to pay for 99.999% uptime (learn more about high availability). I would have been ecstatic with “three nines”, aka 99.9% or 8.76 hours of downtime per year. But we had failed to meet that requirement, as we were approaching 99.5% uptime, and without a foreseeable solution we could hit “two nines” (99% uptime) if something wasn’t done by us or the hosting provider.
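To make the nines concrete, here is a small sketch that converts hours of downtime into the availability achieved. It assumes a 365-day year and that the outage is measured against a full year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def availability_percent(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the availability, as a percentage, achieved over the period."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    # Downtime totals, in hours, to compare against the usual targets.
    for hours in (8.76, 24, 43.8, 87.6):
        print(f"{hours} hours down in a year is about {availability_percent(hours):.2f}% availability")
```

Under those assumptions, a 24-hour outage on its own works out to roughly 99.73% for the year, already short of three nines; 99.5% corresponds to about 43.8 hours of downtime and 99% to about 87.6 hours.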

We have been very lucky. We had been able to host StartupNorth.ca on a shared hosting solution that allowed us to operate our WordPress installation (StartupNorth.ca), a custom Django application (StartupIndex.ca), and some custom development (stay tuned) built in PHP5 + MySQL + Apache2, all for $20/month. That is cheap compared to what we could be paying. It worked for a long time, but now I was in the dark and couldn’t see a light indicating a path to salvation.

I had met Christopher and Hany a few months before. I personally love the business. In fact, I think I pitched Scott Pelton (@spelton) a similar idea back in the summer of 2010. With Christopher and Hany there is a core team of experienced developer operations and network operations professionals who cut their teeth deploying and supporting high-availability, leading-edge web applications at Avid Life Media (PHP, Rails, Django, etc.). At StartupNorth, we are big proponents of supporting our local ecosystem. We use:

Guestlistapp for our event ticketing,
Freshbooks for invoicing,
WaveAccounting for our accounting system,
Eqentia as a news aggregator.

There was no reason other than cost that we should have our server applications hosted with a non-local provider.

I approached Christopher and Hany with a proposition: they should sponsor StartupNorth with an in-kind sponsorship. They provide hosting and support on their infrastructure. We add their logo to the StartupNorth web site and page footers, plus we give them the opportunity to write a few posts. The posts aren’t meant to be marketing fluff. I’ve asked the VM Farms team to talk about their real-world experiences, including:

their experience using different cloud services;
network architecture and application hosting for advanced web & mobile applications (think application server, MongoDB or Hadoop clusters plus relational datastores);
when/how startups should evaluate the different performance vs cost trade-offs in advanced applications (there’s nothing wrong with choosing AWS, but when should you look for alternatives?).

We are incredibly picky about our sponsors. We are even more picky about the posts and authors we ask to join us. We take our reputation and the commentary we provide about the Canadian startup ecosystem very seriously. We’re hoping that we can help educate entrepreneurs about advanced network infrastructure decisions and the impact these decisions can have on costs, performance and growth. And with the team at VMFarms, we have some partners that are experienced and capable of providing a unique 3rd party view of AWS, Rackspace, GoGrid, Azure, Linode, etc. and traditional hosting environments.

We’ve been on VMFarms for about 30 days now. We have had 2 outages in those 30 days. Both outages have been my fault. Hany and Christopher have been on the ball and quick to help me diagnose and fix the issues.

Upgrading WordPress 3.1 to 3.1.1 – the Unix user permissions and file access settings I configured on the web directory do not allow the FTP user to write to the web directory. WordPress automatic upgrade requires an FTP user (though we connect using FTP-SSL). I ssh’d to the server, fetched the update with wget and extracted it with “tar -xvzf” into the WordPress directory. This overwrote the .htaccess file and broke the Apache rewrite rules. Resolution time: approximately 15 minutes (because I insisted on doing it myself).
StartupNorth.ca unavailable on April 19, 2011 – it turns out we let our domain registration lapse because of an expired credit card. It was identified by 2 users (thank you William and Scott (@scotthom)). Hany debugged it in about 15 seconds, and it required Jevon (@jevon) to renew the registration.

The team at VMFarms have been fantastic. They are helping StartupNorth immensely. I’m really looking forward to more discussion about developer operations in startups (it should be interesting, given that my own network infrastructure does not yet include VMFarms – we’re on GitHub, Heroku and AWS EC2 + S3). I’m wondering what John Philip Green (@johnphilipgreen) at CommunityLend, Pete Forde (@peteforde) at BuzzData, Daniel Debow (@ddebow) at Rypple, David Ossip (@dossip) at Dayforce, Chris Sukornyk (@sukornyk) at Chango, and Mike McDerment (@MikeMcDerment) at FreshBooks use to host their different application layers.