The Presentation That Changed How I Think About Tech Ops

I started my career in telco, so I learned a very old-school style of managing tech ops: 99.999% availability, everything planned in advance, control-room-style networks, and 10 levels of permissions and sign-offs to change anything. So imagine me, coming from that world, building a new tech ops stack and processes, when I stumbled across the above presentation from John Allspaw at Velocity. The idea of doing 10+ deploys a day, coming from a world of big releases and maintenance windows, completely blew my mind. This presentation and its ideology have morphed into what is now known as “devops”, which I think is really the only way for a startup to operate its service.

Stumbling around the internet back in the founding days, I found some other tech ops resources & people that did a lot to change my view.

@netik – John Adams, who heads up ops at Twitter. He presents often, and you can find a lot of his material on how Twitter handles tech ops topics. This is one of his more recent presentations: http://www.ritholtz.com/blog/2011/05/twitter-ops-clouds-scale/. (I’m amazed that Twitter continues to rely on MySQL as its backbone; so do we at Peek, though obviously at a fraction of their scale.)

http://codeascraft.etsy.com/ – Etsy’s notes on deploying and engineering; note that John Allspaw is now at Etsy. Their one-click Deployinator tool is intriguing.

@vvuksan – Vladimir has a great blog http://vuksan.com/blog/ with dense information, especially around gathering data and stats.

@cgoldberg – Corey Goldberg has built a ton of tools for performance testing and for understanding the capabilities of your software.

On top of the resources above, I’d suggest a few other practices for your initial tech ops infrastructure.

Decouple from your hosting provider

You should only ever require an instance (virtual or physical) plus a base OS build from your hosting provider. If you depend on them for more, you make it difficult to switch providers, so when they have an embarrassing outage you won’t be able to do anything about it. To be specific, on Amazon this means you should always use “raw” AMIs, such as the Canonical Ubuntu AMIs, and then layer on the software you need with Puppet, Chef, and your favourite config & deploy tools. If you do this, launching a new server at RackSpace takes about 20 minutes, and you could have sidestepped the whole Amazon outage.
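To make that concrete, here’s a minimal sketch of the “instance + base OS, everything else is yours” idea, written against the boto3 EC2 client. The AMI ID, key pair name, and Puppet repo URL are placeholders for illustration; the point is that the only provider-specific piece is the one API call, while the bootstrap script would work on any box that hands you a bare Ubuntu install.

import boto3

# First-boot user-data: it only bootstraps the config tool. Everything the
# server actually does is defined in your own Puppet (or Chef) repo, not
# baked into a provider-specific image.
BOOTSTRAP = """#!/bin/bash
apt-get update
apt-get install -y puppet git
git clone https://example.com/your-org/puppet-manifests.git /etc/puppet/code
puppet apply /etc/puppet/code/manifests/site.pp
"""

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.run_instances(
    ImageId="ami-xxxxxxxx",     # a raw Canonical Ubuntu AMI (placeholder ID)
    InstanceType="t2.micro",    # placeholder size
    KeyName="ops-keypair",      # placeholder key pair
    MinCount=1,
    MaxCount=1,
    UserData=BOOTSTRAP,         # boto3 base64-encodes the user-data for you
)

Switching providers then means swapping that one run_instances call for the RackSpace equivalent (or bare metal plus PXE); the bootstrap and everything behind it stays the same.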

Favour tools, automation & culture over process & documentation

Personally, I think wikis are bad: nobody reads them (or remembers the logins for them). In the time it takes to write up a procedure in a wiki, you could probably have built a Ruby or bash script that does whatever you were explaining. So if you install Apache or ejabberd or something, don’t take notes on the installation; set up a script in Chef or Puppet so that next time it takes 2 minutes to set up. Everything should be a one-click/one-command tool or script.
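As a toy example, here is what the “wiki page on installing Apache” collapses into, sketched in Python rather than the Chef recipe or bash script you’d probably use for real. The package name, service command, and config path are the stock Debian/Ubuntu ones; the version-controlled config file it copies is an assumed file in your own repo.

#!/usr/bin/env python
# One-command Apache install: run it, don't document it.
import subprocess
import sys

def run(cmd):
    """Run a command and fail loudly, so a broken step is obvious."""
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

def main():
    run(["apt-get", "update"])
    run(["apt-get", "install", "-y", "apache2"])
    # Ship the version-controlled config instead of hand-editing it on the box.
    run(["cp", "config/apache2.conf", "/etc/apache2/apache2.conf"])
    run(["service", "apache2", "restart"])
    print("apache2 installed and serving")

if __name__ == "__main__":
    sys.exit(main())

Drop that (or the Chef/Puppet equivalent) into the repo and the next install really is a two-minute, one-command affair instead of an out-of-date wiki page.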

Own your own hardware, brother

At some point Amazon and the cloud will stop making sense. The post called Petabytes on a Budget does a great job of outlining why that is the case. Eventually, with enough scale, the cost of your own hardware plus the expertise to run it drops below the cost of the cloud.
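If it helps to see the shape of that argument, here is a back-of-envelope comparison. Every number in it is a made-up placeholder (swap in your own AWS bill, hardware quotes, and salary figures); the only real point is that cloud spend grows roughly linearly with your footprint, while owned hardware is mostly up-front capital plus a fixed ops salary, so a crossover shows up somewhere.

# All figures below are hypothetical placeholders, not real prices.
MONTHS = 36                      # planning horizon

CLOUD_PER_SERVER_MONTH = 300.0   # hypothetical on-demand cost per server
OWNED_SERVER_CAPEX = 3000.0      # hypothetical purchase price per server
COLO_PER_SERVER_MONTH = 60.0     # hypothetical power/space/bandwidth
OPS_SALARY_MONTH = 10000.0       # hypothetical cost of the person running the racks

def cloud_cost(servers):
    return servers * CLOUD_PER_SERVER_MONTH * MONTHS

def owned_cost(servers):
    return (servers * OWNED_SERVER_CAPEX
            + servers * COLO_PER_SERVER_MONTH * MONTHS
            + OPS_SALARY_MONTH * MONTHS)

for servers in (10, 50, 100, 200):
    print(servers, "servers: cloud", int(cloud_cost(servers)),
          "vs owned", int(owned_cost(servers)))

With these placeholder numbers the crossover lands somewhere between 50 and 100 servers; with your own numbers it will land somewhere else, but it will land.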

One last note, and a pitch for a product I rely on a lot: if you need an app to wake you up late at night when things go wrong in your tech ops world, take a look at Toronto's own PagerDuty.com. We used it to replace a $2,500/month service with PagerDuty's $50/month offering. (I have no connection to them other than liking their product.)

I’d love to hear some thoughts from others on how they run their infrastructure.