Who’s on call?

Alan Bradley: [holds up his pager] I was paged last night.

Sam Flynn: Oh, man, still rocking the pager? Good for you.

- TRON: Legacy

Here’s my card. It’s got my cell number, my pager number, my home number and my other pager number. I never take vacations, I never get sick. And I don’t celebrate any major holidays. 

- Dwight Schrute in NBC’s “The Office”

As Software Eats the World and more and more of our daily activities move online, we depend ever more on IT infrastructure. In a day spent emailing, tweeting, catching up on the news, checking Facebook, shopping on Amazon, watching a movie on Netflix and banking at an ATM, it’s all too easy to forget that underlying all of these “mission-critical” activities are servers, routers, load balancers, switches, storage and millions of lines of code.

In a world where we’ve come to view the Internet as a utility, a major website outage is almost as serious as a power outage, and it usually affects far more customers. Amazon Web Services had several major outages in 2012, taking down Netflix, Reddit, Heroku and many other sites in July and December. In October, it was the turn of YouTube, Dropbox, Tumblr and Google App Engine. GoDaddy’s September outage affected up to 5 million hosted websites and 50 million domain names for six hours.

In addition to negative publicity and customer dissatisfaction, downtime now has an enormous financial cost. A 2010 study reported that U.S. businesses suffer an average of 10 hours of downtime per year, at a cost of $26.5 billion. Another analysis suggests that one hour of downtime costs the average business $300,000. If there had been a major outage on our most recent Black Friday, it would have jeopardized $1 billion in online sales.

Dealing with downtime

Of course, modern IT infrastructure has been built for redundancy and is extensively instrumented. Automated tools such as Nagios, Keynote, New Relic, Pingdom, SolarWinds and Splunk monitor every element of the stack and alert engineers immediately to urgent or emerging issues. In fact, today’s machines are very good at detecting and reporting incidents. It’s when those incidents get handed off to humans for remediation that things sometimes break down — because the humans are still using processes and technology that haven’t changed much in ten to fifteen years.

When I was at Loudcloud back in 2001, everyone carried a pager. A small team in our 24/7 Network Operations Center (NOC) would watch for critical monitoring system alerts on big screens and then page the administrator on duty, no matter what time of the day or night. If the administrator couldn’t resolve the issue, they would escalate to developers, who also wore pagers. The process was labor-intensive and error-prone, involving emails, phone calls, written duty rosters and escalation schedules.

While most other aspects of IT have changed dramatically, incident management in many IT organizations looks remarkably like it did back in 2001. The cloud has done away with the need for many NOCs, and the move to DevOps may mean developers are more directly involved in issue resolution, but the processes are frequently still manual, cumbersome and inefficient. Moreover, today’s large complex systems are never the responsibility of just one person — database administrators, developers, and system administrators all have a role to play — and the more people involved, the more complex and error-prone the process becomes. Reporting of incidents and handoffs from person to person are often done manually via email or SMS. Escalations and problem descriptions are handled via person-to-person phone calls. Engineers consult spreadsheets to see who’s on duty at a particular time. I’m aware of at least one major cloud service provider whose ops people still wear pagers.

PagerDuty

Having studied software engineering at the University of Waterloo and then built and supported large-scale systems at Amazon.com, Alex Solomon, Andrew Miklas and Baskar Puvanathasan set out to bring IT incident management into the twenty-first century. The result is PagerDuty, a modern SaaS-based platform for incident tracking, alerting, and on-call management.

In a nutshell, PagerDuty collects alerts from a customer’s existing IT monitoring tools and alerts the on-duty engineer if there’s a problem. PagerDuty doesn’t replace any particular monitoring tool. Instead, the system sits on top of existing monitoring systems and aggregates all of the errors generated by these tools in a single place.
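
The aggregation idea can be illustrated with a minimal sketch. This is not PagerDuty's API or its actual data model: the `Incident` class, the `aggregate` function and the notion of grouping alerts by a shared key are all assumptions made for illustration, and the sample alerts are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    key: str  # hypothetical grouping key, e.g. "db01/disk"
    sources: list = field(default_factory=list)   # tools that reported it
    messages: list = field(default_factory=list)  # raw alert text


def aggregate(alerts):
    """Merge raw (source, key, message) alerts into one incident per key."""
    incidents = {}
    for source, key, message in alerts:
        incident = incidents.setdefault(key, Incident(key))
        if source not in incident.sources:
            incident.sources.append(source)
        incident.messages.append(message)
    return list(incidents.values())


# Invented sample alerts from three different monitoring tools.
alerts = [
    ("Nagios", "db01/disk", "disk usage above 95%"),
    ("Pingdom", "www/http", "HTTP check failed"),
    ("New Relic", "db01/disk", "disk I/O saturated"),
]
incidents = aggregate(alerts)
for incident in incidents:
    print(incident.key, incident.sources, len(incident.messages))
```

Here two tools report on the same database host, so the engineer sees one incident for it rather than two unrelated alerts.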

[Image: PagerDuty incidents]

PagerDuty allows each engineer to configure his or her own customized notification chain. Engineers can opt to receive incident alerts using any combination of phone calls, SMS messages, emails and iOS push notifications. For example, you could opt to get a push notification immediately when an incident occurs, then an SMS 2 minutes later, then a phone call 5 minutes after that. PagerDuty also allows the on-call engineer to acknowledge, escalate or resolve a triggered incident directly from his or her mobile phone. The company uses multiple redundant data centers and SMS and telephony gateways to guarantee reliable message delivery across more than 100 countries.
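
To make the timing of that example chain concrete, here is a small sketch. The rule format and the `schedule` and `channels_due` helpers are hypothetical, not PagerDuty's configuration format, and the sketch assumes that acknowledging an incident stops any later notifications.

```python
from itertools import accumulate

# (channel, minutes to wait after the previous step), as in the example above.
NOTIFICATION_RULES = [
    ("push", 0),
    ("sms", 2),
    ("phone", 5),
]


def schedule(rules):
    """Return (channel, minutes after the incident triggers) for each step."""
    offsets = accumulate(delay for _, delay in rules)
    return [(channel, offset) for (channel, _), offset in zip(rules, offsets)]


def channels_due(rules, minutes_elapsed, acknowledged_at=None):
    """Channels that have fired so far; steps at or after the ack are skipped."""
    return [
        channel
        for channel, offset in schedule(rules)
        if offset <= minutes_elapsed
        and (acknowledged_at is None or offset < acknowledged_at)
    ]


print(schedule(NOTIFICATION_RULES))
print(channels_due(NOTIFICATION_RULES, 10, acknowledged_at=3))
```

Because each delay is relative to the previous step, the phone call in this chain goes out 7 minutes after the incident triggers, and an engineer who acknowledges at minute 3 never receives it.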

Incidents in PagerDuty are routed according to an escalation policy. A policy specifies how incidents should be escalated within each team. For instance, you can configure a sysadmin policy to route incidents to a primary on-call engineer and automatically escalate the incident to a secondary on-call if the primary doesn’t answer within 20 minutes. Escalations are crucial to incident response because they add redundancy and ensure nothing falls through the cracks.
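
The sysadmin policy described above might be modeled roughly as follows. The `EscalationLevel` class and `current_responder` function are illustrative names, not part of any real API. The 20-minute timeout on the secondary level and the choice to stay with the last responder once the policy runs out are both assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    responder: str        # who is notified at this level
    timeout_minutes: int  # wait this long for an acknowledgement, then escalate


SYSADMIN_POLICY = [
    EscalationLevel("primary on-call", 20),
    EscalationLevel("secondary on-call", 20),  # assumed timeout
]


def current_responder(policy, minutes_unacknowledged):
    """Return who should hold an incident that is still unacknowledged."""
    elapsed = 0
    for level in policy:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    # Assumption: after the last level times out, stay with the final responder.
    return policy[-1].responder


for minutes in (5, 20, 45):
    print(minutes, current_responder(SYSADMIN_POLICY, minutes))
```

The incident stays with the primary for the first 20 minutes and moves to the secondary from minute 20 onward, which is the redundancy the policy is meant to provide.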

[Image: PagerDuty escalations]

PagerDuty lets you build different on-call schedules for each specialization within the organization. For example, you can create one schedule for your database administrators, and another for your network engineers. Incidents can be easily configured to alert the appropriate on-call specialist, ensuring that problems are always automatically dispatched to those who are on-duty and best able to handle them. No more spreadsheets!
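
As a rough picture of what replaces the spreadsheet, the sketch below looks up the on-call engineer for a given team and date. Everything in it is hypothetical: the engineer names, the weekly rotation and the `ROTATION_START` anchor date are assumptions for illustration, not a description of how PagerDuty stores schedules.

```python
from datetime import date

# Hypothetical rotations: each team's engineers take one-week shifts in turn.
SCHEDULES = {
    "database": ["Ana", "Bo"],
    "network": ["Chen", "Dee", "Eli"],
}
ROTATION_START = date(2013, 1, 7)  # a Monday, used as week 0 (arbitrary anchor)


def on_call(team, day):
    """Return who is on call for `team` on `day`, rotating weekly."""
    rotation = SCHEDULES[team]
    weeks_elapsed = (day - ROTATION_START).days // 7
    return rotation[weeks_elapsed % len(rotation)]


print(on_call("database", date(2013, 1, 16)))
print(on_call("network", date(2013, 1, 16)))
```

A database alert and a network alert raised on the same day resolve to different people, each drawn from the schedule for the relevant specialization.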

[Image: PagerDuty on-call schedule]

Getting customer feedback on PagerDuty proved to be very easy, as it turned out that a large majority of our portfolio companies were using the product, and they were overwhelmingly positive about how it has dramatically simplified and improved their IT operations management. In fact, the company already has several thousand paying customers, including web giants such as Microsoft, Electronic Arts, Adobe, Rackspace and Intuit as well as a growing number of enterprise IT organizations. Overall, PagerDuty has achieved a remarkable amount on about $2 million in initial funding, including substantial and rapidly growing recurring revenue. With a market of almost 10 million infrastructure and application specialists worldwide and multiple ways to expand within the multi-billion dollar IT Service Management segment, this company has a lot of potential.

In closing

The world’s inexorable transition to cloud computing and to modern, large-scale, mission-critical IT systems is creating an opportunity for an exciting new generation of software companies like PagerDuty to play a critical role in enabling it. Many Andreessen Horowitz portfolio companies, such as GitHub, Mixpanel, GoodData, CipherCloud and SnapLogic, are members of this class.

As veterans of IT systems management and automation ourselves, we are excited to lead a $10.7 million investment round for PagerDuty and welcome them to the a16z family.
