Tag Archives: John O’Farrell

Alan Bradley: [holds up his pager] I was paged last night.

Sam Flynn: Oh, man, still rocking the pager? Good for you.

– TRON: Legacy

Here’s my card. It’s got my cell number, my pager number, my home number and my other pager number. I never take vacations, I never get sick. And I don’t celebrate any major holidays. 

– Dwight Schrute in NBC’s “The Office”

As Software Eats the World and more and more of our daily activities move online, we depend ever more on IT infrastructure. In a day spent emailing, tweeting, catching up on the news, checking Facebook, shopping on Amazon, watching a movie on Netflix and banking at an ATM, it’s all too easy to forget that underlying all of these “mission-critical” activities are servers, routers, load balancers, switches, storage and millions of lines of code.

In a world where we’ve come to view the Internet as a utility, a major website outage is almost as serious as a power outage — and usually affects far more customers. Amazon Web Services had several major outages in 2012, taking down Netflix, Reddit, Heroku and many other sites in July and December. In October, it was the turn of YouTube, Dropbox, Tumblr and Google AppEngine. GoDaddy’s September outage affected up to 5 million hosted websites and 50 million domain names for six hours.

In addition to negative publicity and customer dissatisfaction, downtime now has an enormous financial cost. A 2010 study reported that U.S. businesses suffer an average of 10 hours of downtime per year, at a cost of $26.5 billion. Another analysis suggests that one hour of downtime costs the average business $300,000. If there had been a major outage on our most recent Black Friday, it would have jeopardized $1 billion in online sales.

Dealing with downtime

Of course, modern IT infrastructure has been built for redundancy and is extensively instrumented. Automated tools such as Nagios, Keynote, New Relic, Pingdom, SolarWinds and Splunk monitor every element of the stack and alert engineers immediately to urgent or emerging issues. In fact, today’s machines are very good at detecting and reporting incidents. It’s when those incidents get handed off to humans for remediation that things sometimes break down — because the humans are still using processes and technology that haven’t changed much in ten to fifteen years.

When I was at Loudcloud back in 2001, everyone carried a pager. A small team in our 24/7 Network Operations Center (NOC) would watch for critical monitoring system alerts on big screens and then page the administrator on duty, no matter what time of the day or night. If the administrator couldn’t resolve the issue, they would escalate to developers, who also wore pagers. The process was labor-intensive and error-prone, involving emails, phone-calls, written duty rosters and escalation schedules.

While most other aspects of IT have changed dramatically, incident management in many IT organizations looks remarkably like it did back in 2001. The cloud has done away with the need for many NOCs, and the move to DevOps may mean developers are more directly involved in issue resolution, but the processes are frequently still manual, cumbersome and inefficient. Moreover, today’s large complex systems are never the responsibility of just one person — database administrators, developers, and system administrators all have a role to play — and the more people involved, the more complex and error-prone the process becomes. Reporting of incidents and handoffs from person to person are often done manually via email or SMS. Escalations and problem descriptions are handled via person-to-person phone calls. Engineers consult spreadsheets to see who’s on duty at a particular time. I’m aware of at least one major cloud service provider whose ops people still wear pagers.

PagerDuty

Having studied software engineering at the University of Waterloo and then built and supported large-scale systems at Amazon.com, Alex Solomon, Andrew Miklas and Baskar Puvanathasan set out to bring IT incident management into the twenty-first century. The result is PagerDuty, a modern SaaS-based platform for incident tracking, alerting, and on-call management.

In a nutshell, PagerDuty collects alerts from a customer’s existing IT monitoring tools and alerts the on-duty engineer if there’s a problem. PagerDuty doesn’t replace any particular monitoring tool. Instead, the system sits on top of existing monitoring systems and aggregates all of the errors generated by these tools in a single place.

[Screenshot: PagerDuty incidents]

PagerDuty allows each engineer to configure his or her own customized notification chain. Engineers can opt to receive incident alerts using any combination of phone calls, SMSes, emails and iOS push notifications. So, for example, you could opt to get a push notification immediately when an incident occurs, then an SMS 2 minutes later, then a phone call 5 minutes after that. PagerDuty also allows the on-call engineer to acknowledge, escalate or resolve a triggered incident directly from his or her mobile phone. The company utilizes multiple redundant data centers and SMS and telephony gateways to guarantee reliable message delivery across more than 100 countries.
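
The notification chain in the example above can be pictured as a list of steps, each with a delay relative to the previous one. The following is only an illustrative sketch of that idea, not PagerDuty's implementation or API; the names `NotificationStep` and `schedule_notifications` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class NotificationStep:
    """One step in an engineer's personal chain (hypothetical model)."""
    channel: str        # e.g. "push", "sms", "phone"
    delay_minutes: int  # wait time after the previous step


def schedule_notifications(
    triggered_at: datetime, chain: List[NotificationStep]
) -> List[Tuple[datetime, str]]:
    """Return (fire time, channel) pairs; delays accumulate step by step."""
    fire_at = triggered_at
    plan = []
    for step in chain:
        fire_at = fire_at + timedelta(minutes=step.delay_minutes)
        plan.append((fire_at, step.channel))
    return plan


# The chain from the text: push immediately, an SMS 2 minutes later,
# then a phone call 5 minutes after the SMS.
chain = [
    NotificationStep("push", 0),
    NotificationStep("sms", 2),
    NotificationStep("phone", 5),
]
plan = schedule_notifications(datetime(2013, 1, 1, 3, 0), chain)
for fire_at, channel in plan:
    print(fire_at.strftime("%H:%M"), channel)
```

In a real system the remaining steps would presumably be cancelled once the engineer acknowledges the incident; that part is left out of the sketch.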

Incidents in PagerDuty are routed according to an escalation policy. A policy specifies how incidents should be escalated within each team. For instance, you can configure a sysadmin policy to route incidents to a primary on-call engineer and automatically escalate the incident to a secondary on-call if the primary doesn’t answer within 20 minutes. Escalations are crucial to incident response because they add redundancy and ensure nothing falls through the cracks.
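
An escalation policy of this kind can be thought of as an ordered list of responders, each with a timeout. The sketch below is a simplified model written for illustration, not PagerDuty's actual data model; `EscalationLevel` and `current_responder` are hypothetical names, and the 20-minute timeout on the secondary level is an assumption (the text only gives the primary's timeout).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EscalationLevel:
    """One level of a policy (hypothetical model)."""
    responder: str
    timeout_minutes: int  # time allowed for an acknowledgement


def current_responder(
    policy: List[EscalationLevel], minutes_unacknowledged: int
) -> Optional[str]:
    """Return who should be alerted after N minutes without an acknowledgement."""
    elapsed = 0
    for level in policy:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    # Every level timed out. What should happen next is not described
    # in the text, so this sketch simply reports that nobody is left.
    return None


sysadmin_policy = [
    EscalationLevel("primary on-call", 20),
    EscalationLevel("secondary on-call", 20),  # assumed timeout
]

print(current_responder(sysadmin_policy, 5))
print(current_responder(sysadmin_policy, 20))
print(current_responder(sysadmin_policy, 45))
```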

[Screenshot: PagerDuty escalations]

PagerDuty lets you build different on-call schedules for each specialization within the organization. For example, you can create one schedule for your database administrators, and another for your network engineers. Incidents can be easily configured to alert the appropriate on-call specialist, ensuring that problems are always automatically dispatched to those who are on-duty and best able to handle them. No more spreadsheets!
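
To make the spreadsheet replacement concrete, here is a minimal sketch of per-specialization schedules and a lookup that finds who is on call at a given moment. The data layout, the engineer names and the `on_call` function are all invented for illustration and do not reflect how PagerDuty stores schedules.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# One schedule per specialization: a list of (start, end, engineer) shifts.
# All names and dates here are made up.
Shift = Tuple[datetime, datetime, str]

SCHEDULES: Dict[str, List[Shift]] = {
    "database": [
        (datetime(2013, 1, 7), datetime(2013, 1, 14), "dba-alice"),
        (datetime(2013, 1, 14), datetime(2013, 1, 21), "dba-bob"),
    ],
    "network": [
        (datetime(2013, 1, 7), datetime(2013, 1, 14), "net-carol"),
    ],
}


def on_call(specialization: str, at: datetime) -> Optional[str]:
    """Return the engineer whose shift covers `at`, or None if there is a gap."""
    for start, end, engineer in SCHEDULES.get(specialization, []):
        if start <= at < end:  # shifts are half-open: [start, end)
            return engineer
    return None


# A database incident at 02:30 on 15 January goes to whoever holds that shift.
print(on_call("database", datetime(2013, 1, 15, 2, 30)))
```

Returning `None` for an uncovered time makes schedule gaps visible, which is the kind of problem a hand-maintained spreadsheet tends to hide.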

[Screenshot: PagerDuty on-call schedule]

Getting customer feedback on PagerDuty proved to be very easy, as it turned out that a large majority of our portfolio companies were using the product, and they were overwhelmingly positive about how it has dramatically simplified and improved their IT operations management. In fact, the company already has several thousand paying customers, including web giants such as Microsoft, Electronic Arts, Adobe, Rackspace and Intuit as well as a growing number of enterprise IT organizations. Overall, PagerDuty has achieved a remarkable amount on about $2 million in initial funding, including generating a substantial and rapidly growing amount of recurring revenue. With a market of almost 10 million infrastructure and application specialists worldwide and multiple ways to expand within the multibillion-dollar IT Service Management segment, this company has a lot of potential.

In closing

The world’s inexorable transition to cloud computing and modern large-scale mission-critical IT systems is creating the opportunity for an exciting new generation of software companies like PagerDuty to play a critical role in its enablement. Many Andreessen Horowitz portfolio companies, for example GitHub, MixPanel, GoodData, CipherCloud and Snaplogic, are members of this class.

As veterans of IT systems management and automation ourselves, we are excited to lead a $10.7 million investment round for PagerDuty and welcome them to the a16z family.

Living in silos

As software continues to eat the world, we spend an ever-increasing portion of our time online. Worldwide, we pass over 35 billion hours a month in the digital world, with US Internet users spending an average of 32 hours online monthly. As the Web has evolved, more and more of that online time is spent in specialized venues such as Facebook, Instagram, Twitter, Pinterest, LinkedIn and Foursquare. While these are fantastic applications, they have a downside, in that they largely exist as parallel, unconnected containers for our personal data. Trapped in their respective silos, our posts, photos, tweets, pins and checkins are largely inaccessible to us from outside. Moreover, creating useful connections between one application and another is far beyond the average user. Sure, most popular web applications now have APIs, but they’re written for the benefit of developers, not people. (Take a look at Instagram’s API documentation, for example.)

Announcing IFTTT

Andreessen Horowitz’s latest investment, IFTTT (for If This Then That), is out to change all that. IFTTT (rhymes with “gift”) is a simple yet powerful way to create connections between any two web applications, triggering an action on one every time an event you specify happens on another. For example, when I post on App.net, my post instantly appears on Twitter too, thanks to IFTTT. Every time I post a photo (or am tagged in one) on Facebook, IFTTT downloads it to my Dropbox without my even having to think about it. Here’s what that recipe looks like:

[Image: IFTTT recipe connecting Facebook to Dropbox]

Looking for a short-notice ski rental property in Tahoe used to mean checking Craigslist several times a day – this week I just had a simple IFTTT recipe call my cellphone the minute the one I wanted showed up. I no longer check for new movies on Netflix – IFTTT does it on my behalf. While I go about my digital life, IFTTT is in the background, quietly watching out for me.

Like all of a16z’s investments, this one starts with a compelling founder. Linden Tibbets, IFTTT’s co-founder and CEO, started working on IFTTT from his San Francisco apartment in 2010, after three years at design firm IDEO. Struck by how instinctively we know how to use physical objects in creative ways, he set out to enable us to be just as creative with the applications we use in the digital world. To quote Linden, “Much like in the physical world when a 12 year old wants a light-saber, cuts the handle off an old broom and shoves a bike grip on the other end, you can take two things in the digital world and combine them in ways the original creators never imagined.”

“Digital duct tape”

What resulted is IFTTT, described by a recent interviewer as “an idea so alarmingly simple and amazingly powerful… it makes you wonder why nobody thought of it before.” True to Linden’s design roots, IFTTT is visually appealing, approachable and easy to use. In Linden’s words, “IFTTT isn’t a programming language or app building tool, but rather a much simpler solution. Digital duct tape if you will, allowing you to connect any two services together. You can leave the hard work of creating the individual tools to the engineers and designers.”

IFTTT allows people to create “recipes” that connect “channels” (e.g., Instagram, Dropbox, Twitter and 56 other apps) so that “ingredients” on one (e.g., an item with the description “dog painting” appears on Etsy) become a “trigger” for an “action” on the other (e.g., “Add a new line to my Dog Paintings spreadsheet on Google Docs”). A glance at some of IFTTT’s channels:

[Image: a selection of IFTTT channels]

With virtually no promotion, IFTTT has nevertheless achieved remarkable traction since its beta launch two years ago. Its mission statement is to “enable everyone to take creative control over the flow of information.” People have created over 2 million individual “recipes” to connect their favorite websites and apps in ways that meet their own unique needs. There are tens of thousands of shared recipes for common use cases. Three million recipes are executed every day, addressing individual interests as varied as “Text me if Apple stock drops below $500” and “Post my App.net posts tagged #a16z to Yammer.”
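
For readers who think in code, the “if this then that” idea behind a recipe can be sketched as a trigger paired with an action. This is a conceptual illustration only and is not how IFTTT is built; the `Recipe` class, the event fields and the `sent_texts` list (standing in for an SMS service) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, object]  # hypothetical event shape emitted by a channel


@dataclass
class Recipe:
    """A trigger on one channel wired to an action on another."""
    trigger: Callable[[Event], bool]  # "if this"
    action: Callable[[Event], None]   # "then that"

    def handle(self, event: Event) -> bool:
        """Run the action when the trigger matches; report whether it fired."""
        if self.trigger(event):
            self.action(event)
            return True
        return False


sent_texts: List[str] = []  # stand-in for a real SMS channel

# "Text me if Apple stock drops below $500"
stock_alert = Recipe(
    trigger=lambda e: (
        e.get("channel") == "stocks"
        and e.get("symbol") == "AAPL"
        and e["price"] < 500
    ),
    action=lambda e: sent_texts.append(f"AAPL dropped to ${e['price']}"),
)

stock_alert.handle({"channel": "stocks", "symbol": "AAPL", "price": 512.0})
stock_alert.handle({"channel": "stocks", "symbol": "AAPL", "price": 498.5})
print(sent_texts)
```

Only the second event satisfies the trigger, so a single text is recorded. The appeal of the real product is that people get this behavior without writing anything like the code above.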

Looking ahead

Although Linden and his tiny team of seven have achieved an amazing amount already, the best is yet to come. With Andreessen Horowitz’s investment, 2013 will bring more and simpler recipes and exciting mobile apps. A new developer platform will enable application developers to create services that connect their application to others in new and powerful ways, opening up new functionality to users with virtually no diversion of internal engineering resources. Perhaps most exciting of all is the role IFTTT can play in the emerging Internet of Things. As everyday objects from fridges to shoes to weighing scales become equipped with communicating smart sensors, individuals’ need to create useful connections and information flows between them will far outstrip their developers’ capacity to build them. I don’t know exactly when my fridge will be capable of knowing I’m running low on milk and contacting Safeway to order more, but there’s a pretty good chance an IFTTT recipe will be involved when it happens.