Site reliability engineering kicks rote tasks out of IT ops

DevOps is making inroads into many enterprises old and new, but Google is sharing its site reliability engineering prowess in a book that champions NoOps.

DevOps dominates the ideology around operations and development. But what does it mean that one of the biggest technology companies doesn't do ops?

Internet search giant Google released Site Reliability Engineering: How Google Runs Production Systems, a book focusing on its software engineering approach to IT operations. The book covers the gamut of practices and management tips to help any enterprise, big or small, implement site reliability engineering (SRE), which, the authors say, is not taught in schools.

"Google really is entering into a new phase of openness about some of our technology and about many of our engineering approaches," explained Todd Underwood, a director in SRE at Google.

Todd Underwood, site reliability director at Google

Site reliability engineering -- a term coined over ten years ago by Google's vice president of engineering Benjamin Treynor Sloss -- is what happens when you apply software engineering skills to operations problems, explained Jennifer Petoff, global program manager of site reliability engineering education and one of the book's editors. The approach tackles some familiar issues faced by traditional IT operations teams.

In its early days, Google was "frugal and ambitious," Underwood said. The traditional approach to IT operations was too costly and required too many people to work at Google's then-projected scale. IT operators tend to do many repetitive daily tasks, and the work can be fairly reactive. Site reliability engineering is about automating away the more mundane parts of ops jobs, including monitoring the network and the operations work on systems and security.

"We try our best to simply not do operations," Underwood said. "When we end up doing operations, we regard that as a set of necessary steps to get to a point where we don't have to do that operational work anymore."

That's the goal of Underwood, Petoff and the more than 70 contributors involved with creating the book. And it isn't just for the Web giants.

"We think that this kind of approach applies to a whole bunch of software that's in production and a whole bunch [of] similar circumstances for lots of other organizations," Underwood said.

Not doing operations may scare IT operations professionals, since this approach could leave them without a job or with fewer companies looking to hire IT ops pros.

"They should embrace [SRE]. We're trying to make their lives so much better," according to Underwood. In fact, when functional aspects are automated, IT ops pros get to do much more interesting stuff.

"The majority of people who [have] titles like SysOps or sys admin or network administrator ... have technical skills and capabilities and interests that far outstrip their day-to-day job responsibilities," Underwood said. Site reliability engineering is a way to organize work, look at problems and build the organization so those people are empowered to do interesting, actual engineering.

Jennifer Petoff, global program manager of SRE education

Site reliability engineering isn't just about efficiency. Petoff shared how empowerment through the "blameless postmortem" principle benefits their IT culture.

"If you look at companies or places where this [blameless postmortem culture] doesn't exist, engineers constantly live in fear of making mistakes. It slows you down and it encourages cover ups," she said. What it doesn't do, she added, is make your IT infrastructure better.

Automating some things doesn't mean automating everything, however, and Underwood stressed caution about what gets automated. For example, automating a service with a memory leak to restart after an occasional crash may seem like an easy way to fix a recurring problem, but leaving that software bug unresolved could end up causing bigger problems as the service scales.
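One way to picture that caution is a restart policy that refuses to hide a recurring crash. The Python sketch below is purely illustrative and is not taken from the book or from Google's tooling; the class name `RestartPolicy`, the threshold of three restarts and the one-hour window are all assumptions chosen for the example. It restarts a crashed service, but once crashes pile up inside the window it returns "escalate" so that a person looks at the underlying bug.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RestartPolicy:
    """Hypothetical policy: restart on crash, escalate when crashes repeat."""

    max_restarts: int = 3            # assumed threshold, for illustration only
    window_seconds: float = 3600.0   # assumed look-back window
    _crash_times: deque = field(default_factory=deque, init=False, repr=False)

    def on_crash(self, now: float) -> str:
        """Record a crash at time `now` (in seconds) and pick an action."""
        self._crash_times.append(now)
        # Drop crashes that have fallen out of the look-back window.
        while now - self._crash_times[0] > self.window_seconds:
            self._crash_times.popleft()
        # Too many crashes in the window: stop masking the bug.
        if len(self._crash_times) > self.max_restarts:
            return "escalate"
        return "restart"


if __name__ == "__main__":
    policy = RestartPolicy(max_restarts=2, window_seconds=100.0)
    for t in (0.0, 10.0, 20.0, 500.0):
        print(t, policy.on_crash(t))
```

With these assumed settings, the first two crashes are restarted automatically, the third inside the window is escalated, and a crash long after the window has passed is treated as a fresh, isolated event.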

Chapter 7 of Site Reliability Engineering: How Google Runs Production Systems describes automation's benefits this way:

[W]ithin Google SRE, our primary affinity has typically been for running infrastructure, as opposed to managing the quality of the data that passes over that infrastructure. This line isn't totally clear--for example, we care deeply if half of a dataset vanishes after a push, and therefore we alert on coarse-grain differences like this, but it's rare for us to write the equivalent of changing the properties of some arbitrary subset of accounts on a system. Therefore, the context for our automation is often automation to manage the lifecycle of systems, not their data: for example, deployments of a service in a new cluster.

To this extent, SRE's automation efforts are not far off what many other people and organizations do, except that we use different tools to manage it and have a different focus (as we'll discuss).

The full chapter covers the evolution of automation at Google, the value of automation, use cases for automation and applying automation to cluster turnups.

Editor's note: This excerpt is from Site Reliability Engineering: How Google Runs Production Systems, edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy, and published by O'Reilly Media, Inc., Apr. 2016, Print ISBN 9781491929124.