To scale up IT reliability, break apart operations

One strategy toward more IT reliability is to make it everyone's responsibility.

NEW YORK -- When IT operations became a problem for these companies, they found ways to make IT operations everyone's problem.

"Normally, one central operations team supports all the different dev teams. It does work, but it's not scalable," said Hamed Silatani, manager of the performance and reliability engineering team of London-based IG Group. The financial-trading company scaled up its applications from 30 to 300, while adding another day of trading that increased uptime hours and trade volume -- from 2 million to 8 million per month -- which also increased cost per outage. The operations team needed to improve IT reliability, while supporting growth.

Two different approaches can bring stable production into the development and business mindset, Silatani said: Either you're holding a gun to their heads, or you're fostering an environment that makes stakeholders want to play a role in IT reliability and performance. IG operates under the assumption that everyone wants to have reliable, high-performance and operational applications. To enable this, IG created an IT reliability mentorship mentality that Silatani described in a presentation at Velocity here this week.

The first step to make operations scalable was to make it accessible -- share logs, give users self-service options for tests, and otherwise reduce reliance on operations as the sole source and controller of production IT. Awareness was another key factor, Silatani said. He encouraged production IT engineers to discuss the effect of production incidents on revenues, customer satisfaction and corporate resiliency to major market events. Skilled members of the operations team embedded with the development teams can break down operations decisions and concepts into achievable pieces. His group also changed the approach to requisite IT processes, such as compliance audits, to avoid interrupting the flow of business.

The final step to scale up reliable, stable operations was to give the application developers and owners support responsibilities. "If we'd tackled this [step] first instead of last, it would have been a huge problem," Silatani said. IG had to build a trusting relationship with application experts to create a supportive ecosystem for shared dev and ops responsibilities.

Giving product delivery teams operations responsibilities is "a huge change," said Rebecca Moss, program manager for Agile at Advance Digital, a media company headquartered in Jersey City, N.J. Advance Digital is about halfway into its DevOps adoption. The company is asking product developers to think about operational considerations, such as architecting software for resilience on cloud infrastructure and making the most efficient use of cloud resources, she said.

There is some concern that more time spent on the operations side means less time doing development work, as well as lowered productivity, Moss said, but to the company, this tradeoff is worth it. Before DevOps, operations folks were the first ones paged about issues, even if they didn't know anything about the application they were servicing. Now, with developers and product managers in the on-call rotation, these teams have to think about business continuity and other issues based in IT reliability.

To this end, Silatani's team recognized that developers and product managers needed tools to perform production support, and the team created two sets: one for operating the software and another for troubleshooting it.

The mentor relationship

Embedding reliability mentors into the teams they serve helps achieve better software deployments, Silatani explained. He chooses mentors who understand "the pain of support" and who suggest ways to approach problems rather than dictate solutions.

Bringing IT operations into the application teams was an approach also taken by Berlin-based audio distribution platform SoundCloud, and it "make[s] the product team[s] autonomous," said Matthias Rampke, production engineer at SoundCloud.

Advance Digital has a team with a mostly ops and systems background that currently provides support, and it can float in and out of different teams to help them take on ops responsibilities, Moss said. The company is working toward a reliability mentor relationship similar to IG's for these operations professionals, considering names such as engagement managers and ambassadors.

The customers of IT operations and production engineering teams -- developers, application owners and lines of business -- are not monolithic. "We had development teams across three time zones, with different levels of maturity," Silatani said, referring to continuous integration, monitoring and knowledge of how their applications functioned on production systems. He found that a mentor who is culturally close to the team they embed with will integrate more quickly, particularly if the company operates internationally.

Embedded, rather than centralized, IT operations isn't easy to get right on the first try. Rampke recalled problems when people involved in a specific build became the on-call support: "It was too much time on call and too specialized to just those people," he said. The teams grouped a bunch of services together to rotate on-call duties with more people, which encouraged the responsible parties to write up clear and easily understood run books.
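The pooling idea Rampke describes can be illustrated with a minimal sketch. The service names, engineer names and one-person-per-week schedule below are hypothetical and are not SoundCloud's actual setup; the point is only that merging several per-service lists into one rotation spreads the on-call load across more people.

```python
from itertools import cycle, islice


def pooled_rotation(teams: dict[str, list[str]], weeks: int) -> list[str]:
    """Merge per-service on-call lists into one pool and assign one person per week."""
    pool: list[str] = []
    for members in teams.values():
        for person in members:
            # Someone who supports several services still appears only once in the pool.
            if person not in pool:
                pool.append(person)
    # Walk the pool round-robin for the requested number of weeks.
    return list(islice(cycle(pool), weeks))


# Hypothetical services and engineers, for illustration only.
teams = {
    "search": ["ana", "ben"],
    "uploads": ["ben", "chen"],
    "player": ["dee"],
}

schedule = pooled_rotation(teams, 6)
print(schedule)
```

In this made-up example, the single engineer behind the "player" service would be on call every week if the service stood alone, but only one week in four once the three services share a rotation. That shared duty is also why clear run books matter: the person paged may not have built the service.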

SoundCloud found certain operations tasks were trickier than others, such as troubleshooting databases, so they formed a group, called the MySQL Collective, with people who have the specialized knowledge to debug problems.

"Product owners want to know how to fix their [own] database and work with it," Rampke said.

The MySQL Collective's goal was to become obsolete as an on-call group, though it still consults and rarely performs advanced troubleshooting. Developers come to the operations experts for advice on big migrations and building complex projects, rather than bug fixes, Rampke said.

For IT operations and engineering professionals, hitching their star to product teams gives them a bigger-picture view, Moss said. The operations support group learns more about what's going on in each area of the company to better inform IT-wide strategy.

When everyone is on the IT operations team

Before IG's operations overhaul, developers followed a flavor of Agile methodology that didn't include IT reliability, performance and operability. When new features were ready, the teams worked to get the demo working smoothly, without raising any concerns about performance and reliability in production.

That changed with mentorship, Silatani said. For example, a new product with rich, user-tailored features will need a lot of data to support it -- has anyone thought about how much load that would put on data centers? With mentors raising awareness and educating developers on the behaviors that encourage reliable, stable operations, IG saw developers start to show performance metrics in feature demos.

SoundCloud no longer uses an IT operations team, having evolved to systems engineering and now production engineering. Non-operations staffers might not do the right thing every time, but production engineers can step in and help: "I can show them the tools, and I can give them the options," Rampke said.

This supportive ecosystem scales because it doesn't encourage blame or distrust between development and operations, even in a large and regulated company. IG pushes 700 changes to production every day, and requests to production have steadily dropped as teams gain access and take charge of how their apps function -- and uptime has improved from 99.94% to 99.99% with these changes, Silatani said.
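To put those uptime percentages in perspective, the short calculation below converts them into implied downtime. It is a rough sketch that assumes a 365-day year and a service expected to run around the clock; the article does not say how IG measures uptime, so the measurement window is an assumption.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year, 24x7 service window


def downtime_minutes(availability_percent: float,
                     period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime implied by an availability percentage over a period."""
    return (1 - availability_percent / 100) * period_minutes


before = downtime_minutes(99.94)
after = downtime_minutes(99.99)

print(f"99.94% uptime: about {before:.0f} minutes of downtime per year")
print(f"99.99% uptime: about {after:.0f} minutes of downtime per year")
```

Under these assumptions, the move from 99.94% to 99.99% cuts implied downtime from roughly 315 minutes a year to roughly 53 minutes, about a sixfold reduction.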

Another benefit: Good engineers don't leave the company because of what Silatani described as "the horrors of production support."

Meredith Courtemanche is a senior site editor in TechTarget's Data Center and Virtualization group, with sites including SearchITOperations, SearchWindowsServer and SearchExchange. Find her work @DataCenterTT.
