
Uber hits bumps in the road with microservices challenges

Uber execs reveal the tradeoffs they've made in building their business upon hundreds of microservices, balancing the advantages of speed with challenges of system reliability.

NEW YORK -- Web-based tech startup Uber Technologies Inc. is blazing a trail in DevOps, and its message to enterprises that would follow it into microservices is simple: Buckle up.

Like any new technical innovation, microservices come with challenges, said Susan Fowler, site reliability engineer for San Francisco-based Uber, in a presentation this week at the Velocity conference.

The advantage to microservices is "we get to create new features and new products at an insanely high rate," Fowler said. "It's been a magical thing in a lot of ways."

Each small microservices team is like a little startup that can deploy multiple times a day, roll back changes and push out features, which allows Uber to grow fast. However, with more than 1,300 microservices backing its ride-sharing app, Uber's site reliability engineering team has its work cut out for it.

"Developer philosophy comes at a cost," Fowler said, and that cost comes in the form of imperfect design, poor communication, technical debt, more ways for the system to fail, and many outages and incidents to resolve and review each week.

A three-pronged approach to site reliability

Uber site reliability engineers have taken a threefold approach to mitigate those tradeoffs and cut down on sprawl: production-readiness standardization; process management through production-readiness reviews; and a third process Fowler described as "evangelizing, teaching and sticking to it."

Standardization is difficult to impose on such a vast environment of independent pieces, but it is possible at least at a high level. The Uber site reliability engineering team has eight requirements for production-readiness for each microservice: stability, reliability, scalability, performance, fault tolerance, catastrophe-preparedness, monitoring and documentation.

"You have to take higher-level principles, and then find quantifiable requirements that correspond to each of those and produce measurable results," Fowler said.

Under the fault-tolerance requirement, for example, each microservice is put through its paces with load testing and chaos testing by internally developed utilities, called Hailstorm and uDestroy.

During fault-tolerance testing, "We actively push the system to fail in production," Fowler said.
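Hailstorm and uDestroy are internal Uber tools whose interfaces weren't described in the talk, but the underlying idea of chaos testing can be sketched in a few lines: terminate a random slice of a service's instances and confirm the survivors can still serve traffic. The function and names below are hypothetical illustrations, not Uber's implementation.

```python
import random

def chaos_round(instances, kill_fraction, is_healthy):
    """Randomly terminate a fraction of instances, then report whether
    the remaining fleet still passes its health checks."""
    doomed = set(random.sample(instances, int(len(instances) * kill_fraction)))
    survivors = [i for i in instances if i not in doomed]
    return survivors, all(is_healthy(i) for i in survivors)

# Toy fleet of ten instances; in this sketch every survivor is healthy.
fleet = [f"svc-{n}" for n in range(10)]
survivors, ok = chaos_round(fleet, 0.3, lambda inst: True)
```

A real chaos run would drive actual instance terminations and route live health probes through the `is_healthy` hook, rather than a stub lambda.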

The second practice adopted by Uber's site reliability engineers is process management, in the form of quarterly production-readiness reviews and blameless outage reviews.

These meetings can take from 30 minutes to over two hours. The participants architect the microservice on a whiteboard, talk about its dependencies, figure out single points of failure and walk through a checklist of production-readiness standards. Uber has also automated the checklist process with a utility that scores each microservice weekly.
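The weekly scoring utility wasn't detailed in the talk, but the idea maps naturally onto the eight production-readiness standards Fowler listed: each standard becomes a pass/fail check, and the score is the fraction passed. This is a hypothetical sketch of that shape, not Uber's actual tool.

```python
# The eight production-readiness standards from Fowler's talk.
STANDARDS = [
    "stability", "reliability", "scalability", "performance",
    "fault tolerance", "catastrophe-preparedness",
    "monitoring", "documentation",
]

def readiness_score(checks):
    """checks: dict mapping standard name -> bool (did the service pass?).
    Returns a 0-100 score; a standard with no recorded result counts as a failure."""
    passed = sum(1 for s in STANDARDS if checks.get(s, False))
    return round(100 * passed / len(STANDARDS))

# A service passing six of the eight standards scores 75.
score = readiness_score({s: True for s in STANDARDS[:6]})
```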

"When you get development teams into the room and you ask all these questions, there's no one person who knows all the answers," Fowler said. "There's no team-level understanding because of how sprawled out these things are."

After this exercise, developers understand their service and infrastructure, and how their particular service fits in when someone requests an Uber from their phone. This cuts down on tech debt and keeps their view of the architecture and infrastructure current.

This is also where the evangelism piece comes in. Many developers see such meetings as interruptions to the development process, and Fowler and her team have worked to counteract that notion.

"When you have software development without any principles, without any guidance, it's lost," Fowler said. "All this process management stuff ... that's a guiding force. It's not an interruption."

Anatomy of an outage: Uber shares lessons learned

Not all outages can be prevented, of course. And so when an inevitable failure occurs, the Uber site reliability engineering team has a few simple rules for how to handle it. These were described in a separate keynote presentation by Tom Croucher, also a site reliability engineer for Uber, this week at the Velocity conference.

Rule No. 1 is, "Always know when it's broken," which seems obvious, but isn't always easy to follow as monitoring data floods into engineers' dashboards, Croucher said.

"It's great to be a nerd and just obsess about, 'Look at all this data, look at all this stuff that I can fiddle with,'" Croucher said. "That's not knowing when it's broken. Knowing when it's broken means knowing when your users are affected and when your users cannot do something that they need to do."

Croucher walked the audience through a real-world post-mortem on a recent outage. Someone made a simple command syntax error in an attempt to update the firewall service with a set of standard rules and a standard configuration -- but inadvertently caused all services to deny all traffic by default.

This led Croucher to rule No. 2: Avoid global changes.

"If you have global changes, you're putting the software at risk," he said. In this case, the team got lucky that a developer forced the firewall change to one small cluster within the infrastructure and noticed the problem about 20 minutes before Puppet would run and refresh all systems.
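The near-miss Croucher describes -- a bad change caught on one small cluster before Puppet refreshed everything -- is the canary pattern. A minimal sketch of that gating logic, with hypothetical names (the talk did not describe Uber's rollout tooling):

```python
def staged_rollout(clusters, apply_change, verify):
    """Apply a change to one canary cluster first; stop before touching
    the rest of the fleet if verification fails there.
    Returns the list of clusters the change actually reached."""
    canary, rest = clusters[0], clusters[1:]
    apply_change(canary)
    if not verify(canary):
        return [canary]          # damage contained to the canary
    for cluster in rest:
        apply_change(cluster)
    return clusters

# Simulate the firewall incident: the change breaks the canary,
# so verification fails and the rollout halts there.
touched = staged_rollout(
    ["canary", "cluster-1", "cluster-2"],
    apply_change=lambda c: None,     # stand-in for pushing the config
    verify=lambda c: False,          # health check fails on the canary
)
```

In the incident described, the "canary" was accidental; the lesson is to make that containment a deliberate step in every rollout.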

The problem was compounded, however, when a reboot caused the firewall problem to recur on some machines. So, the team moved to rule No. 3: It's easier to move traffic than to fix things on the fly.

"Ideally, your software is in two sections," Croucher said. "But if you can't afford multiple availability zones and multiple data centers, split your clusters in two and just deploy them in two separate bits, because I can almost guarantee that your own bad software releases are the thing[s] that are causing most of your outages."
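Moving traffic between the two halves Croucher describes amounts to rewriting routing weights: drain the broken cluster into the healthy one, fix the broken half offline, then restore the split. A minimal sketch of that weight shuffle, with illustrative names:

```python
def shift_traffic(weights, from_cluster, to_cluster):
    """Return new routing weights with from_cluster fully drained
    into to_cluster; the original mapping is left untouched."""
    new = dict(weights)
    new[to_cluster] = new.get(to_cluster, 0) + new.pop(from_cluster, 0)
    return new

# Two-way split as Croucher suggests; drain the half with the bad release.
weights = {"cluster-a": 50, "cluster-b": 50}
drained = shift_traffic(weights, "cluster-a", "cluster-b")
```

In production this weight table would live in a load balancer or service-discovery layer, which is why Croucher stresses practicing the move at peak times rather than assuming it works.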

Unfortunately, in moving the traffic after the firewall command outage, another bug was uncovered in a Varnish user cache at the alternate data center location that locked up all traffic in that data center.

That brought up rule No. 4: Mitigations must be part of the normal deployment process going forward and rigorously tested so systems don't go without needed updates and cause recurring problems.

"If you move traffic around -- and you should -- you have to test it, you have to practice and you have to make sure that you practice at peak times, that you practice it without letting anybody know," so they can't artificially defend themselves, Croucher said.

Ultimately, Croucher said he was proud of the way the Uber team handled this cascading incident and resolved it quickly.

"If Uber doesn't work, people don't get places," he said. "They don't earn their living because of the impact that we have on their lives."

Beth Pariseau is senior news writer for TechTarget's Data Center and Virtualization Media Group. Write to her at [email protected] or follow @PariseauTT on Twitter.

Next Steps

What developers will love and hate about microservices

Where DevOps, containers and microservices intersect

Containers and microservices' effect on IT

Learn about the state of microservices
