Phil Dibowitz, systems engineer at Facebook, led its Chef team through a three-year process to rethink configuration management to scale on par with the Web-scale company's data centers. SearchDataCenter.com spoke with Dibowitz after all of Facebook's infrastructure, as well as its backend IT, moved to Chef and the team turned off CFEngine servers.
Dibowitz and his team helped individual service owners take on their own cookbooks and guided the company-wide conversion to Chef. It took three years, Dibowitz said, not because of the technological change, but because of the structural change, with software engineers taking on ownership of the ops side.
With the Chef DevOps migration complete, Dibowitz and his team are working on operating systems at Facebook. The problem is that OSes are set up once and never truly owned -- Dibowitz wants to change how OSes are installed and managed. He calls it a "natural dovetail with Chef." The tools, models and workflows hammered out with Chef will come into play for OS speed and management improvements.
What tools for configuration management and automation have you added, or gotten rid of, as your DevOps model matures?
Phil Dibowitz: We do exclusively systems configuration on Chef. App configuration is a different system. We had to draw a hard line and say app stuff stays here; systems stuff stays here. We already had great systems for deploying and configuring apps that were Facebook-aware. Apps have different requirements and different testing needs. While Chef can do one, the other, or both, it was better for us to target one and do it really well.
We released a bunch of tools [in early 2014] on GitHub and RubyGems: Taste Tester for testing; [a rewritten version of] Grocery Delivery, to distribute cookbooks to Chef servers; and Chef Server Stats [a small utility to pull monitoring information from Chef servers]. They're for how we use Chef and show why we do things that way.
I was also the first external committer to the Chef code base. Chef now offers maintainership to the community. So, I became an official Chef maintainer. I contributed more stuff for Chef Client, such as a feature to install multiple packages in one resource.
We released a few cookbooks that Facebook uses internally. They're there as a good example for people to touch and play with. I'd like to release a bunch more as soon as I can get them cleaned up and ready for the world.
How does Facebook benefit from open source community contributions?
Dibowitz: For Grocery Delivery and Taste Tester, we got bug fixes and feature enhancements from the community. That's really helpful; we gained additional support features, for example. When we talk to the community, there's a shared language. You can discuss ideas.
Facebook has become a fairly big name amongst Chef users, and I try to make it really clear that the things we do are applicable to shops of any size. We had to solve these problems, and you can use what we learned to solve them at smaller scale for you too. We work with Yahoo!, but also tons of small companies, banks, big enterprises, Web 2.0 startups, Chef enthusiasts with home deployments. ... We have the perspective of several different ways to use Chef and do configuration management.
How has the concurrent evolution of Open Compute Project shaped how you do things at Facebook?
Dibowitz: Early on, when talking about the possibility of building our own switching stack, our pie-in-the-sky scenario was being able to treat them as much like a server as possible, but with a real switch and as fast as that. How do we handle all the stuff that isn't the switch itself, so it all just works? How do we make networking work like it hasn't before?
If there was going to be a part of the device that looks like a Linux box, then you have to maintain all the stuff, like syslog. So with Chef on those boxes immediately that problem is solved.
Since the dawn of time, the problem with managing network devices is that it's really hard to automate configuration. The OS fundamentally assumes a lot of state in the admin's head. If you're adding six rules between firewalls, your interim state might not be safe, might lock you out, etc. It's slow, hard and error prone. Then there's the typical Linux process where you write out the config file, it reads it and you're done. ... That's hard on a legacy environment of how networking has always worked.
We don't use Chef for the actual switch to push commands down to the ASICs, but the concept is very similar to the way Chef works -- more open source to come there. We built this the same way we build everything else: Start with a small base that works and iterate. FBOSS and Wedge provide a base. We now have a basic set of utilities for routing and
At least meet the minimum, and strive for the ideal. This is a DevOps model. ... For the Chef migration, we supported the commonality, and people sent feature requests. We prioritized the most standard projects then did harder or unique and smaller migrations. That same model works no matter what you're building.
How are the IT organizations joining you in DevOps different today than they were two years ago?
Dibowitz: There are more companies, and different kinds of companies. I couldn't have imagined three years ago, at my first ChefConf, that there'd be massive banks asking about continuous delivery. There are a lot of banks, investment firms, old-world traditional companies, aerospace companies and so on.
I still get the same technical questions, such as "How do you implement a cookbook to do X?" Or, "How do you train people?" But I also get crazy new ones, like "How does Chef interplay with audits?" And, "How does your PCI auditor feel about it?"
Now there's also a lot of talking about the people side. How do you build the right team, foster the right attitude and ensure that people are being responsible and reliable while making all these changes? This might be because more big companies are embracing DevOps.
What are the avoidable mistakes you wish you'd known when starting to work as a DevOps company?
Dibowitz: I'd break them down into technical and cultural mistakes.
Cultural: Everyone should do code reviews. Someone has to review and accept your code before you can commit it. Big companies already buy into that model. GitHub users buy into it as well. But tons of companies say, "Oh, we're too small" or "We don't have time." And what happens is that you spend far more time backing out of things and working around processes. With code reviews, you collaborate more and you gain better perspective on the simpler or more repeatable path. Use GitHub or Phabricator to do that.
Technical: Use a correctness tool to find errors and syntax problems. Foodcritic is one correctness tool, or use the RuboCop linter. With a human checking syntax and style, there are too many rounds of reviews and missed errors. Those tools find the nitpicky things and you focus just on making the change you want to make. Humans can focus on approach and overall code quality. That feedback is what humans are good at and people are excited to receive it.
I wish we had used those sooner. I would have delayed the project to do it, because it makes the experience so much better.
In 2013, you worried about inadvertent modifications -- unintended consequences from changes you made in Chef -- and wanted more transparent testing for it. How has that changed?
Dibowitz: That's where Taste Tester came from. [Taste Tester evolved from a small utility called Chef Test, which was highly specific to Facebook rather than open]. The way we built with Chef was that you propose a cookbook change, someone reviews it, you commit it and it's available to the entire global infrastructure in under an hour. How do you see what it will do before you commit it?
Taste Tester now uses Chef Zero, a lightweight test server [rather than Private Chef]. You pick a small group of machines from production and point your code repository at that test group. You can watch the code working in production and inspect the changes and what it affects live. Change a system tunable, for instance, and see if it actually fixes CPU usage like you expected. Or test a change on different types of servers: Web server, database server and so on. So that you can't strand machines, the test expires after an hour. If you're satisfied with the small-scale production test, you can commit it and have it go live.
We have also always done sharding, where you take a consistent hash of a host name and put it into a percentile. Then you can tell the system, "If I'm in X percent of machines, do this. Otherwise, do the standard." You can get a change live but control the rollout -- over many days instead of 30 minutes, for example. It helps if you want to remove something that may have interdependencies, or do A/B testing, or roll out a really big package that will hit the disk hard.
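[Editor's note: The percentile sharding Dibowitz describes can be sketched roughly as below. This is a minimal illustration, not Facebook's implementation. The function names, the choice of SHA-256, the 100-bucket layout and the example host names are all assumptions made for the sake of the example.]

```python
import hashlib


def shard_percentile(hostname: str) -> int:
    """Map a host name to a stable bucket from 0 to 99.

    Assumption: SHA-256 modulo 100 stands in for whatever hash the
    real system uses. The key property is that the same host name
    always lands in the same bucket.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_shard(hostname: str, percent: int) -> bool:
    """Return True if the host falls within the first `percent` buckets."""
    return shard_percentile(hostname) < percent


if __name__ == "__main__":
    # Hypothetical host names, used only to show the rollout check.
    hosts = [f"web{i:03d}.example.com" for i in range(1000)]
    enabled = [h for h in hosts if in_shard(h, 10)]
    print(f"{len(enabled)} of {len(hosts)} hosts would get the change at 10 percent")
```

Because each host's bucket never changes, raising the percentage only adds hosts to the rollout. A machine that received the change at 10 percent still has it at 50 percent, which is what lets a rollout be stretched over many days.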
The traditional way [of testing] is to go through dev, then QA, then production; but that test is less valid. If you haven't seen it in production, then you don't know how it's going to work.
What is on your wishlist?
Dibowitz: I've gotten multi-package support, which was on my wish list. It hasn't rolled out in a meaningful way yet, but it's there and it's upstream so we can start using it. That was a big pain point for the longest time: Each package started a new transaction, downloaded the code, etc., etc., etc. You could download them all at once and save a ton of time provisioning.
Sometimes you need two packages to be installed on one transaction or it won't work. This was a terrible hack before [multi-package support].
Meredith Courtemanche is the site editor for SearchDataCenter. Follow @DataCenterTT for news and tips on data center IT and facilities.