The site reliability engineer role can be both challenging and rewarding for IT pros. To stand out in a competitive job market, aspiring SREs must understand exactly what organizations look for in a candidate.
SRE is a relatively new IT role that involves the automation of operations tasks. The role can be a good fit for systems engineers looking to improve programming skills, as well as developers seeking to manage large-scale infrastructures. Candidates with demonstrated strength in IT systems, software and automation have a competitive advantage during the interview.
Any SRE interview will present a candidate with an array of questions or hands-on exercises intended to evaluate their knowledge of key site reliability skill sets. While these questions or tests can vary dramatically depending upon the specific needs of the hiring organization, an SRE candidate can expect to see a smattering of interview questions across four major domains: software development, monitoring and troubleshooting, networking, and infrastructure and operations.
Read on to learn more about likely SRE interview questions, and what hiring managers look for in a response.
Software development
The first interview questions in this domain usually explore a candidate's basic knowledge of the programming languages the organization uses -- such as Perl or Java -- along with algorithms and data structures such as queues, stacks and heaps. This portion of the interview might ask the candidate to review poorly written code and identify errors, inefficiencies and places where the code might fail or produce undesirable results. Other SRE interview questions related to software development might cover how code interfaces with major applications, such as databases.
A hiring organization will rarely ask a candidate to actually code, but the evaluation might include architectural discussions of using code to address certain problems.
Monitoring and troubleshooting
SRE interview questions in this category typically examine the candidate's understanding of monitoring principles and their practical knowledge of specific tools or practices. For example, the interviewer might ask how to monitor database query times -- a central element of performance monitoring -- or how to parse a log file to create a CSV of specific events or processes.
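As an illustration of the log-parsing exercise mentioned above, the following is a minimal sketch rather than a model answer. The log line format, the `log_to_csv` function and the sample data are all assumptions made for this example; a real interview would supply its own log format.

```python
import csv
import io
import re

# Hypothetical log format assumed for this sketch:
#   <ISO timestamp> <LEVEL> <service>: <message>
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<message>.*)$"
)


def log_to_csv(lines, out, levels=("ERROR",)):
    """Write log events whose level is in `levels` to `out` as CSV rows.

    Returns the number of events written (the header row is not counted).
    """
    writer = csv.writer(out)
    writer.writerow(["timestamp", "level", "service", "message"])
    count = 0
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None or match["level"] not in levels:
            continue  # skip malformed lines and events outside the filter
        writer.writerow(
            [match["timestamp"], match["level"], match["service"], match["message"]]
        )
        count += 1
    return count


# Made-up sample data, including one malformed line.
SAMPLE_LOG = [
    "2024-05-01T10:00:00Z INFO web: request served",
    "2024-05-01T10:00:02Z ERROR db: query timed out",
    "not a log line",
    "2024-05-01T10:00:05Z ERROR web: upstream unavailable",
]

buffer = io.StringIO()
rows_written = log_to_csv(SAMPLE_LOG, buffer)
print(buffer.getvalue())
```

Using the `csv` module instead of joining fields with commas by hand means that messages containing commas or quotes are still escaped correctly, which is the kind of detail an interviewer may probe.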
In other cases, an interviewer might present a candidate with a list of monitoring alerts from a tool and ask them to rate the alerts in terms of priority or severity. Such questions gauge the candidate's ability to prioritize and manage time appropriately.
Troubleshooting discussions can also include various anecdotal scenarios. These explore how a candidate might resolve certain problems -- ranging from a failed VM to a full-blown disaster. An interviewer might also ask about the most serious problems the candidate has encountered in vital areas such as servers, networks or services, and how they resolved those issues.
Networking
Networking questions range from extremely easy to extremely difficult. For an easy question, the interviewer might ask a candidate to define or describe basic networking concepts such as DNS, Dynamic Host Configuration Protocol or TCP. But networking questions can quickly become more granular and detailed. For example, an interviewer might ask how to calculate the number of usable IP addresses on a /23 network, or about the nuances of a TCP connection setup.
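The /23 calculation can be worked through by hand: a /23 prefix leaves 32 - 23 = 9 host bits, so the block holds 2^9 = 512 addresses, and subtracting the network and broadcast addresses leaves 510 usable hosts. The sketch below checks that arithmetic with Python's standard `ipaddress` module. The `usable_hosts` helper and the `10.0.0.0/23` example network are assumptions chosen for illustration.

```python
import ipaddress


def usable_hosts(cidr):
    """Return the usable host count for a conventional IPv4 subnet.

    Assumes a prefix length of /30 or shorter, where the network and
    broadcast addresses are reserved. /31 and /32 follow special rules
    that this sketch does not cover.
    """
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - 2


print(usable_hosts("10.0.0.0/23"))  # 510
```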
SRE interview questions related to networking can also prompt architectural discussions, such as how to identify single points of failure in a basic network map or how to locate potential network bottlenecks.
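One way to reason about single points of failure is to treat the network map as a graph and ask which device, if removed, would disconnect the rest. The sketch below takes a brute-force approach under that assumption: it removes each node in turn and checks connectivity with a breadth-first search. The topology in `NETWORK` and the function names are hypothetical, and larger maps would call for a proper articulation-point algorithm instead.

```python
from collections import deque


def is_connected(graph, removed):
    """Check whether the graph stays connected once `removed` is taken out."""
    nodes = [node for node in graph if node != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


def single_points_of_failure(graph):
    """Return the nodes whose loss would split the remaining network."""
    return sorted(node for node in graph if not is_connected(graph, node))


# Hypothetical topology: two cross-linked switches behind one router,
# each with a single server attached.
NETWORK = {
    "router": {"switch-a", "switch-b"},
    "switch-a": {"router", "switch-b", "web-1"},
    "switch-b": {"router", "switch-a", "db-1"},
    "web-1": {"switch-a"},
    "db-1": {"switch-b"},
}

print(single_points_of_failure(NETWORK))  # ['switch-a', 'switch-b']
```

In this made-up map the router is not flagged, because the remaining devices can still reach one another through the link between the switches. Each switch is flagged because a server depends on it alone.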
Infrastructure and operations
An SRE candidate will usually face an array of infrastructure and operations questions -- and some of the basic ones will typically involve OSes and security. For example, an interviewer might ask what happens when the ps command is entered at a UNIX prompt. Candidates might have to explain how to secure a container image, or the difference between RAID 0 and RAID 5 and when to use one over the other. Other basic questions might involve the difference between a service-level agreement and a service-level indicator, or the differences among virtualization, containers and Kubernetes.
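The distinction between a service-level indicator and a service-level agreement can be made concrete with a small calculation: the indicator is a measured value, while the agreement is a commitment that the measurement is compared against. The sketch below assumes an availability-style indicator, defined as successful requests divided by total requests. The 99.9% target and the request counts are invented for illustration.

```python
def availability_sli(successful_requests, total_requests):
    """Return the fraction of requests that succeeded (the measured indicator)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


# Hypothetical contractual target and made-up traffic figures.
SLA_TARGET = 0.999
measured = availability_sli(successful_requests=999_412, total_requests=1_000_000)
meets_sla = measured >= SLA_TARGET

print(f"SLI = {measured:.4%}, meets SLA target: {meets_sla}")
```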
Infrastructure questions can become increasingly complex. For example, candidates might be asked to explain how they would scale certain elements of IT infrastructure, such as a vital service. A candidate's approach to this task can reveal a lot about their expertise and confidence as an SRE. Another complex example might involve distributing 1 TB of data to 5,000 different nodes, and then keeping those nodes up to date. If it takes two hours to copy the data to just one server, copying to each server in turn would take 10,000 hours. The interviewer might ask how the candidate would approach the task so that it finishes far sooner, and how they would verify that the data isn't corrupted in transit.
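One possible line of reasoning for the distribution question is a fan-out scheme, in which every node that already holds the data copies it to one new node in each round, so the number of holders doubles each time. The sketch below estimates the time under simplifying assumptions that are not in the scenario itself: every copy still takes two hours, the copies in a round run fully in parallel, and the network is not a bottleneck. It also shows a checksum comparison as one way to detect corruption. The function names are hypothetical.

```python
import hashlib


def doubling_rounds(target_nodes):
    """Rounds needed when each holder copies to one new node per round."""
    holders = 1  # the original source, which is not counted as a target
    rounds = 0
    while holders - 1 < target_nodes:
        holders *= 2
        rounds += 1
    return rounds


def sha256_of(chunks):
    """Hash data incrementally so a large file never has to fit in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


HOURS_PER_COPY = 2
NODES = 5000

sequential_hours = NODES * HOURS_PER_COPY
fanout_hours = doubling_rounds(NODES) * HOURS_PER_COPY
print(sequential_hours, fanout_hours)  # 10000 26

# A receiver compares its own checksum with the one published by the source.
source_checksum = sha256_of([b"block-1", b"block-2"])
received_checksum = sha256_of([b"block-1", b"block-2"])
print(source_checksum == received_checksum)  # True
```

Thirteen doubling rounds are enough because 2^13 = 8,192 holders exceeds the 5,001 needed (the source plus 5,000 nodes), while 2^12 = 4,096 does not. Real deployments would also have to account for bandwidth limits and failed transfers, which this estimate ignores.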
Unsurprisingly, SRE interview questions often involve workflow and process automation. These questions might be as simple as, "How do you create and review an automation script?" But they can also require the candidate to demonstrate knowledge of prominent IT automation tools, such as Ansible, Datadog, Puppet and Vagrant. Fortunately, such detailed knowledge requirements are typically outlined in SRE job postings. Even when specific tools are not involved, candidates should be ready to discuss basic automation approaches and practices.