DevOps eliminates the traditional silos that separate software development and operations teams. But the rapid and iterative process of modern agile development leaves a reliability gap -- and teams can end up deploying new, yet unreliable, services at a quick pace.
A site reliability engineer (SRE) builds and implements quality software that enhances the reliability, repeatability and flexibility of production services and systems in a DevOps environment. Essential SRE skills span the software stack, from code creation and improvement to deep technical troubleshooting.
An overview of SRE roles and responsibilities
The notion of an SRE started at Google in 2003, as a means to make large-scale data centers more reliable, scalable and efficient. Site reliability engineering eventually matured into its own domain intended to automate operations tasks from capacity planning to disaster response.
An SRE essentially substitutes automation for human labor. To achieve this, SREs typically build self-service tools -- including those for automatic provisioning and test environment configuration -- for developers. An SRE team addresses and improves the performance, availability, latency, efficiency, monitoring, troubleshooting and planning of production software and services.
SREs are both developers and troubleshooters. They often split their time evenly between software development for better site performance and availability, and IT operations and support tasks, such as addressing help desk escalations. In development tasks, SREs actively consult with project teams to ensure the emerging software conforms to business requirements for availability, security, maintainability and performance. SREs work with the operations side to ensure delivery and deployment pipelines run smoothly.
Critical SRE roles and responsibilities include:
- Develop the software and processes needed to maintain services. The tools developed to maintain services typically involve data collection and extensive monitoring.
- Capture and analyze major metrics, such as availability, mean time between failures and mean time to repair, and develop new metrics and KPIs as necessary. Add these metrics to monitoring dashboards and reporting systems.
- Use detailed monitoring to improve the availability and performance of applications, services, systems and infrastructure. Create new alerts to find anomalies and understand the root cause of system failures.
- Create and deploy automation, alerting, self-healing architectures and other technologies to make the environment more maintainable.
- Monitor, manage and troubleshoot routine processes, and use what is learned to improve workflows.
- Create and maintain documentation for processes, automation, infrastructure, resources and services.
- Act as a subject matter expert and coach to mentor developers and engineers, as well as assist junior developers with software troubleshooting and debugging.
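As an illustration of the metrics named in the list above, the following is a minimal sketch of how availability, mean time between failures (MTBF) and mean time to repair (MTTR) might be derived from a list of outage records. The `Incident` class and the `reliability_metrics` function are hypothetical names invented for this example, and the sketch assumes one common convention: MTBF is total uptime divided by the number of failures, and MTTR is total downtime divided by the number of failures. Real monitoring systems may define these differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One outage (hypothetical record type for this sketch)."""
    start: datetime  # when the service went down
    end: datetime    # when the service was restored


def reliability_metrics(incidents, window_start, window_end):
    """Return (availability, mtbf, mttr) for one observation window.

    Assumes incidents do not overlap and fall entirely inside the window.
    """
    total = window_end - window_start
    downtime = sum((i.end - i.start for i in incidents), timedelta())
    uptime = total - downtime
    failures = len(incidents)
    if failures == 0:
        # No failures observed: MTBF and MTTR are undefined for this window.
        return 1.0, None, None
    mttr = downtime / failures   # average time to restore service
    mtbf = uptime / failures     # average uptime per failure
    availability = uptime / total  # fraction of the window spent up
    return availability, mtbf, mttr


if __name__ == "__main__":
    window_start = datetime(2024, 1, 1)
    window_end = window_start + timedelta(days=30)
    incidents = [
        Incident(window_start + timedelta(days=5),
                 window_start + timedelta(days=5, hours=1)),
        Incident(window_start + timedelta(days=20),
                 window_start + timedelta(days=20, hours=3)),
    ]
    availability, mtbf, mttr = reliability_metrics(
        incidents, window_start, window_end
    )
    print(f"availability: {availability:.4%}")
    print(f"MTBF: {mtbf}")
    print(f"MTTR: {mttr}")
```

In this made-up 30-day window with a one-hour and a three-hour outage, the sketch yields an MTTR of 2 hours, an MTBF of 358 hours and an availability of roughly 99.44%. Figures like these are what would typically be fed into a dashboard or reporting system.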
Necessary SRE skills and experience
As with many DevOps roles, there is rarely a single, well-defined educational or career path to become an SRE. This means an organization can consider many different types of candidates for an SRE role, but the job requirements might involve vast differences in education and expertise. In terms of education, an SRE candidate is typically expected to have a bachelor's degree in computer science, but equivalent experience or another technical degree might be acceptable.
The real yardstick for an SRE is in experience and expertise. A candidate will likely need more than five years of experience supporting scalable service environments, and should possess at least three years of software development experience that involved major languages such as Java and Python.
A typical SRE role demands a broad -- and proven -- skill set. As an example, an SRE candidate should bring a strong knowledge of major operating systems, such as Linux, and their administration, as well as of networking, load balancing, protocols such as TCP/IP and services like DNS. Knowledge of other technologies, including servers, storage and virtualization, and of monitoring tools, such as Nagios, Splunk and Grafana, is also important.
An SRE must be an excellent software developer who can create tools for infrastructure management and automation, and who is familiar with DevOps engineering practices and diverse technical problems. Development demands a comprehensive knowledge of important CI/CD pipeline tools, including Jenkins, GitLab and SonarQube.
IT troubleshooting, root cause analysis and mitigation of production outages are also critical SRE skills. In many cases, the SRE must triage multiple issues simultaneously under the extreme pressure of a critical production environment. Knowledge of log analytics techniques and tools, such as Loggly and Splunk, can offer a leg up on other job candidates.
The softer side
Essential SRE skills are not just technical in nature.
For example, an SRE must be well-organized and comfortable operating in a mission-critical or high-volume production environment -- often in industries that are subject to regulatory compliance and security requirements. Large hiring organizations might look for experience supporting applications with a 24/7 service-level agreement and providing on-call support in a network operations center.
Finally, an organization often calls on its SRE to make presentations and create documentation for a variety of audiences. This means the SRE should be an expert communicator with excellent written, verbal and virtual collaboration skills -- especially when working with less technically inclined individuals.