The work of an SRE is no longer just an IT function. These engineering teams are vital to modern businesses because customers who experience sustained innovation and continuous service assurance generate value.
In our fast-growing digital economy, most of the world expects to transact, interact and purchase online at their convenience. Site reliability engineers (SREs) are largely responsible for these frictionless digital experiences. They maintain the reliability, availability, and performance of increasingly large and complex IT infrastructures.
But, given the complexity of these modern systems, many SRE teams use automated technologies like intelligent observability to better understand system performance and ensure the availability of apps and vital services.
This raises the question: if automation is the way of the future, do we still need humans in SRE roles?
While intelligent observability helps SREs quickly identify and fix issues that affect system performance, human operators still need to implement these solutions. SREs also need to go beyond fixing technical issues. They need to continuously integrate and deliver new technologies using the kind of ingenuity that doesn’t yet exist in artificial intelligence (AI).
Let’s further explore the question, looking at the role of an SRE and how human teams can collaborate with AI-driven observability tools to increase productivity and innovation.
The rise of the SRE
Google invented the SRE role in 2003 to maintain the company’s growing IT infrastructure while improving its user experience. Other large tech companies like Facebook and Netflix soon followed, hiring SRE-specific teams as they recognized the value of the user experience and began managing similarly sprawling infrastructures.
This focus on the user experience differentiates SREs from traditional IT Operations teams. While both manage reliability, availability, and performance, SREs focus on these aspects from a user experience perspective. IT Operations teams view them from a systems perspective, essentially finding and solving service-disrupting problems.
But IT Operations teams can hinder innovation by solely focusing on a system’s operational state. After all, changes to an organization’s technology stack can up-end systems, which is in direct opposition to the protect-at-all-costs mandate of IT Operations teams. SREs, on the other hand, see innovation as part of their core responsibility to improve the user experience.
But SREs walk a fine line between achieving continuous development and maintaining increasingly diverse architectures.
The role of intelligent observability
Intelligent observability can help SREs balance the tradeoff between system reliability and customer-delighting innovation. Despite these benefits, only 53% of SREs use observability tools.
Many teams likely rely on legacy monitoring tools with the misconception that they are the same as automated observability tools. But just as SRE isn’t a new-fangled word for IT Operations, observability isn’t a rebranding of monitoring. Monitoring determines how well an IT infrastructure is performing based on predetermined rules. This systems-centric method was effective in the old days of mostly static environments. But fast forward to today, where our cloud-native architectures are in a state of continuous change. There aren’t enough rules in the world to predict what will go wrong in these distributed, complex and ephemeral environments.
Observability tools can handle today’s increasingly complex systems because they don’t rely on a rigid set of rules. Rather, observability infers a system’s internal state from its external outputs, specifically its events, logs, metrics, and traces. These tools add visibility to networks, systems, and applications and provide SRE teams with valuable data that is correlated, contextualized, and actionable.
Automation takes the data to the next level and helps SREs keep up as innovations roll out with increasing velocity. AI automates data collection and analysis and provides suggestions for problem resolution. With AI-driven observability, SRE teams can monitor applications, detect anomalous events, identify the root cause, and receive data insights that suggest a fix.
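To make the contrast between rule-based monitoring and anomaly detection concrete, here is a minimal sketch. It assumes a single latency metric, a hypothetical fixed threshold of 500 ms, and a simple z-score test against recent history. Real AI-driven observability tools use far more sophisticated models; the function names and values below are illustrative only.

```python
from statistics import mean, stdev

def static_rule(latency_ms, threshold_ms=500.0):
    """Legacy-style monitoring: alert only when a fixed, predetermined limit is crossed."""
    return latency_ms > threshold_ms

def is_anomalous(history, latest, z_limit=3.0):
    """Baseline-style check: flag a value that sits far from recent behavior."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_limit

# Hypothetical recent latencies (ms) for a service that normally sits near 100 ms.
history = [100, 102, 98, 101, 99, 103, 97, 100]
latest = 180

print(static_rule(latest))            # False: 180 ms is under the fixed 500 ms rule
print(is_anomalous(history, latest))  # True: 180 ms is far outside the recent baseline
```

The fixed rule stays silent because nobody anticipated that 180 ms would be a problem for this service, while the baseline check flags it because it learns what normal looks like from the data.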
In this way, observability increases uptime, helps SREs manage their error budgets, and reduces expensive downtime. But it also goes beyond traditional ROI to offer a return on innovation. Because intelligent observability reduces the toil associated with incident management and cause analysis, it enables SREs to let go of risk-averse attitudes and helps them focus on high-value tasks. This kind of work transcends AI’s abilities.
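The article does not define an error budget, so the following sketch uses the common SRE convention that the budget is the amount of unreliability a service level objective (SLO) permits over a window. The 99.9% SLO, the 30-day window, and the function names are assumptions chosen for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed downtime for a given availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes

def budget_remaining(slo, downtime_minutes, window_days=30):
    """Budget left after subtracting downtime already incurred in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes

# Assumed example: a 99.9% availability SLO over 30 days.
budget = error_budget_minutes(0.999)
print(round(budget, 1))                         # 43.2
print(round(budget_remaining(0.999, 10.0), 1))  # 33.2
```

A team with budget left can afford to ship riskier changes, and a team that has spent its budget has a clear signal to prioritize reliability work.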
The role of human SRE teams
Innovation is one of the reasons why automation cannot replace human SRE teams. Automation can’t yet replicate the human creativity and ingenuity that lead to customer-captivating updates and innovations. And these tools can’t foresee problems that span multiple systems and stakeholders.
Automated tools also work best when the system is behaving normally. But when systems are unpredictable, automation can’t always handle its preprogrammed task. SREs may need to step in and override automated processes with manual work. Additionally, unpredictable behavior could result from multiple, layered issues and require human intelligence to untangle. This is the very essence of the Automation Paradox (originally referred to as the “Ironies of automation” when first conceived by Lisanne Bainbridge as early as 1983).
But this is where collaboration with automated observability tools can help. While automation can’t replace a human SRE team, these teams rely on AI-driven observability to follow through on proactive, high-level work. Automated tools take boring, repetitive tasks off SRE teams’ plates, allowing them to shift from a reactive firefighting mode to a more proactive posture where they can work on value-driving strategic initiatives.
For best results, SREs should collaborate with automated observability to add visibility to systems, respond to incidents and prevent new ones while maintaining full control over the ops environment. It could look something like this:
- Intelligent observability detects a significant incident
- The tool ranks the incident according to its importance
- The observability tool notifies team members, suggesting next steps
By only pulling in relevant SREs and giving each team member specific directives, the intelligent observability platform streamlines processes for those involved while leaving the rest of the team to focus on forward-thinking projects.
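The detect, rank, and notify steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular observability platform works: the `Incident` fields, the `OWNERS` map, and the scoring formula are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    service: str
    error_rate: float    # fraction of failing requests, 0.0 to 1.0
    users_affected: int

# Hypothetical ownership map: which engineers are responsible for which service.
OWNERS = {
    "checkout": ["alice"],
    "search": ["bob", "carol"],
}

def priority(incident):
    """Toy importance score: error rate weighted by the number of affected users."""
    return incident.error_rate * incident.users_affected

def triage(incidents):
    """Rank incidents by importance and pair each one with only its owners."""
    ranked = sorted(incidents, key=priority, reverse=True)
    return [(inc, OWNERS.get(inc.service, [])) for inc in ranked]

def notify(incident, engineers):
    """Build notification text with a suggested next step; a real tool would page or message."""
    return [
        f"{name}: investigate {incident.service} (error rate {incident.error_rate:.0%})"
        for name in engineers
    ]

incidents = [
    Incident("search", 0.02, 1000),
    Incident("checkout", 0.30, 500),
]
for inc, engineers in triage(incidents):
    for message in notify(inc, engineers):
        print(message)
```

In this run the checkout incident is ranked first and only its owner is contacted, so engineers who own unaffected services are never interrupted.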
The work of an SRE is no longer just an IT function. These engineering teams are vital to modern businesses because customers who experience sustained innovation and continuous service assurance generate value. But the answer to delivering continuous development and superior system performance doesn’t lie in automated tools or human intelligence alone. Improving the overall user experience requires collaboration between automation and human operators. And the businesses that leverage the strengths of each will be the ones that succeed in our fast-paced digital economy.