AI-based observability for ITOps, DevOps, and SREs allows teams to focus on developing better services with superior customer experience.
IT Operations, DevOps, and Site Reliability Engineers (SREs) must move beyond the antiquity of rules-based solutions and put the modern machine learning of AI-based observability to work, sooner rather than later.
AI-based observability accelerates incident resolution, improves service assurance, simplifies cloud management, and manages digital transformation. It liberates IT professionals from rules-driven workflows that are repetitive mental drudgery, providing the opportunity to advance new skills, knowledge, and productivity.
Why the need for AI-based observability and why now?
I grew up in England, close to the birthplace of the Industrial Revolution. The wrenching, cataclysmic social changes in the transition to the new steam-powered and automated manufacturing processes between 1780 and 1830 were the prelude to a new world. An agrarian and subsistence economy was revolutionized by machines used to automate the production of textiles and other goods at a mind-boggling scale. You could say the drive to mechanize became the philosophical core of twentieth-century life. Orderly systems. Cogs in a wheel. Predictable behaviors. Pre-determined results. All governed by rules. All beautifully captured by Charlie Chaplin in Modern Times.
Moving beyond rules
The days of the “Spinning Jenny” are long gone. But rules are still everywhere. As we grapple again with the turmoil of change in the twenty-first century, it’s a good time to ask why. This question is particularly germane as many people wonder if AI and machine learning (ML) are about to trample life as we know it – just as the Luddites feared machines would during the Industrial Revolution.
The rules I speak of are not the guardrails of everyday life like obeying traffic laws, complying with health regulations, and paying your taxes (or else).
An AI-based observability platform is about improving IT, Dev, and Site Reliability operations at enterprises that – despite running globally at scale on esoteric virtual cloud systems – typically bet their performance and uptime on rules while at the same time trying to reduce toil.
Rules-based solutions have serious limitations that make them insufficient for IT, Dev, and SRE teams:
- Rules give the illusion of simplicity but in practice add exponential complexity, because they are brittle and easy to break.
- Rules are expensive to maintain and carry hidden costs.
- Rules are unpredictable in complex environments due to their tiny scope.
- Rules are undecidable in real-world failure scenarios, which renders them deficient for continuous service assurance.
AI and machine learning are new enablers
In a very real sense, AI and ML liberate teams from each and every limitation of rules. The fundamental difference is how AI/ML uses observability and monitoring data. A rules-based system separates the infrastructure you’re trying to understand from the data it produces. To predict trouble, it applies pre-built logic to the alerts generated by system events. This doesn’t always work as advertised.
AI/ML takes the opposite approach. It does not treat data as separate from the system. You cannot go from state to event, but you can go from event to state. This approach assumes there is a signal in the noise produced by alerts, which is interpreted mathematically by statistical machine learning to infer the existence of issues worth investigating. And if you think that notion is an oddity, this relationship of the system to the observer has been common currency for centuries in everything from Eastern mysticism to quantum mechanics.
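To make the contrast concrete, here is a minimal Python sketch, not any vendor's actual method. It compares a fixed threshold rule with a simple statistical check that infers "something looks wrong" from the metric's own recent behavior. The function names, the 100 ms threshold, the z-score cutoff of 3, and the latency samples are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def rule_based_alert(value, threshold=100.0):
    """A fixed rule: fire only when the metric crosses a hand-picked threshold."""
    return value > threshold

def statistical_alert(history, value, z_cutoff=3.0):
    """Fire when the value deviates strongly from the metric's own recent history."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > z_cutoff

# Hypothetical latency samples (ms) for a service that normally sits near 40 ms.
history = [38, 41, 40, 39, 42, 40, 41, 39, 40, 41]
spike = 70  # far outside normal behavior, yet below the 100 ms rule

print(rule_based_alert(spike))            # the static rule stays silent
print(statistical_alert(history, spike))  # the statistical check flags it
```

The point of the sketch is only that the second function needs no per-service threshold: it goes from the events (the samples) to a judgment about state, which is the direction described above.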
In the practice of different types of operations and teams having to work together, we are just now realizing you cannot separate the two.
AI-based observability allows you to discover incidents previously not detected by a rules-based solution. Statistical ML can use the same algorithm to infer one type of instance from other types. Algorithms themselves are error-resistant and don’t need to have all of the data to make reliable conclusions. Algorithms are deterministic – meaning that they always produce the same output for the same input. Algorithms work no matter the order in which the data is processed. Because AI-based observability is a mathematical ML approach, a single algorithm can replace the logic of hundreds of thousands of rules in fractions of a second.
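As a rough illustration of two of these properties, determinism and order-independence, the following Python sketch groups alert messages by word overlap (Jaccard similarity) and takes connected components of the resulting "similar enough" graph. This is an assumed toy technique, not the algorithm used by any particular product. The alert texts, the `cluster_alerts` name, and the 0.5 similarity cutoff are hypothetical.

```python
from itertools import combinations
import random

def jaccard(a, b):
    """Similarity of two alert messages as a value in [0, 1], based on shared words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not (ta | tb):
        return 1.0  # two empty messages are treated as identical
    return len(ta & tb) / len(ta | tb)

def cluster_alerts(alerts, min_similarity=0.5):
    """Group alerts into connected components of the 'similar enough' graph.

    Returns a set of frozensets, so the result does not depend on input order.
    """
    parent = {a: a for a in alerts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(parent, 2):
        if jaccard(a, b) >= min_similarity:
            parent[find(a)] = find(b)

    groups = {}
    for a in parent:
        groups.setdefault(find(a), set()).add(a)
    return {frozenset(g) for g in groups.values()}

# Hypothetical alert stream.
alerts = [
    "disk usage high on db-01",
    "disk usage high on db-02",
    "disk usage critical on db-02",
    "login failure for user admin",
    "login failure for user root",
]

shuffled = alerts[:]
random.Random(0).shuffle(shuffled)

# Same input gives the same clusters, in any arrival order.
print(cluster_alerts(alerts) == cluster_alerts(shuffled))
```

One similarity function covers both the disk alerts and the login alerts here, where a rules-based setup would typically need a separate pattern for each family of message.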
All of this sounds so straightforward, and it’s almost magical. So why isn’t everyone using observability?
Breakthroughs and impediments for change
For one thing, the application of statistical ML for observability entails fairly new techniques. The concepts have been around since Alan Turing’s breakthrough 90 years ago. The foundations for modern AI algorithms were laid during the 1960s, ’70s, and ’80s. The earliest commercial applications were stock trading systems, followed by others such as fraud detection and handwriting recognition.
Three recent breakthroughs have finally facilitated the wide adoption of AI/ML for IT operations (“AIOps”) and now for broader use case applications of observability.
First, statistical ML requires very powerful computers, which have become common only in the last decade. Second, statistical ML requires lots and lots of data. The ease of storing, accessing, and using Big Data is finally practical thanks to the global cloud, and of course, thanks to the ironclad continuity of Moore’s Law on storage and compute. Third, the knowledge that was once the exclusive preserve of academic computer science is now spreading to the wider operations community.
What may be the only remaining barrier to the pervasive use of AI-based observability? Resistance by IT, Dev, and SRE professionals!
We’re only human. People tend to be suspicious of new ML approaches because they don’t understand them. The topic is certainly complex. It’s easier for people to wrap their heads around Boolean logic, which is an old and familiar way of thinking. Their comfort zone is stretched when you talk about neural networks, similarity as a range rather than a Boolean, backpropagation, high-level calculus, category theory, homology, and probability – all advanced mathematical terms unfamiliar to many IT professionals.
AI-based observability brings a future of change and hope
Despite these issues, there’s great hope for AI-based observability because it is a small part of what I call Industrial Revolution 2.0 – which encapsulates all of the innovations springing from AI/ML. The first Industrial Revolution lifted people from a life of misery, hunger, and poverty. In Revolution 2.0, people will see positive albeit unpredictable improvements in their work and life.
The first Industrial Revolution produced unpredictable social innovations that permanently changed society. Technological advances included textile manufacture, iron production, steam power, machine tools, chemicals, cement, gas lighting, glass making, agriculture, mining, and transportation. Social effects included the factory system, improved standards of living, clothing and consumer goods, urbanization, better life for women and families, and safer labor conditions.
Powered by AI/ML, Revolution 2.0 will also change our lives. I am optimistic these changes will be positive. For example, if you are an IT Operations professional, consider your current work life. Frankly, the rules-based world mostly provides workflows that are mental drudgery: menial, repetitive, and low-pay. AI/ML will automate that miserable life and provide the opportunity for technology professionals to up-level the application of their knowledge in a more productive and enjoyable fashion.
We cannot ignore the massive changes that AI will bring to job roles and career opportunities. The World Economic Forum predicts the use of AI will cause 75 million jobs to be displaced by 2022. However, 133 million new ones will be created – a net increase of 58 million. With this change comes the requirement that today’s workers “reskill” for these new opportunities. The 2018 study projects that “no less than 54 percent of all employees will require significant reskilling and upskilling.” If you work in IT, it is very important to begin planning for this change NOW.
Moogsoft has many large enterprise customers that are already reaping huge benefits from AIOps and observability use cases. Our new approach is enabling them to accelerate mean-time-to-resolution of operational incidents, improve service assurance for customers, simplify the management of cloud infrastructure, and more effectively manage digital transformation initiatives.
It’s time for operations teams to move beyond the antiquity of rules-based legacy solutions and put the modern machine-learning of observability to work. It is a better approach for delivering continuous service assurance to the enterprise.