Observability enables enterprises to go beyond monitoring to precisely understand behavior of modern applications and systems.
Articles about observability often focus on a list of data types that a monitoring system must ingest to support a modern IT environment. Dwelling on logs, metrics, and traces in this way is an unfortunate narrowing of the topic, because it misses the point of monitoring.
The ability to ingest these three data types is insufficient to give an organization a monitoring capability suited to the full range of use cases faced by ITOps, DevOps, and Site Reliability Engineering (SRE) teams.
Nor is it even necessary. Since observability is deeply intertwined with the concept of causality, the real goal should be deploying monitoring systems that lead to effective causal analysis of the systems being observed. A modern observability tool should help enterprises to precisely observe system behavior to determine causality. With this understanding, teams can effectively respond to issues for rapid remediation, or even act in advance to prevent issues from happening at all.
Understanding business process performance
Let’s consider a new, more useful approach to monitoring, one that is intended to perform at a technical level far more appropriate to the complexities, velocities, and criticality of modern IT in supporting a digitally transformed business.
First, the concerns. ITOps, DevOps, and SRE teams are recognizing the limitations of traditional monitoring technologies. Some pundits and industry thinkers say we should replace the goal of monitoring IT systems with the goal of observing IT systems. As mentioned, many define observability in terms of ingesting three data types: logs, metrics, and traces. Using a broad array of data types is a good thing. However, ingestion alone will not bring Ops teams closer to understanding the actual sequence of system state changes that underpins the execution of a digital business process.
Ingesting these (often redundant) data yields little insight into the health of the business process or of the underlying IT system supporting that process.
How observability addresses modern IT systems
The concept of observability first arose in mathematical control theory. It starts from the idea that we want to determine the actual sequence of state changes that a mechanism, whether deterministic or stochastic, goes through during a given time period.
In many cases, we do not have direct access to the mechanism. We cannot observe the state changes directly, so we cannot simply write down the sequence. Instead, we must rely on data or signals generated by the system, and perhaps by its surrounding environment, together with a technical procedure to infer the state-change sequence from that data.
Note that the ability to go from the data set to the state-change sequence is a property of the mechanism itself or, at worst, of the mechanism and its environment. A mechanism is observable precisely when it and its environment generate a data set for which some procedure exists to infer the state-change sequence that actually occurred while the data set was being generated. This is the only way to achieve true observability of modern systems.
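To make the control-theory notion concrete, here is a minimal sketch (my own illustration, not drawn from the article) of the classical observability test for a linear time-invariant system: with state update x[k+1] = A x[k] and output y[k] = C x[k], the internal state sequence can be reconstructed from the outputs exactly when the matrix built by stacking C, CA, ..., CA^(n-1) has full rank.

```python
import numpy as np

def is_observable(A: np.ndarray, C: np.ndarray) -> bool:
    """Classical observability test for x[k+1] = A x[k], y[k] = C x[k].

    The system is observable iff the matrix [C; CA; ...; CA^(n-1)]
    has rank n, i.e. the hidden state sequence can be inferred from
    the observed outputs alone.
    """
    n = A.shape[0]
    obs_matrix = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])
    return np.linalg.matrix_rank(obs_matrix) == n

# Two internal states; the dynamics couple the second state into the first.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])

C_good = np.array([[1.0, 0.0]])   # we measure only the first state
C_blind = np.array([[0.0, 1.0]])  # we measure only the second state

print(is_observable(A, C_good))   # True: both states leave a trace in the output
print(is_observable(A, C_blind))  # False: the first state never affects the output
```

The same idea carries over informally to IT systems: a system is observable when the signals it emits are rich enough that some procedure can reconstruct what the system actually did.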
Limitations of legacy monitoring tools
Traditional monitoring systems have not incorporated a modern definition of observability. Their focus is capturing, storing, and presenting data generated by underlying IT systems. Legacy tools generally force human operators to make any inferences regarding what the data reveals about the underlying IT systems. (Alert fatigue, anyone?)
Human toil rises further when operators must also juggle topology diagrams, configuration management database (CMDB) data models, and other attempts to represent the IT environment. In most cases, these models were either developed manually or generated by some other procedure independent of the data being ingested. This creates gaps in monitoring and in understanding the actual environment, especially as virtual enterprise resources spin up and down in seconds.
In the best of circumstances, these system representations provide interpretive context for the data captured by the monitoring system. The data could, in principle, also inform updates to those representations, but legacy tools provide no algorithms or processes that link the data to the system representations.
Domain-centric tool providers are aware of these limitations for modern systems, and some are offering newer technologies to address them. For example, after ingesting target data, some newer tools actively seek patterns and anomalies in the data, but they fall short of enabling true observability. Why? Because the target patterns and anomalies are statistical properties of the data sets themselves; they do not incorporate analysis of the system that generated the data. Getting there requires a domain-agnostic observability solution that can work across a diverse modern environment.
Put another way, the patterns and anomalies have to do with normalities of correlation and occasional departures from those normalities. They do not capture the causal relationships that support the actual state changes within the IT system itself. For ITOps, DevOps, and SRE teams, understanding causality is the bottom line for observability.
Clarifying the meaning of causality
Causality is confirmation, supported statistically, that "X" caused "Y" to occur. It differs from correlational normality, and the distinction matters for understanding IT system state changes. Think of two events captured by two data items: say, CPU utilization standing at 90% and end-user response time for a given application clocked at three seconds. When one occurs, the other occurs. When two events always accompany one another in this way, we call it a correlational normality.
Correlational normality is not necessarily a causal relationship. If a causal relationship exists in this scenario, then lowering CPU usage to, say, 80% would change the response time, perhaps shortening it to two seconds. In other words, a causal connection between two events is demonstrated by showing that an intervention that changes one event will result, without further intervention, in a change to the other event.
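A small simulation makes the distinction concrete (a hypothetical toy model, with invented metric names and numbers chosen to echo the example above): response time is driven by CPU utilization, while a third metric, queue depth, merely co-moves with CPU. Intervening on CPU shifts response time; intervening on the merely correlated metric would not.

```python
import numpy as np

rng = np.random.default_rng(0)

def emit_telemetry(cpu_util):
    """Toy data-generating process: response time is caused by CPU utilization;
    queue depth is merely correlated with it (both are driven by load)."""
    response_ms = 100 * (cpu_util - 60) + rng.normal(0, 50, cpu_util.shape)
    queue_depth = 0.5 * cpu_util + rng.normal(0, 0.5, cpu_util.shape)
    return response_ms, queue_depth

# Observational data: CPU hovers around 90%, response time around 3 seconds.
cpu = rng.normal(90, 3, 10_000)
resp, queue = emit_telemetry(cpu)
print(np.corrcoef(resp, queue)[0, 1])   # strong correlation, yet no causal link

# Intervention: force CPU down to 80% and regenerate the outcomes.
resp_after, _ = emit_telemetry(np.full(10_000, 80.0))
print(resp.mean(), resp_after.mean())   # roughly 3000 ms before, 2000 ms after

# Forcing queue depth to a new value, by contrast, would leave response time
# unchanged, because queue depth is an effect of load, not a cause of latency.
```

It is the causal claim that licenses the prediction that the CPU intervention will work; the correlation between response time and queue depth licenses no such prediction.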
Herein lies the challenge in establishing causal relations that link IT system events and digital business process events: management will not tolerate "conducting experiments" on a production system just to establish causality.
Using observability data to determine causality
Fueling analytic tools with observability data is your best path to determining causality in a non-disruptive way. Specifically, the operational data generated by a system feeds algorithms that statistically assess potential causality. System state changes are events linked by causality; a given sequence of such changes forms a causal chain.
When a mechanism moves an IT system from one state to another, there must be some way of modifying the first event so that the second event is also modified as an automatic consequence. Hence, if causality can be established, or at least closely approximated, an understanding of system-state changes will have been obtained. In other words, the system will have been observed, not just monitored.
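As one illustration of approaching causality from purely observational telemetry, the sketch below applies a Granger-style test from statsmodels to two synthetic metric streams (my choice of method and made-up data; the article does not prescribe a specific algorithm). It asks whether past values of CPU utilization improve prediction of response time beyond response time's own history, and vice versa.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(1)

# Synthetic telemetry: today's response time depends on the previous sample of
# CPU utilization, while CPU utilization follows its own autoregressive path.
n = 500
cpu = np.empty(n)
resp = np.empty(n)
cpu[0], resp[0] = 70.0, 900.0
for t in range(1, n):
    cpu[t] = 70 + 0.8 * (cpu[t - 1] - 70) + rng.normal(0, 3)
    resp[t] = 200 + 10 * cpu[t - 1] + rng.normal(0, 20)

# grangercausalitytests checks whether the second column helps predict the first.
forward = grangercausalitytests(np.column_stack([resp, cpu]), maxlag=2, verbose=False)
reverse = grangercausalitytests(np.column_stack([cpu, resp]), maxlag=2, verbose=False)

print("CPU -> response time p-value:", forward[1][0]["ssr_ftest"][1])  # small: predictive
print("response time -> CPU p-value:", reverse[1][0]["ssr_ftest"][1])  # typically large
```

Granger-style tests establish predictive precedence rather than full interventional causality, so results like these are a starting point for the causal analysis described here, not a final verdict.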
For these reasons, I encourage you to take a close look at your organization’s legacy monitoring tools and determine what they offer Ops teams for observability on modern systems. Upgrading observability capability may be necessary to move beyond monitoring for better insight and control of your modern systems.