Why Real-time Businesses Need “Flight Data Recorders”


Whether you’re just beginning your real-time business initiatives or are well into them, having insights into the behavior and context of your real-time stream processing programs is critically important.

If you’re a data-driven business, you’re collecting data, cleansing it, analyzing it, and setting strategies that the data leads you to believe will result in positive outcomes. If you need to do a post-mortem on the strategy, the data is still there. You can revisit it, combine it with newer data, re-crunch it, and iteratively course correct. Easy peasy.

But with data-driven real-time business…not so easy peasy.

With real-time data processing power come great challenges

There are two challenges with real-time data processing. First, it results in immediate action, such as an automatic fraud alert, an AI chatbot response, vehicle re-routing, an online shopping recommendation, or a financial trade. Second, the data driving the real-time automation is often ephemeral: by the time it has triggered some action, the data may have been discarded or transformed.

Real-time data is not a static data set that can be sliced and diced. Real-time data such as credit card swipes at the point of sale, GPS readings from a connected vehicle, or clicks on an eCommerce site flows in streams. Stream processing programs, often developed with Apache Flink, analyze the data as it flows and take action on it. The programs accept streams as inputs and look for patterns or trends in the data, detect anomalies, calculate aggregate values, or enrich the streams with additional data.
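To make this concrete, here is a minimal, framework-free Python sketch of what such a program does: consume events one at a time, maintain bounded state (a window), and emit derived results as the data flows through. The window size and anomaly threshold are illustrative choices, not recommendations, and a real Flink job would express the same idea with windowed operators.

```python
from collections import deque

def rolling_average(events, window_size=5):
    """Stream through `events`, maintaining a bounded window of state,
    and yield (value, rolling average, anomaly flag) for each event."""
    window = deque(maxlen=window_size)
    for value in events:
        window.append(value)
        avg = sum(window) / len(window)
        # Flag values that deviate sharply from the recent average
        # (only once the window is full, to avoid cold-start noise).
        is_anomaly = len(window) == window_size and value > 2 * avg
        yield value, round(avg, 2), is_anomaly

# Simulated stream of credit card transaction amounts:
stream = [20, 25, 22, 24, 23, 210, 21]
results = list(rolling_average(stream))
```

Note that the program never holds the whole stream: state is bounded by the window, which is exactly why the raw data that triggered an action may no longer exist afterwards.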

The data that flows into a stream processing program may be very different from what flows out. And what happens inside, when the program runs in production, is often a black box, making it difficult for data infrastructure teams to respond swiftly and correctly when real-time systems go sideways. In the same way that variations in altitude, weather, or autopilot software can cause an aircraft to perform unpredictably, variations in streaming data format, event ordering, or stream processing code can cause real-time business processing to go awry.

Real-time business requires a “flight data recorder”

Unless the plane, or your stream processing program, has a way to record and analyze behavior over time in a production (or production-like) environment, it becomes very difficult to understand exactly when and why something unexpected happened. We know that planes solve this by carrying flight data recorders on board. And through decades of use, they have helped the airline industry accelerate advances in flight reliability, safety, and efficiency.

Real-time businesses require the same “flight data recorder” tooling for the stream processing programs that drive them. When things go wrong with the real-time data infrastructure, such as application errors, unexpected AI responses, system crashes, and cyberattacks, the business expects a swift response from data ops and data infrastructure teams. Unfortunately, most data infrastructure leaders I know admit to not having adequate visibility into stream processing. It’s a new scenario for them, and it’s more challenging than observing more traditional infrastructure like databases and file systems due to the volume and ephemeral nature of streaming data.

Observability lessons from military intelligence

The concept of a flight data recorder for real-time data processing was one my team developed during our years in military field intelligence. We processed all sorts of intelligence data streams in real time. Downtime, missed signals, and false positives and negatives were not tolerable. As data leaders, not only were our jobs at stake; people’s lives depended on the correctness and availability of our real-time data infrastructure.

Every system had an owner: a single point of contact who was responsible for the flawless execution of the system. All military systems had end-to-end observability. If things went south, the root cause was clear to the team and could be assigned to the member best qualified to fix it. There was no “blame game” or time wasted disputing the cause or ownership of the issue. Data infrastructure owners in the enterprise have the same accountability and require the same insight into stream processing performance in order to keep operational systems operational.

Creating your real-time business flight data recorder

There are a number of ways you can gain the required insights into your real-time business systems. To help simplify the discussion, let’s focus on observability for Apache Flink since it’s such a popular stream processing platform. Unfortunately, popular systems monitoring tools only reveal what flows in and out of Flink stream processing programs and often only expose generic monitoring metrics. Metrics are great for understanding if something is about to, or did, go wrong, but not WHY something went wrong. To fully understand the context of the system’s health and behavior, you must collect traces, logs, and metrics along the same timeline.

To record what happens within a Flink program at runtime, you can build a bespoke observability solution or employ a turnkey solution on the market (search for “Apache Flink observability” to find commercial options). If you decide to go the bespoke route, which is fine when you’re first starting out, the following guidelines should help you define your solution:

Actually build it!

“The road to hell is paved with good intentions.” Because it doesn’t directly address end-user requirements, observability infrastructure development often gets pushed back when resources are needed to meet new business application requirements. Until disaster strikes, which is indeed hell for streaming data infrastructure teams who lack visibility into the data and stream processing. Delaying or descoping your real-time observability infrastructure is a trap you can avoid.

Trace all the real-time things

Your Flink application developers must build observability tooling into their programs. Their real-time processing programs will need to output application traces of events and state, metrics, and logs to storage, in a correlated timeline and with a format that can be analyzed (e.g., OpenTelemetry). I see many organizations capture only system monitoring metrics, but you must go beyond that. Full observability makes it clear what specific code, data sources, or infrastructure were related to an issue. This allows the system owner to not only initiate an investigation but also assign the resolution to the most informed team member, along with relevant data to help them troubleshoot the problem. When downtime or data quality is involved, the efficiency of that process may prevent losses of money and reputation.
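The key property is correlation: every signal a pipeline emits should share an identifier and a timestamp so traces, logs, and metrics can be lined up on one timeline afterwards. Here is a bespoke Python sketch of that idea using only the standard library; it is not the OpenTelemetry SDK, and the field names are illustrative, but the same structure maps naturally onto OpenTelemetry’s span and log records.

```python
import json
import time
import uuid

def make_trace_emitter(pipeline_name, sink):
    """Return an emit() function that writes structured observability
    records to `sink`. All records from one pipeline run share a
    trace_id, so they can be correlated on the same timeline later."""
    trace_id = uuid.uuid4().hex
    def emit(kind, name, **attributes):
        record = {
            "trace_id": trace_id,       # correlates all signals for this run
            "timestamp": time.time(),   # shared timeline
            "pipeline": pipeline_name,
            "kind": kind,               # "trace", "log", or "metric"
            "name": name,
            "attributes": attributes,
        }
        sink.append(json.dumps(record))
        return record
    return emit

# Usage inside a hypothetical stream processing operator:
events = []
emit = make_trace_emitter("fraud-alerts", sink=events)
emit("trace", "window_evaluated", window_size=5)
emit("metric", "records_processed", count=1000)
emit("log", "anomaly_detected", value=210.0)
```

Because every record is self-describing JSON with a shared `trace_id`, an investigator can pull the full story of a single pipeline run, which is exactly the “flight data recorder” behavior described above.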

Bespoke observability platforms often rely on developers to add logging and trace output logic to their code. It’s easy to do, but the completeness and consistency of the recorded data may vary from developer to developer. You’ll want to put standards in place and enforce adherence to them to ensure you capture real-time performance in an actionable way.

Secure your recording data

If your real-time data contains sensitive information, you should consider masking PII data in the recordings to adhere to data privacy policies and regulations. Role-based access control to recording data is also an appropriate safeguard.
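One common masking approach is to hash sensitive values before they reach the recording store. Here is a minimal Python sketch under that assumption; the `PII_FIELDS` set and token format are illustrative. Hashing keeps recordings joinable (the same card number always maps to the same token) without storing the raw value, though a production system would add a secret salt to resist dictionary attacks.

```python
import hashlib

PII_FIELDS = {"card_number", "email", "name"}  # fields our policy treats as PII

def mask_pii(record, pii_fields=PII_FIELDS):
    """Return a copy of `record` with PII fields irreversibly masked,
    leaving non-sensitive fields intact for troubleshooting."""
    masked = {}
    for key, value in record.items():
        if key in pii_fields:
            digest = hashlib.sha256(str(value).encode()).hexdigest()
            masked[key] = f"pii:{digest[:12]}"   # stable, non-reversible token
        else:
            masked[key] = value
    return masked

swipe = {"card_number": "4111111111111111", "amount": 42.50, "merchant": "acme"}
safe = mask_pii(swipe)
```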

Decide recording retention policies

Real-time stream processing pipelines can process thousands of data items per second. That’s a lot of data to store! You should devise a data retention policy in order to help keep infrastructure costs under control. Keep the most recent several days of recording data accessible in warm storage—however long you think it would take to realize a real-time business process was misbehaving. As data ages, you can move it to cold storage to save money.
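A tiered policy like this can be expressed as a simple age-based rule. The sketch below assumes two thresholds, `warm_days` and `cold_days`, whose defaults are illustrative values rather than recommendations; tune them to how long it realistically takes your team to notice a misbehaving process.

```python
from datetime import datetime, timedelta

def retention_tier(recorded_at, now, warm_days=7, cold_days=90):
    """Classify a recording by age: keep recent data warm and
    queryable, archive older data, delete past retention."""
    age = now - recorded_at
    if age <= timedelta(days=warm_days):
        return "warm"      # fast storage, immediately queryable
    if age <= timedelta(days=cold_days):
        return "cold"      # cheaper archival storage
    return "delete"        # past retention, remove

now = datetime(2024, 6, 1)
tier_recent = retention_tier(datetime(2024, 5, 30), now)  # 2 days old -> "warm"
tier_old = retention_tier(datetime(2024, 3, 1), now)      # 92 days old -> "delete"
```

A scheduled job would apply this function to each stored recording and move or delete it accordingly.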

Prioritizing a phased approach

If you’re phasing in observability on a pipeline-by-pipeline basis, you should give the highest priority to pipelines that support regulated processes, because you’ll benefit from greater auditability. Behind those, prioritize pipelines with the greatest impact on profitability, then customer experience; in other words, pipelines where downtime would cost you the most in lost revenue or customer goodwill.
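That ordering rule is easy to encode as a sort key. In this Python sketch, the pipeline records and their scoring fields (`regulated`, `profit_impact`, `cx_impact`) are hypothetical names for illustration; your inventory will have its own shape.

```python
def prioritize_pipelines(pipelines):
    """Order pipelines for a phased observability rollout:
    regulated processes first, then profitability impact,
    then customer-experience impact."""
    return sorted(
        pipelines,
        key=lambda p: (
            not p["regulated"],   # regulated first (False sorts before True)
            -p["profit_impact"],  # then biggest revenue impact
            -p["cx_impact"],      # then customer experience
        ),
    )

pipelines = [
    {"name": "recommendations", "regulated": False, "profit_impact": 8, "cx_impact": 9},
    {"name": "fraud-detection", "regulated": True,  "profit_impact": 9, "cx_impact": 7},
    {"name": "fleet-routing",   "regulated": False, "profit_impact": 9, "cx_impact": 5},
]
ordered = prioritize_pipelines(pipelines)
```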

Concluding thoughts

Whether you’re just beginning your real-time business initiatives or are well into them, having insights into the behavior and context of your real-time stream processing programs is critically important. It will help you ensure that your real-time processes are fully auditable and achieve their expected benefits, and minimize downtime and security risks. You can build your own real-time business flight recorder or choose a commercial solution, depending on your needs and whether you have the resources to build, maintain, and support a solution internally. Launch your real-time business and keep it airborne with Flink observability!


About Ronen Korman

Ronen Korman is the Co-founder and CEO of Datorios, a venture-funded Apache Flink developer.
