Perspective on technology trends for acquiring, analyzing, and acting on streaming data from Roger Rea, Product Manager for IBM Streams.
Drawing on examples from complex event processing in financial services to motor vehicle monitoring, Rea explains the importance of a platform for integrating disparate data streams.
Adrian Bowles: Roger, you’re Product Manager for Streams, right?
Roger Rea: Yes, that’s correct, Adrian.
AB: I’ve been with you long enough today that I know you’re really passionate about Streams. Maybe you can take us back to the beginning. How did you get there?
RR: My boss came to me one day and said, hey, there’s this thing coming out of research, I don’t know what it is. Execs decided we’ll turn it into a product, you interested in taking it over? I said, sure, I need more to do. What is it?
I got a hold of the development manager, they sent me the programming manual and I read it and I said, this is an amazing technology. They spent five years in research with it, with the Department of Defense and a few early customers in healthcare and space weather prediction. The key behind it was a language that allowed you, at a high level, to string together a flow graph or a pipeline, a series of operations to bring in data, compile that and then spread it across a cluster of computers.
AB: Okay.
RR: At that time, with all high-performance computing kinds of things, you had to manually decide, this is how much computation I can do on this computer, and if I can’t do the whole job, I manually have to spill it over to some other computer. That was the state of the art, but of course, that means there’s a lot of burden on the individual designing the architecture and the applications.
What do you do if suddenly the volume changes? It’s a stock trading day and suddenly the volume goes up. What you used to fit on one computer doesn’t fit anymore. This was a technology that would go look at your application, the way you’d specified the flow of data, the kinds of algorithms you would use, and automatically distribute that across a clustered runtime.
So I said, I’d really love to have this, let’s take this over. So we’ve worked to really evolve that technology, understand the marketplace it fits in, and really shift the marketplace.
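To make the flow-graph idea concrete, here is a minimal sketch in plain Python of a pipeline built by stringing operators together. It is not the Streams language or its API, and it runs in a single process, so the automatic distribution across a cluster that Rea describes is not shown. The operator names (`source`, `only_symbol`, `moving_average`) and the tick fields are hypothetical, chosen only for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def source(ticks: Iterable[dict]) -> Iterator[dict]:
    """Ingest step: hand tuples downstream one at a time."""
    for tick in ticks:
        yield tick


def only_symbol(stream: Iterable[dict], symbol: str) -> Iterator[dict]:
    """Filter step: keep only ticks for one (hypothetical) stock symbol."""
    for tick in stream:
        if tick["symbol"] == symbol:
            yield tick


def moving_average(stream: Iterable[dict], window: int) -> Iterator[dict]:
    """Aggregate step: attach the average price over the last `window` ticks."""
    recent = deque(maxlen=window)
    for tick in stream:
        recent.append(tick["price"])
        yield {**tick, "avg": sum(recent) / len(recent)}


def run_pipeline(ticks: Iterable[dict], symbol: str = "IBM", window: int = 3) -> list:
    """Wire the operators into a linear flow graph and drain it."""
    return list(moving_average(only_symbol(source(ticks), symbol), window))
```

Each operator only sees the tuples handed to it by the previous one, which is the property that would let a runtime place operators on different machines without changing the application logic.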
AB: That’s a lot of data.
RR: That’s a lot of data. Part of the deep analytics definition is that it’s all different kinds of data. It’s structured data. It’s unstructured data. It’s geospatial data. It’s distance calculations around the globe to try and understand where are you, when will you arrive, how many minutes will that be based upon your current course and speed and other things about the landing pattern.
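As a rough illustration of the distance-and-arrival calculation mentioned here, the sketch below uses the standard haversine formula for great-circle distance and divides by speed to estimate minutes to arrival. It assumes a spherical Earth, a direct great-circle route, and constant speed, and it ignores the landing pattern entirely. The function names are hypothetical and are not taken from any Streams toolkit.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def minutes_to_arrival(lat: float, lon: float,
                       dest_lat: float, dest_lon: float,
                       speed_kmh: float) -> float:
    """Estimated minutes to destination at a constant ground speed."""
    if speed_kmh <= 0:
        raise ValueError("speed_kmh must be positive")
    return haversine_km(lat, lon, dest_lat, dest_lon) / speed_kmh * 60.0
```

In a streaming setting a function like this would be re-evaluated on every position update, so the estimate tracks the current course and speed rather than a flight plan.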
We do things in call centers where we do speech to text, so it’s totally unstructured data. To be able to convert speech into words, into sentences, and then be able to do natural language processing to understand sentiment, to understand intent, to understand root-cause analysis. What’s the problem, why is this person calling in? So we can help that customer care agent more quickly solve the customer’s problem.
AB: Are there specific industries that are really early in the adoption using this type of technology?
RR: Yes, there are really two industries that come to mind. The early part of the industry was more described as complex event processing. A lot of that was in the finance industry, algorithmic trading kinds of things.
The velocity you mentioned is important because, in some of those kinds of applications, responding very quickly is of utmost importance. But frankly, in some ways that’s a misnomer for this streaming marketplace as it is today because many applications don’t require the low latency. A good example is healthcare.
Your body will change all the time, but I only need to know once every five minutes or so how your physiological changes will impact what will happen to you in the future. Or even in geospatial applications. A car moving on a freeway at 60 miles an hour, not at rush hour, of course. But moving at 60 miles an hour is 88 feet per second. Well, a millisecond is a thousandth of a second. That means every millisecond you move about an inch. I might be able to process and react to telemetry telling me about a car moving an inch, but I really don’t need to take any action that fast. If I can take an action in several seconds, sometimes several minutes, that’s important.
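The arithmetic behind the car example can be checked in a few lines. This is only a unit conversion; the function names are made up for the illustration.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600
INCHES_PER_FOOT = 12
MILLISECONDS_PER_SECOND = 1000


def feet_per_second(mph: float) -> float:
    """Convert miles per hour to feet per second."""
    return mph * FEET_PER_MILE / SECONDS_PER_HOUR


def inches_per_millisecond(mph: float) -> float:
    """Distance in inches covered in one millisecond at the given speed."""
    return feet_per_second(mph) * INCHES_PER_FOOT / MILLISECONDS_PER_SECOND


# 60 mph works out to 88 feet per second, or 1.056 inches per millisecond.
print(feet_per_second(60), inches_per_millisecond(60))
```

So a one-millisecond response corresponds to roughly one inch of travel, which is why seconds or even minutes are usually fast enough for this kind of telemetry.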
And what’s different about these applications is the continuous ingest of all different kinds of data, and then taking an action at the right time and the right channel. For many of my customers that are in customer care, it’s exactly that. I might learn because you’re on my website that you’re disgruntled, but I probably don’t want to text you in under a second because you might get upset about that. I might have past history with you as an individual customer, “I only want to talk to the bank’s employees when I go to the branch or when I log onto their website. Please don’t send me anything.” So you need to take that into account, and even though you would like to respond to them right away, you know all you’ll do is make them angrier and angrier and more likely to leave. So you need to understand and gain insights in real time, but your action might not need to be in real time.
We have a rules language that we use from IBM Operational Decision Manager, compiled to run on Streams, so that you can specify actions or other things to take in regular rules. We support many different machine learning capabilities. We have about 40 native machine learning models where, if you don’t need a training dataset, you can just learn as you go and build up the model, make predictions, detect anomalies. Or we can use models that have training datasets that you learn offline, such as neural networks for the speech-to-text technology that we use, or many of the others that are there for nearest neighbor or Holt-Winters seasonal predictive algorithms. Train them offline with training data. Just do the scoring in real time to decide a confidence level and action to take.
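The "learn as you go" style of model can be sketched with a very simple anomaly detector that needs no training dataset. The example below keeps a running mean and variance using Welford's online algorithm and flags a reading that sits more than a chosen number of standard deviations from the mean seen so far. It is a generic technique, not one of the native Streams models, and the class name, `threshold`, and `warmup` parameters are assumptions made for the illustration.

```python
import math


class RunningAnomalyDetector:
    """Flags readings far from the running mean; learns from every reading."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5) -> None:
        self.threshold = threshold  # z-score above which a reading is flagged
        self.warmup = warmup        # readings to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x: float) -> bool:
        """Score x against the model so far, then fold x into the model."""
        is_anomaly = False
        if self.n >= self.warmup and self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford's update: one pass, constant memory per stream.
        # Note that flagged readings are still folded into the model here.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly
```

Feeding it readings such as `10.0, 10.2, 9.8, 10.1, 9.9, 10.0` and then `25.0` flags only the last one. The offline-trained case Rea contrasts this with would replace the update step with a fixed, pre-trained model and keep only the scoring in the stream.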
AB: So if I can try to net out the advice, you can look at applications in your portfolio that are currently using databases and see if it needs to be stored, if you can actually process as it’s coming through. That’s why, whenever I hear Streams, I think of a physical stream.
RR: Yes.
AB: And the saying that you never stand in the same stream twice. If you are there and you’re getting that data as it comes through, you can be more effective. Then the other family of applications would be things that are using new data sources, or sources that you couldn’t process before, like text, videos, et cetera.
RR: Yes, and that’s exactly where most of our customers have been. Let me redo something that’s been around a while, but do it more efficiently, do it faster. Let me do something that’s never been done before because of these new data sources, bringing them together in a unique way.
AB: That brings me to a way to maybe close it out. Getting started. Are most people starting by looking at existing applications? How do you recommend people look at this?
RR: There’s a couple of different ways that people get into this. One is some of them do look at existing applications and say, I need a better way of doing this. Some of those are clickstream applications that people are starting to do. In the past, a lot of those were batch oriented. How do I create every five minutes, every hour, the right recommendation? No, let’s do that on each click. How do I handle that in real time? Some of those are places where they’re looking at existing applications.
Others are—and we get a lot of business in this fashion too—where an executive says, “I have this problem and I need somebody really smart to go analyze the data and figure it out.” So they hire a data scientist, go look at the data, create some models, which could be machine learning like we think about, but there’s scores of different machine learning algorithms in a lot of different categories. One is just clustering. What’s the cluster of things people buy together at the grocery store on Saturday morning?
You can do those clustering kinds of algorithms, learn some unique things, and then you say, great, I’ve learned this wonderful new thing about my business. How do I operationalize that? How do I put it into production? Do I send a memo out to all the frontline retail people? “Please read this study and start taking action.” No, maybe I’d like to start to build a real-time application that can get some of these different kinds of data—the same data that was used to build the model—and just put it into a process so that it takes action as that transaction comes in. That’s one source of data that comes in, notes from other places and things like that.
AB: I think what I’m taking away from all of this is, one of your comments is going to stick with me—that all data is created in real time, and this enables you, if it makes sense, to actually use it in real time.
RR: That’s what we talk about. Acquire data, analyze the data, act on the data.
AB: Great. Thanks.
RR: Thank you. I appreciate the time, Adrian.