Ruminations of idle rants and ramblings of a code monkey

Where does StreamInsight fit?


I’ve been working with StreamInsight for over 2 years now and this is one of those questions that I get all the time. Over that time, I’ve refined where I see StreamInsight fitting into an enterprise architecture … and that has included significantly expanding potential use cases. Typically, when people look at StreamInsight – or other CEP tools from other vendors – they think of monitoring sensors or financial markets. Of course, StreamInsight can do this, and does it very well, but there’s a lot more that it can do and a lot more value that the technology can provide to the enterprise. Based on the number of forum posts and the increasing variety of users posting on the forums, it seems that others are beginning to experiment in this area as well and that adoption is picking up. So, in the past couple of months, I’ve put a lot of thinking into where StreamInsight fits outside of the traditional use cases and wanted to share that.

The Paradigm Shift

StreamInsight looks at and handles data in a fundamentally different way than we, as developers, are used to. This is something that everyone getting into StreamInsight struggles with, myself included. Traditionally, we look at data that is in some kind of durable store … whether that be a file, a traditional RDBMS or an OLAP cube. We look at what has happened and was recorded for posterity. It’s stable and static. Time, in traditional data, is an attribute, a field value, that is descriptive of the data that we are looking at but not an integral dimension of the data. Time doesn’t inherently impact how our joins work, how we calculate aggregates or how we select unless we use it in our WHERE clause. It’s not a part of the SELECT or FROM clauses that actually define the shape and structure of the data set. It’s static, relative to the dataset, and references some time in the past, much like a history book’s timeline.

For StreamInsight, it’s very different. In SI, time is an integral dimension of the data, a part of the FROM clause that we are familiar with. You don’t specify this in any of your LINQ queries but it’s there, an invisible dimension that impacts and affects everything that you do. It’s also the thing that’s the hardest for developers to get their heads around because it is so radically different. Many of the query-focused questions on the forums deal with trying to understand how all of this temporality works and how timelines, CTIs and temporal headers interact with events and queries. The things that this allows you to do are difficult in traditional systems. Certainly, they can be done but not without a TON of code that navigates back and forth, keeping track of time attributes and processing in a loop. Even SQL window functions don’t come close (I’ve been asked this) and, while they may provide some capabilities to do things like running averages, doing something like “calculating the 30-minute rolling average every 5 seconds” – which is very easy in StreamInsight – is pretty difficult to accomplish. Native and inherent understanding of the order of events (previous vs. current vs. next) or holes in the data is also difficult – that’s going back to cursors and ORDER BY clauses with a whole lot of looping in the mix as well. Yet, with StreamInsight’s temporal characteristics, these things are relatively simple to do. More sophisticated things, like deadbands and rate of change, are even more difficult with traditional data stores but absolutely doable in StreamInsight with extension points like user-defined operators and aggregates.
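To make the “30-minute rolling average every 5 seconds” idea concrete, here is a minimal sketch in plain Python. It is not StreamInsight code and doesn’t use its API; `Reading` and `hopping_average` are names I made up for illustration. It assumes each event carries its own timestamp and that a window covers the half-open interval [end - window, end).

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Reading(NamedTuple):
    """Hypothetical payload: the timestamp is application time, carried by the event."""
    timestamp: datetime
    value: float


def hopping_average(
    readings: Iterable[Reading],
    window: timedelta = timedelta(minutes=30),
    hop: timedelta = timedelta(seconds=5),
) -> Iterator[Tuple[datetime, Optional[float]]]:
    """Yield (window_end, average) once per hop; each window spans [end - window, end)."""
    events = sorted(readings)      # order by event time, not by arrival order
    if not events:
        return
    start = 0                      # index of the oldest event still inside the window
    end = 0                        # index one past the newest event inside the window
    total = 0.0                    # running sum of the values inside the window
    tick = events[0].timestamp + hop
    last = events[-1].timestamp
    while tick <= last + hop:
        # Admit events that happened before this window's end.
        while end < len(events) and events[end].timestamp < tick:
            total += events[end].value
            end += 1
        # Evict events that fell out of the back of the window.
        while start < end and events[start].timestamp < tick - window:
            total -= events[start].value
            start += 1
        count = end - start
        yield tick, (total / count if count else None)
        tick += hop
```

Note that the clock here (`tick`) moves forward based on the timestamps in the data, never on the system clock. The bookkeeping with `start`, `end` and `total` is exactly the kind of looping code you end up writing by hand when the engine doesn’t understand time for you.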

One comparison I like to make is to talk about driving. The traditional data paradigm would have you driving with a digital camera and taking a picture every x amount of time and then using the display of the picture to navigate and drive. Could you actually do this? Maybe. Probably, if you were good and careful, your camera was fast enough, traffic wasn’t heavy and people actually drove intelligently. But you’d miss a whole lot of things that happen in between snapshots, you’d have a more difficult time understanding where things are going and there’d be a latency in your reaction time. StreamInsight, however, is more similar to how we actually drive and take in our surroundings … our senses provide our brains with a continuous stream of information in a temporal context. We are constantly evaluating the road, other vehicles, their relationship to our current position and where we are going. StreamInsight does similar things with data, though not quite as efficiently as we do without even thinking about it. Our brains are, essentially, a massively parallel CEP system on steroids.

Beyond that, StreamInsight’s understanding of time isn’t necessarily tied to the system clock, another thing that took me some time to get my head wrapped around. Instead, the clock is controlled and moved forward by the application, independent of the system clock. This allows use cases where you can use the temporal capabilities to analyze stored data – essentially replaying the dataset on super-fast-forward. An example of this was a POC that we did for a customer. They had a set of recorded sensor data, with readings for 175 sensors every 5 minutes, that represented about 3 months of data before and shortly after an equipment failure event. They gave us the data and some information about the equipment involved and asked us to find patterns that were predictive of an impending failure. Analyzing the dataset using traditional SQL queries got us nowhere … but when we started doing some (relatively) basic analysis by running the dataset through StreamInsight, several of the patterns quickly became apparent. In doing this, we used the original timestamps but enqueued the data every 50ms – so 50ms of real-world time equaled 5 minutes of application time. At that rate, four months of data could be compressed down to less than a half hour of processing. Now, if we had more powerful laptops than the dual-core i7s with 8 GB of RAM that we were using at the time, we could have done it even faster. Our real limitation wound up being the disk – we were reading the data from a locally installed SQL Server instance and writing to local CSV files to look at the results in Excel. In the end, we were able to determine that, by looking at 2 different values and their relative rates-of-change and variability over a specific time period, we could eliminate false alerts for things like equipment shutdown and pick up the impending equipment failure about a month before it actually happened.
If we had a better understanding of the physics and engineering involved, we could probably have increased the warning time – but that wasn’t too bad for a couple of developers who didn’t have the full specs of the equipment, had very little (or no) engineering background and were basically shooting in the dark. In tests, we’ve pushed about 30,000 events per second – randomly generated and without any analytics – through StreamInsight on our laptops and over 100,000 events/second (with analytics, remote SQL Server data source) on a commodity server-class machine (dual quad-core Xeon with 24 GB RAM) with an average of 35% CPU utilization.
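The time compression in that replay is simple arithmetic, and it can be checked with a few lines of Python. This is a back-of-the-envelope sketch, assuming 30-day months and one batch of 175 readings enqueued per 50ms; `replay_duration` is a made-up helper, not anything from StreamInsight.

```python
from datetime import timedelta

READING_INTERVAL = timedelta(minutes=5)        # application time between readings
ENQUEUE_INTERVAL = timedelta(milliseconds=50)  # wall-clock time between enqueues
SENSORS = 175                                  # readings per batch


def replay_duration(span: timedelta) -> timedelta:
    """Wall-clock time needed to replay `span` of recorded application time."""
    batches = span // READING_INTERVAL         # whole 5-minute steps in the span
    return batches * ENQUEUE_INTERVAL


# How much faster application time runs than the system clock.
speedup = READING_INTERVAL / ENQUEUE_INTERVAL

# Assumption: a month is 30 days.
three_months = replay_duration(timedelta(days=90))
four_months = replay_duration(timedelta(days=120))

# Events pushed into the engine per wall-clock second during the replay.
events_per_second = SENSORS * (timedelta(seconds=1) / ENQUEUE_INTERVAL)

print(speedup, three_months, four_months, events_per_second)
```

Under those assumptions the clock runs 6,000 times faster than real time, three months replays in a bit under 22 minutes, four months in just under 29, and the engine sees about 3,500 events per second.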

The Three V’s – Big Data

“Big Data” is a very hot topic these days. Everyone’s all excited about the new capabilities provided by technologies like Hadoop/MapReduce and Massively Parallel Processing (MPP) and with good reason. These are ground-breaking technologies that allow us to more effectively get information from large amounts of data. But these are still technologies in the traditional paradigm of data – capture, store, retrieve and process. There is a latency involved with this that simply can’t be overcome due to the store/retrieve part of the cycle. No matter how fast the capture and process steps are, the disk is the bottleneck of the system. While SSDs reduce this latency, they can only do so much and are still the slowest part of the entire system.

When talking about Big Data, the “three V’s” often come up … Velocity, Volume and Variety. Hadoop and MPP deal – very well – with the massive volumes of data and Hadoop adds capabilities around variety. But they have trouble – because of the paradigm – with velocity, or the frequency with which data is generated and captured. And, let’s face it, velocity is a critical piece these days. Ten years ago, we talked about “moving at Internet speed” and the agility that the fast pace of change required businesses to have. Today, what we used to call “Internet speed” seems a snail’s pace. We’ve even coined new terms to describe it; “going viral” comes immediately to mind. I’ve come to call it “moving at Twitterspeed” and enterprises need to become even more agile to keep up, especially when it comes to marketing. The impact of social media – particularly Facebook and Twitter – has really driven this fundamental change in the market and companies have, more than once, found themselves completely blindsided by viral explosions across Facebook and Twitter. Handling this velocity, along with understanding whether it is increasing or decreasing and coping with the sheer volume, is becoming a critical business capability that companies are struggling with and, in some cases, failing at spectacularly.

With StreamInsight, handling the velocity of “Twitterspeed” and understanding how things are trending is absolutely do-able. Imagine a corporate marketing department being able to hook into Twitter and other social media streams, analyzing for specific keywords (or hashtags) and highlighting increasing (or decreasing) trends in these keywords … as they are happening. Within minutes, they can then begin to get on top of trends as they are just beginning to “go viral” and formulate an intelligent, coherent response while there’s still time to get ahead of it. It used to be that these trends weren’t readily apparent for days or weeks but now it’s down to hours or minutes when things are going at Twitterspeed. Now, add in geo-location analytics and customers can begin to understand not only what is going on, but where. From here, we can get into more effective and meaningful targeting of marketing messages that may have relevance in one area but not another.
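As a rough illustration of what “highlighting increasing trends in keywords” could look like, here is a small Python sketch. It is not StreamInsight code and it doesn’t talk to Twitter; the `(timestamp, text)` post shape, the function names and the thresholds (`factor`, `min_count`) are all assumptions picked for the example. It counts keyword mentions per fixed window and flags keywords whose count jumped compared to the previous window.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Set, Tuple

Post = Tuple[datetime, str]   # hypothetical payload: (timestamp, text)


def count_by_window(
    posts: Iterable[Post],
    keywords: Set[str],       # lower-case keywords or hashtags to watch
    window: timedelta,
    origin: datetime,
) -> Dict[int, Counter]:
    """Count posts mentioning each keyword, grouped by window index since `origin`."""
    buckets: Dict[int, Counter] = defaultdict(Counter)
    for timestamp, text in posts:
        index = (timestamp - origin) // window
        # Naive tokenizing: whitespace split only, so punctuation is not stripped.
        for keyword in keywords & set(text.lower().split()):
            buckets[index][keyword] += 1
    return buckets


def rising_keywords(
    buckets: Dict[int, Counter],
    index: int,
    factor: float = 2.0,      # assumed: "rising" means at least doubling
    min_count: int = 3,       # assumed: ignore keywords with very few mentions
) -> List[str]:
    """Keywords whose count in window `index` grew by `factor` over the prior window."""
    current = buckets.get(index, Counter())
    previous = buckets.get(index - 1, Counter())
    return sorted(
        keyword
        for keyword, count in current.items()
        if count >= min_count and count >= factor * max(previous[keyword], 1)
    )
```

A real implementation would evaluate this continuously as events arrive rather than over a finished list of posts, but the comparison of one window against the one before it is the core of the trend detection.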

Outside of social media, we also have the increasing interest in the “Internet of Things” – smart devices that capture and report data. These have the potential to take both volume and velocity to a whole new level that makes even “Twitterspeed” look sluggish. Even now, as the IoT is in its earliest stages, there are billions of devices participating, ranging from the smart phones that we carry with us everywhere to RFID, smart meters, smart roads and other device sensors to shoes and wristbands and everything in between. We are just entering an age of truly ubiquitous computing and connectivity, allowing us to capture data from a broad range of sources, both traditional and non-traditional. In many of these cases, even if the velocity isn’t fast, the volume is simply mind-boggling. With an estimated 8 billion or so connected devices today, volumes get very big, very fast, even if they aren’t changing rapidly. And the number of these devices is increasing exponentially, with a projected 50 billion devices by 2020.

StreamInsight is designed to handle both volume and velocity. Because it doesn’t require storage of data but, instead, does analytics in memory, it’s bound by CPU and memory speeds, not by disk. As a result, it can handle data velocity and volume that would simply overwhelm disk-oriented systems. This is especially the case when data is required to be continuously updated and analyzed … to do this with traditional technologies you have to poll the data store. You’ll have to be really careful doing this because you’ll very quickly overwhelm the system. But because StreamInsight pushes all the way through, polling … and the latency and scalability issues associated with it … isn’t a significant problem (unless that’s how you get your source data but that’s a completely different issue). You will, however, want to downsample the data before you send it to a sink/output adapter and in the vast majority of cases, this is actually desirable. Storing every piece of data from the Internet of Things is, quite simply, cost-prohibitive from a storage perspective.
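Downsampling before the sink can be as simple as emitting one averaged value per interval. The following Python sketch shows the idea; it is not a StreamInsight output adapter. `downsample` is a made-up name, and the sketch assumes the readings arrive in time order as `(timestamp, value)` pairs.

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator, Optional, Tuple


def downsample(
    readings: Iterable[Tuple[datetime, float]],
    interval: timedelta,
    origin: datetime,
) -> Iterator[Tuple[datetime, float]]:
    """Yield (interval_start, average) for each interval that contains readings.

    Assumes `readings` is ordered by timestamp; only one interval is held in memory.
    """
    bucket: Optional[int] = None   # index of the interval currently being filled
    total = 0.0
    count = 0
    for timestamp, value in readings:
        index = (timestamp - origin) // interval
        if bucket is not None and index != bucket:
            # The previous interval is complete: hand one value to the sink.
            yield origin + bucket * interval, total / count
            total, count = 0.0, 0
        bucket = index
        total += value
        count += 1
    if count:
        # Flush the last, possibly partial, interval.
        yield origin + bucket * interval, total / count
```

With readings every 10 seconds and a one-minute interval, the sink sees one row where it would otherwise have seen six. The average is just one choice here; min, max or last value work the same way.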

That brings us to the third “V” – Variety. This is something that is a mixed story with StreamInsight. Individual input sources must have a strongly-typed schema; this is your payload. This limits what you can do with the unstructured data that is becoming more prevalent these days. That said, StreamInsight is very good at bringing together multiple (strongly-typed) sources, synchronizing them within the application timeline and then performing analytics across all of them within a temporal context. Take, for example, real-time web server analytics (I’m doing a presentation on this at Sql Saturday Baton Rouge, by the way). On one hand, you have performance counters – we’ve done a good job with these and there are tools aplenty to monitor them. But how do they relate to the executing pages? What pages, what parameters, are executing when the CPU spikes? Are there specific pages that take too long to execute and wind up causing our requests to queue? This requires not only perfmon counters but also some hooks into the ASP.NET pipeline. From there, StreamInsight can take these two very different (and differently structured) data sources and merge them together, synchronizing in time. BUT … the individual data sources are still highly structured.
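Here is a small Python sketch of the “what pages are executing when the CPU spikes?” question. It is not StreamInsight code and it doesn’t read perfmon or hook the ASP.NET pipeline; `CpuSample`, `PageRequest`, `pages_during_spikes` and the 90% threshold are all assumptions for illustration. It shows two differently shaped, strongly-typed sources being matched up purely on time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class CpuSample:
    """Hypothetical payload for a performance counter reading (a point in time)."""
    timestamp: datetime
    percent: float


@dataclass(frozen=True)
class PageRequest:
    """Hypothetical payload for a page execution (an interval in time)."""
    start: datetime
    end: datetime
    url: str


def pages_during_spikes(
    samples: Iterable[CpuSample],
    requests: Sequence[PageRequest],
    threshold: float = 90.0,      # assumed definition of a "spike"
) -> List[Tuple[CpuSample, List[str]]]:
    """Pair each CPU spike with the URLs that were executing at that moment."""
    result = []
    for sample in samples:
        if sample.percent < threshold:
            continue
        # A request overlaps the sample if the sample falls inside [start, end).
        active = [r.url for r in requests if r.start <= sample.timestamp < r.end]
        result.append((sample, active))
    return result
```

The two payloads have nothing in common except time, and that is the only thing the join needs. Each source is still rigidly structured, though, which is the limitation on variety described above.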

Bringing it Together

Don’t take any of this to mean that I’m discounting traditional data paradigms. They are … and always will be … very important. They provide a view into the past that CEP technologies (like StreamInsight) just won’t be able to provide – and really aren’t the right tools for anyway. And these traditional paradigms, with their historical information, can and should be used as “reference data” (or metadata) that further informs real-time analytics. It’s the old axiom … to understand the present, you also need to understand the past and how it relates to the present. So it’s not an either-or discussion but a question of how these technologies fit into the continuum of data and analytics. There’s a lot of focus on Big Data from a traditional paradigm but there’s also a significant amount of value to be found in the data that’s on its way to storage, at the capture point of the process. Downsampling at this stage can also optimize storage costs and overall read performance from Big Data stores as well as provide analytics in near-real-time. StreamInsight expands our capabilities for business intelligence and shortens the timeframe for getting actionable information from the volume of rapidly changing data that is becoming increasingly important – even critical – to businesses faced with things moving at Twitterspeed.