Ruminations of J.net idle rants and ramblings of a code monkey

Where does StreamInsight data come from?

StreamInsight

I got this question at Sql Saturday BI Edition in Dallas not too long ago. My answer was the usual one … “Well, you write an input adapter or (in 2.1) a source in .NET …”.  “Yes, I know that,” the asker replied, “but where does it come from?” “Ummm …” I was getting a little flustered. “Anything you can get to with .NET.” “You aren’t helping,” he replied … and proceeded to tell me that I’m giving him the same answer that everyone has given him and it’s not really an answer. Now, this wasn’t just some muffin-eater that you see at free conferences sometimes where you find yourself wondering if they should be allowed to use a computer at all. No, it came from fellow MVP and Microsoft Certified Master Sean McCown of Midnight DBA fame. I’d sat in on Sean’s sessions before and knew that he was a little smarter (at least) than your average bear. So I knew I was missing something. I scratched my head. And then went into a discussion … and whiteboarding … of how OSISoft’s PI System Adapters work.

It’s been some time since then and I’ve been meaning to write this follow up. You see, it finally dawned on me as Sean was ranting about the answer not being an answer that, well, it wasn’t really an answer. The whole getting-data-to-StreamInsight thing has been something like the Underwear Gnomes from South Park. Step 1: Get Data Source. Step 2 … … … Step 3: Write Adapter! What’s missing is step 2 … how does the data get from your data source into the adapter/source code. In reality, it does vary pretty widely, depending on what your data source is.

Determining the Data You Want

Your first step is to determine the data that you want to get into StreamInsight. You’ll have two primary types of data: fast, stream data for analysis and slow-moving, reference data. For your stream data, you will typically want to get it as close to the original source of this data as possible – within reason. As we get further into this, it’ll make more sense what I’m talking about here. You also want to keep this data as lightweight as possible. There’s no point in sending extra bits over the wire that you really don’t need with every update. Other data, the stuff that doesn’t change (for example, with a pressure sensor you may have units of measure, device type, subsystem, etc.) is reference data. You’ll still need it but you won’t want it with every single individual update. Depending on the data rates and the number of individual data points that you want, it’s entirely conceivable that you’ll run out of network bandwidth before you run out of StreamInsight processing power. You’ll also want to have a minimum number of sources/input adapters as possible so if you can get reading from multiple items/devices in a single adapter, all the better.

With that said, let’s talk about some of the options for data sources.

Raw Device Connections

Yup, you can do this. Well … as long as you have a way to get to it. Connecting directly to the device will be the fastest way to get the highest speed data possible. Now, exactly how you do that depends on the device. Some process control system sensors will transmit over dedicated and special-purpose protocols like Modbus or Profibus. They may ride on TCP, UDP, serial ports or even dedicated network connection require special hardware. Or you may have a completely custom and proprietary protocol. At any rate, you’re getting data direct from the device in its most raw form. Depending on the device and its capabilities, this may be a push or it may be a poll. And, as a result, you’ll need to take the bytes off the wire and deserialize them into events for StreamInsight. Let me give you an example of something that we’re currently working on. In this system, we’re actually getting data from building sensors via a push … things like fire alarms and door access devices. There is a common protocol for these devices but it’s all binary. The first byte gives you a header that tells you what kind of message it is and a timestamp. From there, you can determine how to read the rest of the bytes. The structure is documented and it’s a matter of taking the bytes, rearranging them and then enqueuing them. It’s not hard but it is tedious and you really have to pay attention to little details since you don’t have a nice object API to work with.

Device Aggregation Systems With “Push”

No, that’s not the formal name but it’s about the best “bucket” that I can come up with. Again, this covers a wide variety of types of systems that will aggregate raw device signals and then provide notifications of updates. It’s a layer above the raw device connection so you’ll lose a little bit of latency but the advantages are huge. Since you can aggregate multiple devices into a single connection, you minimize the number of adapters that you need to have running at any one time. Nor will you have to worry about fiddling with byte arrays and deserializing them into usable data – these systems will have APIs that do a lot of the heavy lifting for you. I’d put things like OPC and OSISoft’s PI Adapters for StreamInsight into this bucket. OPC has the advantange of being very widely deployed and accepted. Unfortunately, the most common implementation is the (very) old OPC-DA, which is based on (the horror!) DCOM. However … it does work very well. With OPC-DA, you create a connection to the OPC DCOM server and then subscribe for callbacks via old school COM events. As new readings come in, the OPC server sends your adapter/source the updates, which are then enqueued. OSISoft’s PI Adapters are similar except they are all .NET. Now … if you are familiar with PI, you also know that it’s a historian – it stores event data long-term and, to do this efficiently uses compression algorithms that amount to downsampling. (You can, in fact, use OLEDB to get to the PI archive for historical data.) However, the adapters don’t read from the archive. As the data comes into PI from the individual sensors, it goes into the Snapshot. Every bit of data that the PI Server gets goes into the Snapshot, whether it’s actually stored or not. It’s after the Snapshot that it goes into the compression algorithms for efficient storage. The PI adapters connect to the Snapshot and register for notifications; as data comes into PI and goes into the Snapshot, it is then sent to the input adapter and then to StreamInsight.

Polling a Store (Store-And-Forward)

You don’t want to do this if you can avoid it. But you can’t always avoid it. Sometimes you will want to sit alongside an existing system that collects data in a database and analyze it with StreamInsight temporal operators or correlate it with other real-time events. Sometimes the architecture and nature of the systems involved really require this. Standards like WITSML and PRODML come to mind what mentioning this. You will want to balance the amount of polling that you do with your latency requirements and you’ll have to keep in mind that there will be latency involved. As with other data, you want to only get the least amount of data so your source will need to have some way for you to select all since last poll. It’s actually the easiest to do but the least desirable from a StreamInsight perspective. SQL Server Service Broker is one area that may make some of this a little better from a Sql Server perspective and one that, honestly, I need to really spend some time looking at.

Reference Data

Your reference data is going to come from some sort of durable store like Sql Server, Oracle or web services. This is the metadata that you use in your queries but that doesn’t change very often. You’ll want to set this up as a poll but with a very long interval. Exactly how long will vary based on your use cases … every few minutes or every hour or even longer. Depending on the amount of data that you have, you may also want to look at only getting changes since last poll … again, minimizing the amount of data being pulled across the wire is always a Very Good Thing™. The pattern for reference data is pretty well defined and hasn’t changed at all … I still refer folks to Mark Simms’ post of 2 years ago. With StreamInsight 2.1, some things have changed with importing CTIs (which I will look at again later) but the end result is still the same.

So … hopefully I’ve fleshed out and clarified that mysterious step 2. In the end, it really depends on how you can get to the data that you want to analyze. The aggregation with push is what I try to lean towards … it’s relatively low latency and gives you a lot of bang for the buck. In the real world, you don’t want to be writing adapters for every device under the sun, especially when you can connect with OPC or OSISoft PI and leverage the extensive connectivity that you’ll have with these established platforms.