Ruminations of J.net idle rants and ramblings of a code monkey

Race Conditions in StreamInsight Query Startup?

StreamInsight

A bit of explanation is due here. The platform that we are building at Logica is intended to be reusable, configurable and flexible … from adapters to queries (or, more properly sets of related queries). A part of the architecture is what we call the StreamingQueryProvider. This is the centerpiece of everything; after all, everything in StreamInsight revolves around the queries. Initial startup instantiates and starts configured StreamingQueryProviders based on configuration. We could do this serially and, in a lot of ways, this would be MUCH easier but there is a potentially huge cost in startup time when multiple query providers need to be started. This is exacerbated when the input adapters have an expensive startup as well … something that see regularly when starting our OPC-DA adapter. This is because a) our adapter is connecting to and registering for 3000+ item subscriptions and b) OPC-DA is the old standard and is based on (the horror!) DCOM. It’s not something that we can really optimize at the adapter level; it’s something that is inherent in the beast. Each adapter typically takes somewhere between 20 and 45 seconds to initialize, register and start. When you have multiple input adapters registering different subscriptions, you are beginning talk about some serious startup costs. In a real-world deployment scenario, we expect that there could be several OPC sources with even more than 3000 subscriptions each. Now, once the events get into StreamInsight, it handles them like a charm. It’s just the startup. We’ll be working on OPC-UA and, possibly, OPC .NET 3.0 (OPC-Xi – a server-side COM wrapper for OPC-DA/HDA/AE that exposes a WCF-based interface to clients) that I suspect will perform better but here will still be limitations due to the sheer number of subscriptions – a typical offshore rig can have up to 70,000 individual sensors, depending on age of the platform and the operator. It is likely that the number of sensors will grow even more after the Macondo accident.

To help optimize startup, we’ve multithreaded the query provider startup – and the queries within each query provider – using the .NET thread pool (System.Threading.ThreadPool::QueueUserWorkItem). This has helped to cut startup time significantly. With our typical test setup, we’ve gone from 1:30 to 2 minutes startup to about 20 seconds – a big and very noticeable improvement. But, in doing so, we’ve also run into some race conditions that require some careful handling.

To understand this, I’ll explain what StreamInsight does when a query and input adapter are started. If you look at the interface for the Microsoft.ComplexEventProcessing.Application class, you’ll see several collections. The ones that are of particular interest are InputAdapters, OutputAdapters and EventTypes. These collections cache instances of your adapter factories and the definitions of your CepEventTypes. When using Microsoft.ComplexEventProcessing.Linq.CepStream::ToQuery, the StreamInsight engine will first check these collections and, if they are not there, will create the cached objects – 1 per unique type.

When multi-threading your startup queries, you can run into a race condition where you have two threads checking and creating the cached items. Between the check and creation on 1 thread, the other thread creates the cached instance. You will then get one of my absolute favorite exceptions: the TargetInvocationException. Depending on where the condition happens in the code path, it may be the InputAdapter, the EventType or the OutputAdapter. The exception message gives you the necessary details to understand what went wrong but … what do you then do about it? You can trap the exception and retry the operation – which then succeeds without exception but trapping exceptions and using them for control flow is just nasty-ugly. And it’s not like it’s something that you can simply reproduce at will – it’s a classic race condition that sometimes happens but mostly doesn’t. We’re currently using a dependency list to help deal with this issue – making sure that this race condition doesn’t happen but, as I write this entry, it occurs to me that there may be some better ways to handle and avoid these conditions. I won’t detail these possible resolutions until I do some further testing on them.