thoughts on Design, code, code management, database design and more

Reaching into the TPL

At my current employer, we have a series of Windows Services that receive notifications of external system changes (adds, updates, deletes), parse those notifications into database tables, and then publish the notifications internally for consumption by applications within the enterprise.

The first of those services uses IBM MQ messaging and was written around event-based notifications, with multiple Tasks listening for events from MQ. This mechanism effectively hooks into the MQ broker and provides a callback on the connection: an AutoResetEvent signals when there is an event to process, so when MQ has an event it calls into the "Listener" to invoke the message capture. The capture is critical. Once these messages are read they do not stay on the queue, so persisting them, without exposing any sensitive information, is essential. The process tries to store the event in a SQL Server database. If there are any exceptions it does a rollback to MQ; otherwise a successful store commits the message on MQ so that we do not receive it a second time. With multiple Tasks listening, this service is effectively processing in parallel, with each message handled within the context of a Task.
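The store-then-commit rule described above can be sketched roughly as follows. This is only an illustration, not the service's actual code: the four delegates (receive, store, commit, rollback) are hypothetical stand-ins for the MQ read, the SQL Server insert, and the MQ commit and rollback calls.

```csharp
using System;

public static class MessageCapture
{
    // receive, store, commit and rollback are hypothetical delegates standing in
    // for the MQ read, the SQL Server insert, and the MQ commit/rollback calls.
    // Returns true when the message was stored and committed.
    public static bool CaptureOne(
        Func<string> receive,
        Action<string> store,
        Action commit,
        Action rollback)
    {
        string message = receive();
        try
        {
            store(message);
        }
        catch (Exception)
        {
            // The store failed: hand the message back to MQ for redelivery.
            rollback();
            return false;
        }

        // Only a successful store reaches the commit, so a message is never
        // removed from the queue before it has been persisted.
        commit();
        return true;
    }
}
```

The commit sits outside the try block on purpose in this sketch, so that a failure during the commit itself is not answered with a rollback after the row has already been stored.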

The other two services were single threaded. The first reformats the event data to create database lookup keys; the second builds new messages and publishes them on an internal MQ broker so that business processes can receive notifications of the add/update/delete events as they need them. Being single threaded was fine for over a year, until its ability to serve one of the lines of business was called into question when a massive set of events arrived.

Over the weekend more than 60K events hit, and when I was called in to look at the issue there was a backlog of 37K events. The 27K or so events that had been processed had gone through very slowly in the eyes of the business unit that was starting to use this system. So I was tasked with providing throughput that would surpass the publishing source, so that these processes were not a bottleneck.

I found that an exception was occurring within the events, which are XML bodies: a particular node often contained an apostrophe, which caused an exception that was caught and retried successfully. But the exception threw the parsing logic out of its processing loop midstream, and it then had to reload a batch that included the unfinished set of data (reading the same data twice, plus the performance hit of the exception handling). So I first added code to detect this kind of issue before the database inserts and give the XML the right escape codes, which avoids the exception and streamlines the batch of work being done.
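As a rough illustration of that kind of pre-check, the sketch below escapes a node value before it goes any further. It assumes the fix amounts to XML entity escaping and uses the framework's SecurityElement.Escape for it; the class and method names are made up for the example.

```csharp
using System;
using System.Security;

public static class XmlText
{
    // Hypothetical helper: returns the node text with the XML-reserved
    // characters (< > " ' &) replaced by their entity codes.
    public static string EscapeNodeValue(string value)
    {
        // Cheap detection first, so clean values are passed through untouched.
        if (string.IsNullOrEmpty(value) || value.IndexOfAny(new[] { '<', '>', '"', '\'', '&' }) < 0)
        {
            return value;
        }

        return SecurityElement.Escape(value);
    }
}
```

For example, XmlText.EscapeNodeValue("O'Brien") gives "O&apos;Brien". Text that is already escaped should not be passed through again, since the ampersands would be escaped a second time.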

Then I saw that the logic used an index into an array holding the batch of data to populate the insert statements being built. I considered SqlBulkCopy, but handling exceptions in that would complicate the code more, and I had been given a short time to do all the work. So I took the loop that was building a SQL connection and command and turned it into a Parallel.For loop. On the first try I left the connection outside the loop, but that caused connection-closed exceptions, so I had to move the connection inside the Parallel.For body, which is a lambda delegate that the .NET Framework can run in parallel.
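A minimal sketch of that shape, with the connection created inside the loop body, might look like the following. The connection factory, the table name ParsedEvents and the column Body are all hypothetical, and the code is written against the provider-neutral System.Data.Common types rather than the service's real data access code.

```csharp
using System;
using System.Data.Common;
using System.Threading.Tasks;

public static class BatchWriter
{
    // connectionFactory, ParsedEvents and Body are assumptions for illustration.
    public static void InsertBatch(string[] batch, Func<DbConnection> connectionFactory)
    {
        Parallel.For(0, batch.Length, i =>
        {
            // Each iteration gets its own connection and command. A single
            // connection shared from outside the loop would be used by several
            // threads at once, which is what produced the close exceptions.
            using (DbConnection connection = connectionFactory())
            using (DbCommand command = connection.CreateCommand())
            {
                command.CommandText = "INSERT INTO ParsedEvents (Body) VALUES (@body)";

                DbParameter body = command.CreateParameter();
                body.ParameterName = "@body";
                body.Value = batch[i];
                command.Parameters.Add(body);

                connection.Open();
                command.ExecuteNonQuery();
            }
        });
    }
}
```

Exceptions thrown inside the loop body surface from Parallel.For as an AggregateException, so the calling code needs to unwrap it to see which inserts failed.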

The production processing time had been estimated and timed for the old process. One of the errors that occurred during the production run with the new line of business was that they had a field that exceeded the size of the table column we parsed the data into, and we had to restructure the database to allow for the new length of this field (3 hours to do that). With single threading, processing the remaining records would have taken hours. The existing code had timing logic built into the Debug version, and with the batch size set to 100 records I got the single-threaded time. I tested again after changing to the Parallel.For loop and my measurements showed up to an 86% improvement in throughput. I took the 37K records in a restored copy of the PROD database from the restructure and processed them in under 10 minutes.

The vendor's highest publishing rate was about 17K per hour. Based on the 37K that I processed in under 10 minutes (37K x 6 = 222K), the new parsing throughput should be over 200K per hour, so the parsing was not going to be a bottleneck.

The existing Notification program code that publishes to MQ used a foreach loop, and I was able to change that to Parallel.ForEach. I also passed the service's CancellationToken into the ParallelOptions for the Parallel.ForEach so that processing stops when the service stops. Throughput was measured single threaded and then in parallel, and showed a 76% improvement.
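The change can be sketched like this. The publish delegate is a hypothetical stand-in for the code that puts one message on MQ, and the class and method names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class NotificationPublisher
{
    // publish is a hypothetical stand-in for the MQ put of one notification.
    // Returns true if the whole set was published, false if the service
    // token was cancelled before the loop finished.
    public static bool PublishAll(
        IEnumerable<string> notifications,
        Action<string> publish,
        CancellationToken serviceToken)
    {
        var options = new ParallelOptions { CancellationToken = serviceToken };
        try
        {
            Parallel.ForEach(notifications, options, publish);
            return true;
        }
        catch (OperationCanceledException)
        {
            // Thrown by Parallel.ForEach once the token is cancelled;
            // iterations already running are allowed to complete first.
            return false;
        }
    }
}
```

Because the token only stops new iterations from starting, a long-running publish would also need to check the token itself if it has to stop quickly.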

I keep Console versions of the Windows Services for just this kind of testing. They use Console.ReadKey to trigger the same cancellation token that the service uses at the Task level, which is now also re-used in the ParallelOptions for the Parallel.ForEach.
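A console harness along those lines could be sketched as below. It is an assumption about the general shape rather than the real test programs: the work delegate stands in for the service's processing loop, and the wait is passed in as a delegate so the same code can be driven by a key press or by something else.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ConsoleHarness
{
    // Runs work on a Task, blocks until waitForStop returns, then cancels
    // the shared token and waits for the work to finish.
    public static void Run(Action<CancellationToken> work, Action waitForStop)
    {
        using (var cts = new CancellationTokenSource())
        {
            Task worker = Task.Run(() => work(cts.Token));

            waitForStop();
            cts.Cancel();

            try
            {
                worker.Wait();
            }
            catch (AggregateException ex) when (ex.InnerException is OperationCanceledException)
            {
                // The work stopped because of the cancellation: expected.
            }
        }
    }

    // Console variant: any key press plays the role of the service stop.
    public static void RunUntilKeyPress(Action<CancellationToken> work)
    {
        Console.WriteLine("Press any key to stop.");
        Run(work, () => Console.ReadKey(true));
    }
}
```

The token handed to work is the one that would go into the ParallelOptions of any Parallel.ForEach inside it, so a key press stops the loop the same way a service stop would.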

It was satisfying to take the knowledge I had gathered over the last couple of years on concurrent programming, apply it to the programs here, and solve the business concern about processing speed.

The Task Parallel Library was introduced with .NET Framework 4 and Visual Studio 2010, and has been improved in the versions since. Parallel.For and Parallel.ForEach are a little different in how they are applied, but the results are very satisfying.