DesignAndCode.net

thoughts on Design, code, code management, database design and more

Cross Apply XPath and XML

In my previous post, I mentioned that we had an issue with an XML column within a table from a vendor we use.

From the vendors point of view - each customer would probably need different data that they would want as metadata stored for a document in their database, and so using an XML structure probably made sense.

In the query issue we had there was a "cross apply" used for the XML, and it left an assumption that any reader just knew what that meant and why a "Cross Apply" was even used here.

A number of years ago - I was attending a software conference and someone told me that they knew some people who were going to put up a new website for question and answers for technical questions, that was to be called Stackoverflow; I was an early adopter and have mentioned it to anyone learning software development. For this subject of "cross apply" there is a good answer of "why cross apply is needed when using xpath queries".  The  .nodes() function returns a rowset - which the FROM clause needs - and the XPATH can shorten the amount of XPath type names needed in the select.

It is that additional rowset query part - that made the vendor query seem like a table scan within a table scan and was performing so badly.

The previous post calls out the user defined function to extract out a single node into a computed column; and then putting an index on that column. Taking the query time from 4.5 minutes to subsecond response time. As the computed column eliminated the need for any cross apply - and a subquery to a an XML column's rowset.

SQL Server XML Column Performance

Recently at work, we had a stored procedure that was timing out. The stored procedure involved used joins across three vendor tables, and a "cross apply" into an XML column in two of the tables.

There is a line of thinking that I have often encountered, while working within an enterprise, that you normally would not consider modifying a vendors table structure, but the performance of using a "cross apply" for specific XML nodes within the XML Schema - in that XML Column was taking over 4.5 minutes to return a result - with only 845K rows in the main table.

One of my co-workers suggested using temporary tables within the stored procedure - which is actually a better way to go that what was there; which came originally from the vendor. The stored procedure mechanism extracted records from the main table that contained the specific XML nodes being looked for - and then joined the temp tables together reducing the overall table scan and then the XML nodes. This did result in a query that dropped the time down to about 1.4 minutes; still not really all that acceptable.

Response times over 30 seconds hit the default SQL Server timeout - and without a better option, we would have to add a timeout parameter into the command to retrieve the records. It is not a great user experience to have to wait more than 3 seconds. 

To review the details of what was eating up the time - beyond the execution plan, i got curious about performance of table variables vs temp tables - and verified that temp tables were the better option in a stored proc.

One article I came across  to measure the performance used

SET STATISTICS profile ON

SET STATISTICS time  ON

XXX <some Query being measured>

SET STATISTICS time OFF

SET STATISTICS profile OFF

which surfaced very detailed aspects of what was going in using the Cross Apply for the XML - it was like a table scan within a table scan - as it has to scan through all the nodes of the XML within each row.

That led me to research for performance tips on an XML Column.

That has the idea of writing a User Defined function (UDF) that surfaces the value of the XML Node you are after as a Computed Column value - and then putting an Index on that Computed column.

I wrote out the needed function and applied the index in our development system.  I altered the query structure from the stored proc that had been taking 4.5 minutes to use this computed column and now I got subsecond response time.  Without measuring the millisecond values - it was roughly 275 time faster. We need to see if it has any impacts on the database in other ways - but so far this was our best solution.

Only because the computed column is relying on values in the row itself that this can make sense - I do not think a computed column over lots of rows would perform well; this specific case though did.