Xml shredding performance

  • Follow


Hi,

we are having some performance issues regarding xml shredding.

At this point we are extracting data from xmls from nearly 60 different 
companies - and therefore 60 different xml structures. The total amount 
of xml is about 350MB and we are trying to extract the data as fast as 
possible.

Our current system extracts, transforms and loads the data in about five 
minutes. We would however like to do this in about one minute to be pleased.

We use the "nodes/cross apply"-technique to shred the xmls into our 
internal format.

This is how we shred the data.
------------------------------

1) Load xml into a temporary table (#XmlTable)
2) Set an xml index
3) Query (like below)

INSERT INTO #TransformedData
SELECT
     T0.T.value('asasd', 'asdadd')
     T1.T.value('asasd', 'asdadd')
FROM
     #XmlTable
CROSS APPLY
     data.nodes('asd') AS T0(T)
     T0.T.nodes('level1') AS T1(T)

DROP #XmlTable

4) Pass the temporary table #TransformedData into the common/shared 
transformation procedure

EXEC LookupData

-------------------------------

This is very I/O intensive and it makes the system slow. Are there any 
other good ways to parse the xmls in the sql server? Should we perhaps 
move the shredding outside the SQL environment into, for instance, a C# 
method which bulk loads the data?

Regards,
Johnny
0
Reply Johnny 3/8/2010 9:37:12 AM

Johny

http://sqlblog.com/blogs/adam_machanic/archive/2010/01/12/t-sql-tuesday-002-is-it-xml-or-not.aspx


http://sqlcat.com/msdnmirror/archive/2010/03/01/performance-tips-of-using-xml-data-in-sql-server.aspx



"Johnny Persson" <a@a.a> wrote in message 
news:%23Ht3vLqvKHA.1796@TK2MSFTNGP02.phx.gbl...
> Hi,
>
> we are having some performance issues regarding xml shredding.
>
> At this point we are extracting data from xmls from nearly 60 different 
> companies - and therefore 60 different xml structures. The total amount of 
> xml is about 350MB and we are trying to extract the data as fast as 
> possible.
>
> Our current system extracts, transforms and loads the data in about five 
> minutes. We would however like to do this in about one minute to be 
> pleased.
>
> We use the "nodes/cross apply"-technique to shred the xmls into our 
> internal format.
>
> This is how we shred the data.
> ------------------------------
>
> 1) Load xml into a temporary table (#XmlTable)
> 2) Set an xml index
> 3) Query (like below)
>
> INSERT INTO #TransformedData
> SELECT
>     T0.T.value('asasd', 'asdadd')
>     T1.T.value('asasd', 'asdadd')
> FROM
>     #XmlTable
> CROSS APPLY
>     data.nodes('asd') AS T0(T)
>     T0.T.nodes('level1') AS T1(T)
>
> DROP #XmlTable
>
> 4) Pass the temporary table #TransformedData into the common/shared 
> transformation procedure
>
> EXEC LookupData
>
> -------------------------------
>
> This is very I/O intensive and it makes the system slow. Are there any 
> other good ways to parse the xmls in the sql server? Should we perhaps 
> move the shredding outside the SQL environment into, for instance, a C# 
> method which bulk loads the data?
>
> Regards,
> Johnny 


0
Reply Uri 3/8/2010 10:06:49 AM

hi Johnny,

On 08.03.2010 10:37, Johnny Persson wrote:
> This is very I/O intensive and it makes the system slow. Are there any
> other good ways to parse the xmls in the sql server? Should we perhaps
> move the shredding outside the SQL environment into, for instance, a C#
> method which bulk loads the data?
You may take a look at 'Performing Bulk Load of XML Data (SQLXML 4.0)':

   http://technet.microsoft.com/en-us/library/ms171993.aspx

Depending on your files, I would at least set some simple C# samples to 
compare performance. You may consider implementing it as CLR stored 
procedure if it performs better.

   http://technet.microsoft.com/en-us/library/ms131094%28SQL.90%29.aspx


mfG
--> stefan <--
0
Reply Stefan 3/8/2010 10:08:41 AM

Johnny Persson (a@a.a) writes:
> 1) Load xml into a temporary table (#XmlTable)
> 2) Set an xml index
> 3) Query (like below)
> 
> INSERT INTO #TransformedData
> SELECT
>      T0.T.value('asasd', 'asdadd')
>      T1.T.value('asasd', 'asdadd')
> FROM
>      #XmlTable
> CROSS APPLY
>      data.nodes('asd') AS T0(T)
>      T0.T.nodes('level1') AS T1(T)

Assuming these are top-level nodes, change this to 

      data.nodes('/asd') AS T0(T)
      T0.T.nodes('/level1') AS T1(T)

There can be huge performance gain by using correct node addressing. 
(Unfortunately, the syntax for the best addressing form can be quite
convuluted in some cases.)



-- 
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
0
Reply Erland 3/8/2010 11:34:54 AM

Thank you for your answer,

we use XML as column data type so the first article does not seem to 
affect us. It is however interesting how the query optimizer work - or 
not work :)

The second article had a really interesting part about typed/untyped 
xmls. We have had no xml schema for any xml so I tried that..

I created an xml schema for an xml file (~5MB) and compared the 
performance between a stored procedure which use the xml schema and a 
stored procedure which does not.

The result is to me a bit strange. When we use a "typed xml column" the 
execution time is ~15s and the subtree cost is 428. When we use the 
normal, untyped xml column, the execution time is the same BUT the 
subtree cost is however 866!

Do you have any thoughts about the result and why the execution time 
isn't affected?

Regards,
Johnny

On 2010-03-08 11:06, Uri Dimant wrote:
> Johny
>
> http://sqlblog.com/blogs/adam_machanic/archive/2010/01/12/t-sql-tuesday-002-is-it-xml-or-not.aspx
>
>
> http://sqlcat.com/msdnmirror/archive/2010/03/01/performance-tips-of-using-xml-data-in-sql-server.aspx
>
>
>
> "Johnny Persson"<a@a.a>  wrote in message
> news:%23Ht3vLqvKHA.1796@TK2MSFTNGP02.phx.gbl...
>> Hi,
>>
>> we are having some performance issues regarding xml shredding.
>>
>> At this point we are extracting data from xmls from nearly 60 different
>> companies - and therefore 60 different xml structures. The total amount of
>> xml is about 350MB and we are trying to extract the data as fast as
>> possible.
>>
>> Our current system extracts, transforms and loads the data in about five
>> minutes. We would however like to do this in about one minute to be
>> pleased.
>>
>> We use the "nodes/cross apply"-technique to shred the xmls into our
>> internal format.
>>
>> This is how we shred the data.
>> ------------------------------
>>
>> 1) Load xml into a temporary table (#XmlTable)
>> 2) Set an xml index
>> 3) Query (like below)
>>
>> INSERT INTO #TransformedData
>> SELECT
>>      T0.T.value('asasd', 'asdadd')
>>      T1.T.value('asasd', 'asdadd')
>> FROM
>>      #XmlTable
>> CROSS APPLY
>>      data.nodes('asd') AS T0(T)
>>      T0.T.nodes('level1') AS T1(T)
>>
>> DROP #XmlTable
>>
>> 4) Pass the temporary table #TransformedData into the common/shared
>> transformation procedure
>>
>> EXEC LookupData
>>
>> -------------------------------
>>
>> This is very I/O intensive and it makes the system slow. Are there any
>> other good ways to parse the xmls in the sql server? Should we perhaps
>> move the shredding outside the SQL environment into, for instance, a C#
>> method which bulk loads the data?
>>
>> Regards,
>> Johnny
>
>
0
Reply Johnny 3/8/2010 2:03:26 PM

Johnny
Unfortunately I  have very limited knowledge about using XML in SQL 
Server..., if Erland jumps in .....



"Johnny Persson" <a@a.a> wrote in message 
news:%23qXjhgsvKHA.5940@TK2MSFTNGP02.phx.gbl...
> Thank you for your answer,
>
> we use XML as column data type so the first article does not seem to 
> affect us. It is however interesting how the query optimizer work - or not 
> work :)
>
> The second article had a really interesting part about typed/untyped xmls. 
> We have had no xml schema for any xml so I tried that..
>
> I created an xml schema for an xml file (~5MB) and compared the 
> performance between a stored procedure which use the xml schema and a 
> stored procedure which does not.
>
> The result is to me a bit strange. When we use a "typed xml column" the 
> execution time is ~15s and the subtree cost is 428. When we use the 
> normal, untyped xml column, the execution time is the same BUT the subtree 
> cost is however 866!
>
> Do you have any thoughts about the result and why the execution time isn't 
> affected?
>
> Regards,
> Johnny
>
> On 2010-03-08 11:06, Uri Dimant wrote:
>> Johny
>>
>> http://sqlblog.com/blogs/adam_machanic/archive/2010/01/12/t-sql-tuesday-002-is-it-xml-or-not.aspx
>>
>>
>> http://sqlcat.com/msdnmirror/archive/2010/03/01/performance-tips-of-using-xml-data-in-sql-server.aspx
>>
>>
>>
>> "Johnny Persson"<a@a.a>  wrote in message
>> news:%23Ht3vLqvKHA.1796@TK2MSFTNGP02.phx.gbl...
>>> Hi,
>>>
>>> we are having some performance issues regarding xml shredding.
>>>
>>> At this point we are extracting data from xmls from nearly 60 different
>>> companies - and therefore 60 different xml structures. The total amount 
>>> of
>>> xml is about 350MB and we are trying to extract the data as fast as
>>> possible.
>>>
>>> Our current system extracts, transforms and loads the data in about five
>>> minutes. We would however like to do this in about one minute to be
>>> pleased.
>>>
>>> We use the "nodes/cross apply"-technique to shred the xmls into our
>>> internal format.
>>>
>>> This is how we shred the data.
>>> ------------------------------
>>>
>>> 1) Load xml into a temporary table (#XmlTable)
>>> 2) Set an xml index
>>> 3) Query (like below)
>>>
>>> INSERT INTO #TransformedData
>>> SELECT
>>>      T0.T.value('asasd', 'asdadd')
>>>      T1.T.value('asasd', 'asdadd')
>>> FROM
>>>      #XmlTable
>>> CROSS APPLY
>>>      data.nodes('asd') AS T0(T)
>>>      T0.T.nodes('level1') AS T1(T)
>>>
>>> DROP #XmlTable
>>>
>>> 4) Pass the temporary table #TransformedData into the common/shared
>>> transformation procedure
>>>
>>> EXEC LookupData
>>>
>>> -------------------------------
>>>
>>> This is very I/O intensive and it makes the system slow. Are there any
>>> other good ways to parse the xmls in the sql server? Should we perhaps
>>> move the shredding outside the SQL environment into, for instance, a C#
>>> method which bulk loads the data?
>>>
>>> Regards,
>>> Johnny
>>
>> 


0
Reply Uri 3/8/2010 2:26:05 PM

Johnny Persson (a@a.a) writes:
> The result is to me a bit strange. When we use a "typed xml column" the 
> execution time is ~15s and the subtree cost is 428. When we use the 
> normal, untyped xml column, the execution time is the same BUT the 
> subtree cost is however 866!
 
The subtree cost is just an estimate, so take it for what it's worth.

Schemabound XML may have some advantages, if you extract a lot from the
same document, but it can also cut the other way, as schema vaildation
is quite expensive.

As I said in another post, you should look at how you address the nodes.
There is a lot to win in that area.


-- 
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
SQL 2000: http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

0
Reply Erland 3/8/2010 11:05:40 PM

"Johnny Persson" <a@a.a> wrote in message 
news:%23Ht3vLqvKHA.1796@TK2MSFTNGP02.phx.gbl...
> INSERT INTO #TransformedData
> SELECT
>     T0.T.value('asasd', 'asdadd')
>     T1.T.value('asasd', 'asdadd')
> FROM
>     #XmlTable
> CROSS APPLY
>     data.nodes('asd') AS T0(T)
>     T0.T.nodes('level1') AS T1(T)

Just adding to what Erland said, if your XML data looks like this:

<level1>
<asd>...</asd>
<asd>...</asd>
</level1>

Consider using the '//asd' path.  Also consider putting a numeric predicate 
on your .value() function calls like this 'asasd[1]'.  It would help to have 
some representational XML data if you can supply some (and some expected 
results).  The XML schema can give the optimizer some hints that can help 
performance, but you have to weigh the benefit against the extra cost of 
validating your data against it.  Another option is to perform the shredding 
outside of SQL Server in your application.  If you don't want to bother with 
SQLXML Bulk Loading you can use .NET to shred the data and bulk load it.

-- 
Thanks

Michael Coles
SQL Server MVP
Author, "Expert SQL Server 2008 Encryption" 
(http://www.apress.com/book/view/1430224649)
----------------

0
Reply Michael 3/9/2010 2:31:27 AM

Hi Erland,

Yes, below is a real example of one of our parsers. As far as I 
understand the node addressing is correct.. Or am I wrong? It has 
happened before :)

	SELECT
		-- /A/B
		TG.T.value('@name', 'nvarchar(255)'),
		TG.T.value('(ElementX)[1]', 'nvarchar(255)'),
		TG.T.value('(ElementY)[1]', 'nvarchar(255)'),

		-- /A/B/C
		TOS.T.value('@type', 'nvarchar(255)'),
		TOS.T.value('@name', 'nvarchar(255)'),

		-- /A/B/C/D
		TOO.T.value('@name', 'nvarchar(255)'),
		TOO.T.value('@id', 'nvarchar(255)')
	FROM
		#XmlTable
		CROSS APPLY Data.nodes('/A/B') AS TG(T)
		CROSS APPLY TG.T.nodes('C') AS TOS(T)
		CROSS APPLY TOS.T.nodes('D') AS TOO(T)

Regards,
Johnny

On 2010-03-09 00:05, Erland Sommarskog wrote:
> Johnny Persson (a@a.a) writes:
>> The result is to me a bit strange. When we use a "typed xml column" the
>> execution time is ~15s and the subtree cost is 428. When we use the
>> normal, untyped xml column, the execution time is the same BUT the
>> subtree cost is however 866!
>
> The subtree cost is just an estimate, so take it for what it's worth.
>
> Schemabound XML may have some advantages, if you extract a lot from the
> same document, but it can also cut the other way, as schema vaildation
> is quite expensive.
>
> As I said in another post, you should look at how you address the nodes.
> There is a lot to win in that area.
>
>
0
Reply Johnny 3/9/2010 7:03:50 AM

Hi,

I have now tried to shred the data in a C# console project. The 
performance gain is huge.

Which is the best option to pass the data to the SQL server? Bulk load 
into a table variable or "send the data" through a table valued CLR method?

Regards,
Johnny

On 2010-03-09 03:31, Michael Coles wrote:
> "Johnny Persson" <a@a.a> wrote in message
> news:%23Ht3vLqvKHA.1796@TK2MSFTNGP02.phx.gbl...
>> INSERT INTO #TransformedData
>> SELECT
>> T0.T.value('asasd', 'asdadd')
>> T1.T.value('asasd', 'asdadd')
>> FROM
>> #XmlTable
>> CROSS APPLY
>> data.nodes('asd') AS T0(T)
>> T0.T.nodes('level1') AS T1(T)
>
> Just adding to what Erland said, if your XML data looks like this:
>
> <level1>
> <asd>...</asd>
> <asd>...</asd>
> </level1>
>
> Consider using the '//asd' path. Also consider putting a numeric
> predicate on your .value() function calls like this 'asasd[1]'. It would
> help to have some representational XML data if you can supply some (and
> some expected results). The XML schema can give the optimizer some hints
> that can help performance, but you have to weigh the benefit against the
> extra cost of validating your data against it. Another option is to
> perform the shredding outside of SQL Server in your application. If you
> don't want to bother with SQLXML Bulk Loading you can use .NET to shred
> the data and bulk load it.
>
0
Reply Johnny 3/9/2010 7:07:55 AM

Johnny Persson (a@a.a) writes:
> Yes, below is a real example of one of our parsers. As far as I 
> understand the node addressing is correct.. Or am I wrong? It has 
> happened before :)

"Correct" and "best" are two different things. 
 
>           CROSS APPLY TG.T.nodes('C') AS TOS(T)
>           CROSS APPLY TOS.T.nodes('D') AS TOO(T)

Here you should have /C and /D for best performance.
 
If you only specify C, XPath will return any node named C, not just top
nodes. And finding those takes more time.

-- 
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
0
Reply Erland 3/9/2010 8:13:51 AM

Johnny Persson (a@a.a) writes:
> I have now tried to shred the data in a C# console project. The 
> performance gain is huge.
> 
> Which is the best option to pass the data to the SQL server? Bulk load 
> into a table variable or "send the data" through a table valued CLR
> method? 
 

SqlBulkCopy or table-valued parameters if you are on SQL 2008.


-- 
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
0
Reply Erland 3/9/2010 8:15:27 AM

Ok,

thank you for all your answers. I will parse the xmls inside a 
C#-project and pass the data through a table-valed parameter.

Thank you all, once again.

Regards
Johnny

On 2010-03-09 09:15, Erland Sommarskog wrote:
> Johnny Persson (a@a.a) writes:
>> I have now tried to shred the data in a C# console project. The
>> performance gain is huge.
>>
>> Which is the best option to pass the data to the SQL server? Bulk load
>> into a table variable or "send the data" through a table valued CLR
>> method?
>
>
> SqlBulkCopy or table-valued parameters if you are on SQL 2008.
>
>
0
Reply Johnny 3/9/2010 8:26:27 AM

12 Replies
589 Views

(page loaded in 0.235 seconds)

Similiar Articles:



















7/20/2012 9:43:33 PM


Reply: