Hi,
we are having some performance issues regarding xml shredding.
At this point we are extracting data from xmls from nearly 60 different
companies - and therefore 60 different xml structures. The total amount
of xml is about 350MB and we are trying to extract the data as fast as
possible.
Our current system extracts, transforms and loads the data in about five
minutes. We would however like to do this in about one minute to be pleased.
We use the "nodes/cross apply"-technique to shred the xmls into our
internal format.
This is how we shred the data.
------------------------------
1) Load xml into a temporary table (#XmlTable)
2) Set an xml index
3) Query (like below)
INSERT INTO #TransformedData
SELECT
T0.T.value('asasd', 'asdadd')
T1.T.value('asasd', 'asdadd')
FROM
#XmlTable
CROSS APPLY
data.nodes('asd') AS T0(T)
T0.T.nodes('level1') AS T1(T)
DROP #XmlTable
4) Pass the temporary table #TransformedData into the common/shared
transformation procedure
EXEC LookupData
-------------------------------
This is very I/O intensive and it makes the system slow. Are there any
other good ways to parse the xmls in the sql server? Should we perhaps
move the shredding outside the SQL environment into, for instance, a C#
method which bulk loads the data?
Regards,
Johnny
|
|
0
|
|
|
|
Reply
|
Johnny
|
3/8/2010 9:37:12 AM |
|
Johny
http://sqlblog.com/blogs/adam_machanic/archive/2010/01/12/t-sql-tuesday-002-is-it-xml-or-not.aspx
http://sqlcat.com/msdnmirror/archive/2010/03/01/performance-tips-of-using-xml-data-in-sql-server.aspx
"Johnny Persson" <a@a.a> wrote in message
news:%23Ht3vLqvKHA.1796@TK2MSFTNGP02.phx.gbl...
> Hi,
>
> we are having some performance issues regarding xml shredding.
>
> At this point we are extracting data from xmls from nearly 60 different
> companies - and therefore 60 different xml structures. The total amount of
> xml is about 350MB and we are trying to extract the data as fast as
> possible.
>
> Our current system extracts, transforms and loads the data in about five
> minutes. We would however like to do this in about one minute to be
> pleased.
>
> We use the "nodes/cross apply"-technique to shred the xmls into our
> internal format.
>
> This is how we shred the data.
> ------------------------------
>
> 1) Load xml into a temporary table (#XmlTable)
> 2) Set an xml index
> 3) Query (like below)
>
> INSERT INTO #TransformedData
> SELECT
> T0.T.value('asasd', 'asdadd')
> T1.T.value('asasd', 'asdadd')
> FROM
> #XmlTable
> CROSS APPLY
> data.nodes('asd') AS T0(T)
> T0.T.nodes('level1') AS T1(T)
>
> DROP #XmlTable
>
> 4) Pass the temporary table #TransformedData into the common/shared
> transformation procedure
>
> EXEC LookupData
>
> -------------------------------
>
> This is very I/O intensive and it makes the system slow. Are there any
> other good ways to parse the xmls in the sql server? Should we perhaps
> move the shredding outside the SQL environment into, for instance, a C#
> method which bulk loads the data?
>
> Regards,
> Johnny
|
|
0
|
|
|
|
Reply
|
Uri
|
3/8/2010 10:06:49 AM
|
|
hi Johnny,
On 08.03.2010 10:37, Johnny Persson wrote:
> This is very I/O intensive and it makes the system slow. Are there any
> other good ways to parse the xmls in the sql server? Should we perhaps
> move the shredding outside the SQL environment into, for instance, a C#
> method which bulk loads the data?
You may take a look at 'Performing Bulk Load of XML Data (SQLXML 4.0)':
http://technet.microsoft.com/en-us/library/ms171993.aspx
Depending on your files, I would at least set some simple C# samples to
compare performance. You may consider implementing it as CLR stored
procedure if it performs better.
http://technet.microsoft.com/en-us/library/ms131094%28SQL.90%29.aspx
mfG
--> stefan <--
|
|
0
|
|
|
|
Reply
|
Stefan
|
3/8/2010 10:08:41 AM
|
|
Johnny Persson (a@a.a) writes:
> 1) Load xml into a temporary table (#XmlTable)
> 2) Set an xml index
> 3) Query (like below)
>
> INSERT INTO #TransformedData
> SELECT
> T0.T.value('asasd', 'asdadd')
> T1.T.value('asasd', 'asdadd')
> FROM
> #XmlTable
> CROSS APPLY
> data.nodes('asd') AS T0(T)
> T0.T.nodes('level1') AS T1(T)
Assuming these are top-level nodes, change this to
data.nodes('/asd') AS T0(T)
T0.T.nodes('/level1') AS T1(T)
There can be huge performance gain by using correct node addressing.
(Unfortunately, the syntax for the best addressing form can be quite
convuluted in some cases.)
--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
|
|
0
|
|
|
|
Reply
|
Erland
|
3/8/2010 11:34:54 AM
|
|
Thank you for your answer,
we use XML as column data type so the first article does not seem to
affect us. It is however interesting how the query optimizer work - or
not work :)
The second article had a really interesting part about typed/untyped
xmls. We have had no xml schema for any xml so I tried that..
I created an xml schema for an xml file (~5MB) and compared the
performance between a stored procedure which use the xml schema and a
stored procedure which does not.
The result is to me a bit strange. When we use a "typed xml column" the
execution time is ~15s and the subtree cost is 428. When we use the
normal, untyped xml column, the execution time is the same BUT the
subtree cost is however 866!
Do you have any thoughts about the result and why the execution time
isn't affected?
Regards,
Johnny
On 2010-03-08 11:06, Uri Dimant wrote:
> Johny
>
> http://sqlblog.com/blogs/adam_machanic/archive/2010/01/12/t-sql-tuesday-002-is-it-xml-or-not.aspx
>
>
> http://sqlcat.com/msdnmirror/archive/2010/03/01/performance-tips-of-using-xml-data-in-sql-server.aspx
>
>
>
> "Johnny Persson"<a@a.a> wrote in message
> news:%23Ht3vLqvKHA.1796@TK2MSFTNGP02.phx.gbl...
>> Hi,
>>
>> we are having some performance issues regarding xml shredding.
>>
>> At this point we are extracting data from xmls from nearly 60 different
>> companies - and therefore 60 different xml structures. The total amount of
>> xml is about 350MB and we are trying to extract the data as fast as
>> possible.
>>
>> Our current system extracts, transforms and loads the data in about five
>> minutes. We would however like to do this in about one minute to be
>> pleased.
>>
>> We use the "nodes/cross apply"-technique to shred the xmls into our
>> internal format.
>>
>> This is how we shred the data.
>> ------------------------------
>>
>> 1) Load xml into a temporary table (#XmlTable)
>> 2) Set an xml index
>> 3) Query (like below)
>>
>> INSERT INTO #TransformedData
>> SELECT
>> T0.T.value('asasd', 'asdadd')
>> T1.T.value('asasd', 'asdadd')
>> FROM
>> #XmlTable
>> CROSS APPLY
>> data.nodes('asd') AS T0(T)
>> T0.T.nodes('level1') AS T1(T)
>>
>> DROP #XmlTable
>>
>> 4) Pass the temporary table #TransformedData into the common/shared
>> transformation procedure
>>
>> EXEC LookupData
>>
>> -------------------------------
>>
>> This is very I/O intensive and it makes the system slow. Are there any
>> other good ways to parse the xmls in the sql server? Should we perhaps
>> move the shredding outside the SQL environment into, for instance, a C#
>> method which bulk loads the data?
>>
>> Regards,
>> Johnny
>
>
|
|
0
|
|
|
|
Reply
|
Johnny
|
3/8/2010 2:03:26 PM
|
|
Johnny
Unfortunately I have very limited knowledge about using XML in SQL
Server..., if Erland jumps in .....
"Johnny Persson" <a@a.a> wrote in message
news:%23qXjhgsvKHA.5940@TK2MSFTNGP02.phx.gbl...
> Thank you for your answer,
>
> we use XML as column data type so the first article does not seem to
> affect us. It is however interesting how the query optimizer work - or not
> work :)
>
> The second article had a really interesting part about typed/untyped xmls.
> We have had no xml schema for any xml so I tried that..
>
> I created an xml schema for an xml file (~5MB) and compared the
> performance between a stored procedure which use the xml schema and a
> stored procedure which does not.
>
> The result is to me a bit strange. When we use a "typed xml column" the
> execution time is ~15s and the subtree cost is 428. When we use the
> normal, untyped xml column, the execution time is the same BUT the subtree
> cost is however 866!
>
> Do you have any thoughts about the result and why the execution time isn't
> affected?
>
> Regards,
> Johnny
>
> On 2010-03-08 11:06, Uri Dimant wrote:
>> Johny
>>
>> http://sqlblog.com/blogs/adam_machanic/archive/2010/01/12/t-sql-tuesday-002-is-it-xml-or-not.aspx
>>
>>
>> http://sqlcat.com/msdnmirror/archive/2010/03/01/performance-tips-of-using-xml-data-in-sql-server.aspx
>>
>>
>>
>> "Johnny Persson"<a@a.a> wrote in message
>> news:%23Ht3vLqvKHA.1796@TK2MSFTNGP02.phx.gbl...
>>> Hi,
>>>
>>> we are having some performance issues regarding xml shredding.
>>>
>>> At this point we are extracting data from xmls from nearly 60 different
>>> companies - and therefore 60 different xml structures. The total amount
>>> of
>>> xml is about 350MB and we are trying to extract the data as fast as
>>> possible.
>>>
>>> Our current system extracts, transforms and loads the data in about five
>>> minutes. We would however like to do this in about one minute to be
>>> pleased.
>>>
>>> We use the "nodes/cross apply"-technique to shred the xmls into our
>>> internal format.
>>>
>>> This is how we shred the data.
>>> ------------------------------
>>>
>>> 1) Load xml into a temporary table (#XmlTable)
>>> 2) Set an xml index
>>> 3) Query (like below)
>>>
>>> INSERT INTO #TransformedData
>>> SELECT
>>> T0.T.value('asasd', 'asdadd')
>>> T1.T.value('asasd', 'asdadd')
>>> FROM
>>> #XmlTable
>>> CROSS APPLY
>>> data.nodes('asd') AS T0(T)
>>> T0.T.nodes('level1') AS T1(T)
>>>
>>> DROP #XmlTable
>>>
>>> 4) Pass the temporary table #TransformedData into the common/shared
>>> transformation procedure
>>>
>>> EXEC LookupData
>>>
>>> -------------------------------
>>>
>>> This is very I/O intensive and it makes the system slow. Are there any
>>> other good ways to parse the xmls in the sql server? Should we perhaps
>>> move the shredding outside the SQL environment into, for instance, a C#
>>> method which bulk loads the data?
>>>
>>> Regards,
>>> Johnny
>>
>>
|
|
0
|
|
|
|
Reply
|
Uri
|
3/8/2010 2:26:05 PM
|
|
Johnny Persson (a@a.a) writes:
> The result is to me a bit strange. When we use a "typed xml column" the
> execution time is ~15s and the subtree cost is 428. When we use the
> normal, untyped xml column, the execution time is the same BUT the
> subtree cost is however 866!
The subtree cost is just an estimate, so take it for what it's worth.
Schemabound XML may have some advantages, if you extract a lot from the
same document, but it can also cut the other way, as schema vaildation
is quite expensive.
As I said in another post, you should look at how you address the nodes.
There is a lot to win in that area.
--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
SQL 2000: http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
|
|
0
|
|
|
|
Reply
|
Erland
|
3/8/2010 11:05:40 PM
|
|
"Johnny Persson" <a@a.a> wrote in message
news:%23Ht3vLqvKHA.1796@TK2MSFTNGP02.phx.gbl...
> INSERT INTO #TransformedData
> SELECT
> T0.T.value('asasd', 'asdadd')
> T1.T.value('asasd', 'asdadd')
> FROM
> #XmlTable
> CROSS APPLY
> data.nodes('asd') AS T0(T)
> T0.T.nodes('level1') AS T1(T)
Just adding to what Erland said, if your XML data looks like this:
<level1>
<asd>...</asd>
<asd>...</asd>
</level1>
Consider using the '//asd' path. Also consider putting a numeric predicate
on your .value() function calls like this 'asasd[1]'. It would help to have
some representational XML data if you can supply some (and some expected
results). The XML schema can give the optimizer some hints that can help
performance, but you have to weigh the benefit against the extra cost of
validating your data against it. Another option is to perform the shredding
outside of SQL Server in your application. If you don't want to bother with
SQLXML Bulk Loading you can use .NET to shred the data and bulk load it.
--
Thanks
Michael Coles
SQL Server MVP
Author, "Expert SQL Server 2008 Encryption"
(http://www.apress.com/book/view/1430224649)
----------------
|
|
0
|
|
|
|
Reply
|
Michael
|
3/9/2010 2:31:27 AM
|
|
Hi Erland,
Yes, below is a real example of one of our parsers. As far as I
understand the node addressing is correct.. Or am I wrong? It has
happened before :)
SELECT
-- /A/B
TG.T.value('@name', 'nvarchar(255)'),
TG.T.value('(ElementX)[1]', 'nvarchar(255)'),
TG.T.value('(ElementY)[1]', 'nvarchar(255)'),
-- /A/B/C
TOS.T.value('@type', 'nvarchar(255)'),
TOS.T.value('@name', 'nvarchar(255)'),
-- /A/B/C/D
TOO.T.value('@name', 'nvarchar(255)'),
TOO.T.value('@id', 'nvarchar(255)')
FROM
#XmlTable
CROSS APPLY Data.nodes('/A/B') AS TG(T)
CROSS APPLY TG.T.nodes('C') AS TOS(T)
CROSS APPLY TOS.T.nodes('D') AS TOO(T)
Regards,
Johnny
On 2010-03-09 00:05, Erland Sommarskog wrote:
> Johnny Persson (a@a.a) writes:
>> The result is to me a bit strange. When we use a "typed xml column" the
>> execution time is ~15s and the subtree cost is 428. When we use the
>> normal, untyped xml column, the execution time is the same BUT the
>> subtree cost is however 866!
>
> The subtree cost is just an estimate, so take it for what it's worth.
>
> Schemabound XML may have some advantages, if you extract a lot from the
> same document, but it can also cut the other way, as schema vaildation
> is quite expensive.
>
> As I said in another post, you should look at how you address the nodes.
> There is a lot to win in that area.
>
>
|
|
0
|
|
|
|
Reply
|
Johnny
|
3/9/2010 7:03:50 AM
|
|
Hi,
I have now tried to shred the data in a C# console project. The
performance gain is huge.
Which is the best option to pass the data to the SQL server? Bulk load
into a table variable or "send the data" through a table valued CLR method?
Regards,
Johnny
On 2010-03-09 03:31, Michael Coles wrote:
> "Johnny Persson" <a@a.a> wrote in message
> news:%23Ht3vLqvKHA.1796@TK2MSFTNGP02.phx.gbl...
>> INSERT INTO #TransformedData
>> SELECT
>> T0.T.value('asasd', 'asdadd')
>> T1.T.value('asasd', 'asdadd')
>> FROM
>> #XmlTable
>> CROSS APPLY
>> data.nodes('asd') AS T0(T)
>> T0.T.nodes('level1') AS T1(T)
>
> Just adding to what Erland said, if your XML data looks like this:
>
> <level1>
> <asd>...</asd>
> <asd>...</asd>
> </level1>
>
> Consider using the '//asd' path. Also consider putting a numeric
> predicate on your .value() function calls like this 'asasd[1]'. It would
> help to have some representational XML data if you can supply some (and
> some expected results). The XML schema can give the optimizer some hints
> that can help performance, but you have to weigh the benefit against the
> extra cost of validating your data against it. Another option is to
> perform the shredding outside of SQL Server in your application. If you
> don't want to bother with SQLXML Bulk Loading you can use .NET to shred
> the data and bulk load it.
>
|
|
0
|
|
|
|
Reply
|
Johnny
|
3/9/2010 7:07:55 AM
|
|
Johnny Persson (a@a.a) writes:
> Yes, below is a real example of one of our parsers. As far as I
> understand the node addressing is correct.. Or am I wrong? It has
> happened before :)
"Correct" and "best" are two different things.
> CROSS APPLY TG.T.nodes('C') AS TOS(T)
> CROSS APPLY TOS.T.nodes('D') AS TOO(T)
Here you should have /C and /D for best performance.
If you only specify C, XPath will return any node named C, not just top
nodes. And finding those takes more time.
--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
|
|
0
|
|
|
|
Reply
|
Erland
|
3/9/2010 8:13:51 AM
|
|
Johnny Persson (a@a.a) writes:
> I have now tried to shred the data in a C# console project. The
> performance gain is huge.
>
> Which is the best option to pass the data to the SQL server? Bulk load
> into a table variable or "send the data" through a table valued CLR
> method?
SqlBulkCopy or table-valued parameters if you are on SQL 2008.
--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
|
|
0
|
|
|
|
Reply
|
Erland
|
3/9/2010 8:15:27 AM
|
|
Ok,
thank you for all your answers. I will parse the xmls inside a
C#-project and pass the data through a table-valed parameter.
Thank you all, once again.
Regards
Johnny
On 2010-03-09 09:15, Erland Sommarskog wrote:
> Johnny Persson (a@a.a) writes:
>> I have now tried to shred the data in a C# console project. The
>> performance gain is huge.
>>
>> Which is the best option to pass the data to the SQL server? Bulk load
>> into a table variable or "send the data" through a table valued CLR
>> method?
>
>
> SqlBulkCopy or table-valued parameters if you are on SQL 2008.
>
>
|
|
0
|
|
|
|
Reply
|
Johnny
|
3/9/2010 8:26:27 AM
|
|
|
12 Replies
589 Views
(page loaded in 0.235 seconds)
|