Merging XML Files

Hi all,

I'm hoping that someone could help me with a relatively simple problem
I'm having. I have a number of quite large XML files - say 150meg each,
and I need to perform a merge operation on them. However, because each
individual document has its own XML node and root element, I need to
find the most efficient way of stripping those elements out where
appropriate so that the relevant data can be merged into one large file.

On average, there will be three files to be merged in this fashion. Can
anyone advise on what the best approach to this would be? I've been
attempting to read the relevant files into memory and do text
manipulation to eliminate the header and footer elements that I don't
need, but I'm not having much luck. I'm beginning to wonder if it would
be better to try and use some of the XML APIs in .NET rather than trying
to treat it as purely a text manipulation and file I/O problem.

Thanks to anyone who can advise on a possible approach.

Best regards 

Simon
simon
2/15/2010 10:32:30 PM
dotnet.languages.csharp

On 15-02-2010 17:32, simon wrote:
> I'm hoping that someone could help me with a relatively simple problem
> I'm having. I have a number of quite large XML files - say 150meg each,
> and I need to perform a merge operation on them. However, because each
> individual document has its own XML node and root element, I need to
> find the most efficient way of stripping those elements out where
> appropriate so that the relevant data can be merged into one large file.
>
> On average, there will be three files to be merged in this fashion. Can
> anyone advise on what the best approach to this would be? I've been
> attempting to read the relevant files into memory and do text
> manipulation to eliminate the header and footer elements that I don't
> need, but I'm not having much luck. I'm beginning to wonder if it would
> be better to try and use some of the XML APIs in .NET rather than trying
> to treat it as purely a text manipulation and file I/O problem.
>
> Thanks to anyone who can advise on a possible approach

Sounds as if just treating them as text files and using
StreamReader & StreamWriter would work. For the same approach,
but XML-aware, use XmlTextReader and XmlTextWriter.
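
To give an idea of what the XML-aware route looks like, here is a
minimal sketch, not a finished solution. It assumes the data you care
about sits in the direct children of the root element, and it uses
XmlReader.Create, the factory-based way of getting a streaming reader,
rather than constructing XmlTextReader directly. The class and method
names (RootChildCounter, CountRootChildren) are made up for
illustration. The reader only ever holds the current node, so the size
of the file should not matter much.

```csharp
using System.IO;
using System.Xml;

static class RootChildCounter
{
    // Hypothetical helper: counts the direct children of the root element
    // while streaming, so the document is never loaded as a whole.
    public static int CountRootChildren(TextReader input)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            // Skips the XML declaration and stops on the root element.
            reader.MoveToContent();
            if (reader.IsEmptyElement)
            {
                return 0; // e.g. <document /> has no children
            }

            reader.Read(); // step inside the root element
            while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    count++;
                    reader.Skip(); // jump past this child and its whole subtree
                }
                else
                {
                    reader.Read(); // whitespace, comments and the like
                }
            }
        }
        return count;
    }
}
```

Because Skip() consumes each child together with its end tag, the only
end tag the loop can stop on is the root's own, which is exactly the
"footer" that is awkward to find with plain text handling.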

Arne
Arne
2/15/2010 10:51:50 PM
simon wrote:
> Hi all,
> 
> I'm hoping that someone could help me with a relatively simple problem
> I'm having. I have a number of quite large XML files - say 150meg each,
> and I need to perform a merge operation on them. However, because each
> individual document has its own XML node and root element, I need to
> find the most efficient way of stripping those elements out where
> appropriate so that the relevant data can be merged into one large file.

To be clear, do you mean this:

File A:

<document>
   <dataA />
   <dataB />
   <dataC />
</document>

File B:

<document>
   <dataD />
   <dataE />
   <dataF />
</document>

Output:

<document>
   <dataA />
   <dataB />
   <dataC />
   <dataD />
   <dataE />
   <dataF />
</document>

?

Are there possibly duplicated elements between the input files?  If so, 
is it okay for them to be duplicated in the output, or does the 
duplication itself need to be merged somehow?

> On average, there will be three files to be merged in this fashion. Can
> anyone advise on what the best approach to this would be? I've been
> attempting to read the relevant files into memory and do text
> manipulation to eliminate the header and footer elements that I don't
> need, but I'm not having much luck. I'm beginning to wonder if it would
> be better to try and use some of the XML APIs in .NET rather than trying
> to treat it as purely a text manipulation and file I/O problem.
> 
> Thanks to anyone who can advise on a possible approach

I agree with Arne that it seems like simply reading the files and 
writing a new one would be fine.  I'd prefer the XML-specific 
reader/writer approach, to avoid any possibility of breaking the 
structure of each XML document (which it sounds like you are already 
having trouble with).

If you can be more specific about what you believe it means to "merge" 
two or more XML documents, it's possible you could get even better, more 
specific advice.

Pete
Peter
2/16/2010 1:50:36 AM
Hi guys,

Thanks for that. Pete - you're right - the output of the merge should be:

<document>
   <dataA />
   <dataB />
   <dataC />
   <dataD />
   <dataE />
   <dataF />
</document>

Duplicates shouldn't occur - it should be as simple as snapping the 
<xml> and root nodes off and then slapping them together.

The issue I'm having is in solving the problem whilst not having to read 
the whole file into memory, because they are too large.

The bit I'm struggling with is where I need to remove the last 
</rootNode></xml> node from any given document, as I can't find an 
effective way of reading to that part of the file and snapping the end 
off. If I read the whole document in, I run out of memory. If I use some 
sort of "chunking" mechanism (which I think is exactly what I want), I 
can't figure out how to detect the </rootNode> portion then snap it off.

If anyone can advise on an easy approach - I'd be very grateful.

Thanks

Simon
Simon
2/16/2010 10:08:32 AM
Simon wrote:
> [...]
> The issue I'm having is in solving the problem whilst not having to read 
> the whole file into memory, because they are too large.
> 
> The bit I'm struggling with is where I need to remove the last 
> </rootNode></xml> node from any given document, as I can't find an 
> effective way of reading to that part of the file and snapping the end 
> off. If I read the whole document in, I run out of memory. If I use some 
> sort of "chunking" mechanism (which I think is exactly what I want), I 
> can't figure out how to detect the </rootNode> portion then snap it off.
> 
> If anyone can advise on an easy approach - I'd be very grateful.

Arne's suggestion to use XmlTextReader/Writer should work fine.  Have 
you looked at those classes?  It should be no more difficult than:

   – for the first file, read and copy the outer-most document element
   – for every other file, read past the outer-most document element
     without copying it to the output
   – for every file, read and copy every bit of content within the
     outer-most document element
   – for all files except the last, read past without copying the
     element end for the outer-most document element
   – finally, for the last file, read and copy the element end for the
     outer-most document element
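
The steps above can be sketched roughly as follows. Treat it as a
starting point under a few assumptions rather than a definitive
implementation: every input has the same root element, the data to keep
is in the root's direct child elements, and text or comments placed
directly under the root can be dropped. The names XmlMergeSketch and
Merge are made up for illustration. It uses XmlReader.Create and
XmlWriter.Create, the factory equivalents of XmlTextReader and
XmlTextWriter. Instead of tracking which file is "the last", it simply
never copies any input's root end tag and writes one end tag of its own
once all files are done.

```csharp
using System.Collections.Generic;
using System.Xml;

static class XmlMergeSketch
{
    // Hypothetical helper (name and signature are illustrative only).
    // Copies the child elements of every input's root under one shared
    // root, streaming node by node so no input is held in memory whole.
    public static void Merge(IList<string> inputPaths, string outputPath)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;

        using (XmlWriter writer = XmlWriter.Create(outputPath, settings))
        {
            bool rootWritten = false;

            foreach (string path in inputPaths)
            {
                using (XmlReader reader = XmlReader.Create(path))
                {
                    // Skips the XML declaration and stops on the root element.
                    reader.MoveToContent();
                    bool rootIsEmpty = reader.IsEmptyElement;

                    if (!rootWritten)
                    {
                        // First file only: copy the root start tag and its attributes.
                        writer.WriteStartDocument();
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        rootWritten = true;
                    }

                    if (rootIsEmpty)
                    {
                        continue; // nothing inside this root to copy
                    }

                    reader.Read(); // step inside the root element
                    while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
                    {
                        if (reader.NodeType == XmlNodeType.Element)
                        {
                            // Copies the whole child subtree and advances to the next sibling.
                            writer.WriteNode(reader, false);
                        }
                        else
                        {
                            reader.Read(); // whitespace, comments and other non-element nodes
                        }
                    }
                    // This file's root end tag is simply never copied.
                }
            }

            if (rootWritten)
            {
                writer.WriteEndElement();  // the single end tag of the merged root
                writer.WriteEndDocument();
            }
        }
    }
}
```

Calling it would look something like
XmlMergeSketch.Merge(new List<string> { "a.xml", "b.xml", "c.xml" }, "merged.xml");
where the file names are placeholders. Only one child subtree is in
flight at a time, so memory use depends on the largest single child
element, not on the size of the files.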

If you have tried the above and cannot get it to work, you should post a 
concise-but-complete code example showing what you've tried and how it 
doesn't work for you.  Then some specific advice with respect to your 
attempt can be offered.

If you have not tried the above, well…you should.  :)

Pete
Peter
2/16/2010 10:21:56 AM
"simon" <noyhanks@hotmail.com> wrote in message 
news:1305664661287965379.168304noyhanks-hotmail.com@news.microsoft.com...
> Hi all,
>
> I'm hoping that someone could help me with a relatively simple problem
> I'm having. I have a number of quite large XML files - say 150meg each,
<<>>
> On average, there will be three files to be merged in this fashion. Can
> anyone advise on what the best approach to this would be? I've been

I would be inclined to look at SQL Server and SSIS myself.
Maybe that's not an option for you, but servers, big files and overnight
batch processing kind of go together in my mind.

SSIS is pretty efficient and can do some dead clever stuff.
Large files are often one of those things that arrive overnight, or where
some time tomorrow is fine.
May not be appropriate for some reason, but just thought I'd run it past
you.

Andy
2/16/2010 11:25:16 AM