Reading large csv-file and removing duplicates

  • Follow


I've large csv-file (4 Gb) with a lot of duplicates. The csv-files contains 
two fields.
If there is a duplicate, is that both fields.
So i need to read this large csv-file, remove the duplicates and write a new 
csv-file with only the unique values.
I used StreamReader and ReadLine to read the csv-file and Split to split the 
fields

After reading i added the line to a HashSet

     HashSet<string> DataRows = new HashSet<string>();

Check for duplicates (only first field for saving memory) before write to 
new csv-file.

     bool duplicate = DataRows.Add(DataColumn[0]);

But everytime an exception of type 'System.OutOfMemoryException' was thrown.

Any advice is welcome.


Thanks in advance,

Robertico 


0
Reply Robertico 12/5/2009 10:06:01 AM

"Robertico" <Robertico@nospam.notvalid> wrote in message 
news:uuq7MKZdKHA.2596@TK2MSFTNGP04.phx.gbl...
> I've large csv-file (4 Gb) with a lot of duplicates.
> [...] I used StreamReader and ReadLine to read the csv-file [...] But 
> everytime an exception of type
> 'System.OutOfMemoryException' was thrown.

    Reading a 4GB text file into memory is likely to require 8 GB of memory 
(plus overhead for the data structures), since .Net stores the strings as 
Unicode, requiring 2 bytes for each character. In order to address that much 
memory, you would need to run a 64-bit version of your program, and of 
course make that amount memory available to it (even if it is virtual 
memory).

    If you don't have the memory resources, you will need to resort to using 
the hard disk as a temporary storage to classify the data as you are 
searching for duplicates. One simple way to do this would be to import the 
data into a database, and then let the database do the work of sorting out 
the duplicates for you. Otherwise, if you want to do the work yourself, you 
will need to sort the data on disk. There are various ways to do this, for 
instance, read about Merge Sort.

0
Reply Alberto 12/5/2009 10:51:15 AM


On 5 Dec, 11:06, "Robertico" <Robert...@nospam.notvalid> wrote:
> I've large csv-file (4 Gb) with a lot of duplicates. The csv-files contains
> two fields.
> If there is a duplicate, is that both fields.
> So i need to read this large csv-file, remove the duplicates and write a new
> csv-file with only the unique values.
> I used StreamReader and ReadLine to read the csv-file and Split to split the
> fields
.... and so on

IF you can read the whole thing into memory then why not do a loop and
delete all exept one of occurances, on the other hand
if you are lazy it can be done using "any" database just import to a
table using both fields as a combined key and not allowing duplicates
1NF (Minimal Form), a constraint, would do it.

//CY
0
Reply CY 12/5/2009 11:01:11 AM

>    Reading a 4GB text file into memory is likely to require 8 GB of memory 
> (plus overhead for the data structures), since .Net stores the strings as 
> Unicode, requiring 2 bytes for each character. In order to address that 
> much memory, you would need to run a 64-bit version of your program, and 
> of course make that amount memory available to it (even if it is virtual 
> memory).

I've only 2Gb RAM available and my OS is 32Bit.

>    If you don't have the memory resources, you will need to resort to 
> using the hard disk as a temporary storage to classify the data as you are 
> searching for duplicates. One simple way to do this would be to import the 
> data into a database, and then let the database do the work of sorting out 
> the duplicates for you. ..

I tried to import the data in a Ms Access 2007 database, but the file is to 
large :-))

> Otherwise, if you want to do the work yourself, you will need to sort the 
> data on disk. There are various ways to do this, for instance, read about 
> Merge Sort.

Can you be clearer about this. It sounds interesting.

Regards,

Robertico. 


0
Reply Robertico 12/5/2009 11:33:23 AM

On Sat, 5 Dec 2009 12:33:23 +0100, "Robertico"
<Robertico@nospam.notvalid> wrote:

>>    Reading a 4GB text file into memory is likely to require 8 GB of memory 
>> (plus overhead for the data structures), since .Net stores the strings as 
>> Unicode, requiring 2 bytes for each character. In order to address that 
>> much memory, you would need to run a 64-bit version of your program, and 
>> of course make that amount memory available to it (even if it is virtual 
>> memory).
>
>I've only 2Gb RAM available and my OS is 32Bit.
Then your 4Gb file is too large.  One way round is to split it into
smaller files.  Create say ten smaller files by looking at the first
letter of the index entry on each line.  Pur A - B in one file, C - D
in the next file and so on.  Then you can remove duplicates from each
file separately and merge the de-duplicated files again at the end.

rossum



>
>>    If you don't have the memory resources, you will need to resort to 
>> using the hard disk as a temporary storage to classify the data as you are 
>> searching for duplicates. One simple way to do this would be to import the 
>> data into a database, and then let the database do the work of sorting out 
>> the duplicates for you. ..
>
>I tried to import the data in a Ms Access 2007 database, but the file is to 
>large :-))
>
>> Otherwise, if you want to do the work yourself, you will need to sort the 
>> data on disk. There are various ways to do this, for instance, read about 
>> Merge Sort.
>
>Can you be clearer about this. It sounds interesting.
>
>Regards,
>
>Robertico. 
>

0
Reply rossum 12/5/2009 12:06:32 PM

> ...Create say ten smaller files by looking at the first
> letter of the index entry on each line.  Pur A - B in one file, C - D
> in the next file and so on.  Then you can remove duplicates from each
> file separately and merge the de-duplicated files again at the end.

The file is not sorted.

Robertico 


0
Reply Robertico 12/5/2009 12:18:25 PM

"Robertico" <Robertico@nospam.notvalid> wrote in message 
news:%23c%23%23A7ZdKHA.4952@TK2MSFTNGP06.phx.gbl...
> I've only 2Gb RAM available and my OS is 32Bit.
> [...]
> I tried to import the data in a Ms Access 2007 database, but the file is 
> to large :-))

     Yes, Access has a file size limitation of 2GB. If you want to go the 
database route, you will need something bigger. Ordinarily, I would 
recommend Sql Server Express, but unfortunately the Express version is 
limited to 4GB per database, so it will still not serve you.

> [...] Can you be clearer about this. It sounds interesting.

     One way in which you could remove duplicates without reading all the 
data into memory is this:

Loop on the file on disk reading all the rows one by one. For each row do 
the following:
- Compute a hash value on the fields that you want to be unique (the 
GetHashCode method should serve you well).
- Use the hash code to store in a hashtable the offset in the file where the 
record that you read initially was located. This hashtable will probably be 
small enough to fit into your available RAM, since it only stores a numeric 
value and not your whole records.
- For each record, first look it up in the hash table. If it exists there, 
Seek to the offset in the file where it is stored, read the data and compare 
them to the record you are currently processing. If they match, then the 
current record is a duplicate and it can be discarded. Otherwise, continue 
comparing the rest of the records that have the same hasch code, since there 
could be more than one. After you have compared them all, if none match, 
save the record hash and offset to the internal table and save the complete 
record into an output file.

0
Reply Alberto 12/5/2009 12:22:45 PM

"Robertico" <Robertico@nospam.notvalid> wrote in message 
news:uFBJLUadKHA.2460@TK2MSFTNGP04.phx.gbl...
>> ...Create say ten smaller files by looking at the first
>> letter of the index entry on each line.  Pur A - B in one file, C - D
>> in the next file and so on.  Then you can remove duplicates from each
>> file separately and merge the de-duplicated files again at the end.
>
> The file is not sorted.

    Once you have split the file into smaller files, so that one of those 
pieces fits in memory, you can easily read it into a SortedList, so that it 
gets sorted in memory. You then dump it into disk again.
    Then you use Merge Sort on the small files. Basically, you open two 
files and read the first record of each one. You compare both records and 
save the smallest one into the output file. You then read the next record 
from the file that contained the one that you wrote, and repeat the 
operation. Of course, in this process you omit the duplicates. This 
mechanism can easily be extended to more than two files, if needed.

0
Reply Alberto 12/5/2009 12:40:35 PM

>
> I tried to import the data in a Ms Access 2007 database, but the file is to
> large :-))
>
> Robertico.

Use NTFS and MySQL (still free? dont know), rember a ISAM text (or
blob) filled only db will need to be configured a bit, the value of
myisam_data_pointer_size  can be from 2 to 7. A value of 4 allows
tables up to 4GB; a value of 6 allows tables up to 256TB. MS Access
might work for your private CD/DVD collection, but not for handling
more data.

There should be a number of C# examples for reading/writing the
database out there.

//CY

0
Reply CY 12/5/2009 1:38:21 PM

On Sat, 5 Dec 2009 13:18:25 +0100, "Robertico"
<Robertico@nospam.notvalid> wrote:

>> ...Create say ten smaller files by looking at the first
>> letter of the index entry on each line.  Pur A - B in one file, C - D
>> in the next file and so on.  Then you can remove duplicates from each
>> file separately and merge the de-duplicated files again at the end.
>
>The file is not sorted.
>
>Robertico 
>
That is not important, you have all the smaller output files open at
the same time:

Pseudocode:

  open original file for input
  open new files for output
  read first input record
  while not end of large file
    write current record to correct output file
    read next input record
  end while
  close all output files
  close input file

You can then eliminate duplicates from each of the smaller files in
turn.  By using the first letter of the index field you can be sure
that all duplicates will share a file.

Finally merge all the smaller files together to give a single final
file with no duplicates.

rossum

0
Reply rossum 12/5/2009 2:27:06 PM

> Use NTFS and MySQL (still free? dont know), rember a ISAM text (or
> blob) filled only db will need to be configured a bit, the value of
> myisam_data_pointer_size  can be from 2 to 7. A value of 4 allows
> tables up to 4GB; a value of 6 allows tables up to 256TB. MS Access
> might work for your private CD/DVD collection, but not for handling
> more data.

Running WAMP (Apache, php and MySQL) for Windows. (myisam_data_pointer_size 
value=6 )
But i'am not familiar with MySQL. How can i import my csv-file?

Thanks in advance,

Robertico 


0
Reply Robertico 12/5/2009 3:41:15 PM

On 5 Dec, 16:41, "Robertico" <Robert...@nospam.notvalid> wrote:
> > Use NTFS and MySQL (still free? dont know), rember a ISAM text (or
> > blob) filled only db will need to be configured a bit, the value of
> > myisam_data_pointer_size =A0can be from 2 to 7. A value of 4 allows
> > tables up to 4GB; a value of 6 allows tables up to 256TB. MS Access
> > might work for your private CD/DVD collection, but not for handling
> > more data.
>
> Running WAMP (Apache, php and MySQL) for Windows. (myisam_data_pointer_si=
ze
> value=3D6 )
> But i'am not familiar with MySQL. How can i import my csv-file?
>
> Thanks in advance,
>
> Robertico

Use a c# program to insert and extract the data, or load it..

LOAD DATA LOCAL INFILE '/importfile.csv'
INTO TABLE test_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(field1, filed2, field3);

is another way, it is more fun to make the C# program, not so fun to
just load the data.

//CY
0
Reply CY 12/5/2009 5:20:16 PM

Alberto Poblacion wrote:
> [...]
> Loop on the file on disk reading all the rows one by one. For each row 
> do the following:
> - Compute a hash value on the fields that you want to be unique (the 
> GetHashCode method should serve you well).
> - Use the hash code to store in a hashtable the offset in the file where 
> the record that you read initially was located. This hashtable will 
> probably be small enough to fit into your available RAM, since it only 
> stores a numeric value and not your whole records.
> - For each record, first look it up in the hash table. If it exists 
> there, Seek to the offset in the file where it is stored, read the data 
> and compare them to the record you are currently processing. If they 
> match, then the current record is a duplicate and it can be discarded. 
> Otherwise, continue comparing the rest of the records that have the same 
> hasch code, since there could be more than one. After you have compared 
> them all, if none match, save the record hash and offset to the internal 
> table and save the complete record into an output file.

IMHO, this suggestion is one of the best offered so far.  It's 
efficient, reasonably simple to implement, and offloads _all_ of the 
actual key data to the file system (minus the 4 or 8 bytes offset into 
the file, of course).  Of course, it adds the cost of a Dictionary 
(preferable) or Hashtable (key and value), as compared to a HashSet (key 
only) so will work best with relatively long keys.

The sorting ones aren't bad either, but of course they get complicated 
if the output needs to remain in the same order as the input (you have 
to add a new sort-order field to the intermediate files, and then 
re-sort the output once the duplicates have been removed; depending on 
the reduction of size of input, this could mean yet again partitioning 
the problem to sort segments of the output and then doing a merge sort 
to reassemble everything).

I don't usually suggest using a database for random file manipulation 
questions, but I think in this case that is also a good suggestion, just 
due to the sheer simplicity of it.  Access and SQL Express aren't the 
only databases out there, and as "CY" points out, MySQL can be 
configured to support larger databases, large enough to support your input.

Of course, the easiest solution would be to switch to a 64-bit OS.  Then 
you can allocate larger amounts of memory in your process, and the error 
would go away.  :)  (With only 2GB of RAM on the computer, you'll 
certainly run into disk swapping performance issues, but all the 
fallback solutions will rely on some kind of disk i/o anyway, so 
performance is going to suffer one way or the other using those techniques).

Another option you might try is to store your column values more 
efficiently.  Rather than using a HashSet<string>, use a HashSet<byte[]> 
and store the raw bytes from the file rather than the UTF-16 of a 
System.String instance (or re-encode them as ASCII or UTF-8, if that 
works better for you).  If you know that the keys contain only 
characters that are in a limited set of characters, you could even 
compress the bytes further, either just by reducing the 
bits-per-character (treat the byte array as a bit-stream), or by using 
something like Huffman encoding (simple to implement and works well on 
ASCII text).

In fact, depending on the data requirements for a file offset (probably 
8 bytes if your file is 4GB or larger, though you could get complicated 
and store fewer than 8 bytes, throwing out the most significant bytes 
that you know are always 0 for the size of the file you're dealing 
with), and the actual length of a key being hashed, this approach could 
work as well as or better than the "hash, storing offsets to file" approach.

Finally, you might try writing your own hash data structure, and 
including the ability to set in advance the exact size of the hash 
table, and maybe even pre-allocate a pool of entry data structures for 
the table.

Setting the size of the table in advance could be very helpful, because 
at least part of your error is likely to be caused by the fact that 
HashSet has to reallocate its internal data structures as it grows, and 
each time that happens, you have two copies of the data in memory at 
once.  The reallocation also _will_ result in fragmentation of the 
large-object heap, which can result in an out-of-memory error even when 
there is theoretically space in the process's virtual address space to 
allocate more memory.

Pre-allocating a pool of entry data structures as an array from which 
you pull indices rather than references to actual objects allows that to 
also be created once (or at least many fewer times, if you decide, for 
example, to allocate arrays of 1000 at a time), reducing overhead in the 
memory manager.

This last suggestion is not necessarily as pointless as it might seem. 
Depending on the nature of each row of data in your CSV file, it stands 
to reason that the _actual_ memory cost of your hashing should be quite 
a bit less than the 4GB the file requires, on the assumption that you're 
checking only a small subset of the entire row (two columns out of how 
many, of what kind of data?).

So, sure...you're definitely pushing the limits of what can be done in 
the 2GB of virtual address space allotted to a 32-bit process (3GB if 
you boot with /3GB and set the /LARGEADDRESSAWARE for your 
application...I don't recall if this is set by default for managed 
apps).  But assuming the columns that are used to determine duplication 
are a less than, say, 1/6th or 1/8th of each row of data, then it seems 
likely to me a 4GB file could be able to be processed even in a 32-bit 
app, without resorting to using the file system as temporary storage 
(other than implicit disk swapping, of course).  "All" you need is a 
more efficient implementation of the data structures.

If performance is very important, it could well be worth the extra 
effort to approach the problem that way, rather than pushing the problem 
out to the file system (which is admittedly simpler in some ways).

Note the various proposals fall into two broad categories:

     -- Will still fail for slightly larger files
     -- Scales well to significantly larger files

All of the memory-based solutions, while they may address your current 
needs, can still be pushed to the breaking point for files even larger 
than what you're dealing with.  The database and file-system-only 
solutions are limited only by the file system itself, and so are likely 
to continue working on MUCH larger inputs, albeit at perhaps a larger 
performance cost than the memory-based solutions.

In the end, it may well be that just adding more RAM and a 64-bit OS is 
the best solution.  :)

Pete
0
Reply Peter 12/5/2009 6:38:58 PM

Everyone warmly thanked. In particular, the detailed explanation.
In this particular case, the solution was a linux script.
Not what you would expect here, but is was efficient and the solution to my 
problem.
I've learned a lot about memory usage and reading large files.
For another problem, I now know what to look out.

Regards,

Robertico 


0
Reply Robertico 12/7/2009 5:52:17 PM

13 Replies
1371 Views

(page loaded in 0.63 seconds)

Similiar Articles:


















8/1/2012 9:26:30 AM


Reply: