Fog Creek Software
Discussion Board




Handling huge text files in windows

Hi all,

I need to periodically (twice a month) import some huge text files into my Oracle 9i Database. The files range between 10 to 100 million records each. (10 GB aprox. average per file)

I plan to use Oracles Sql*loader ( http://www.orafaq.com/faqloadr.htm ) to actually import the data, but the text files come from three different systems (legacy systems from three companies that were acquired by my company).

So, I have *three* different formats. Al file hold the same info, but in two cases the records are character delimited (distinct layouts) and on the third case the records are fixed lenght.

I plan to break up the import process in two stages:

1.- Data mapping: Convert each specific format into a single standart format using some custom-made mapper utility.

2.- Use Sql*Load to import the unified format to the database.

I don't have exact requirements for batch duration times, but I would like to keep them as short as possible.

I have been thinkng about programming each of the three mapping processes in C#, (like the rest of the application).

In past implementations (on UNIX platforms) we have used awk to parse and convert similar files. But this time it's Windows, and I haven't decided yet what to do:

a) I could write my own parser/mapper with manage code (C#).

b) Write it in unmanage code (ughh!)

c) Use a windows awk implementation. Do you know which one is the best?

d) I could bypass completly the mapping process and use distinct Sq*Loader scripts for each file format. I don't want to do this for 'architectural cleaness'.

Any other idea? What would you do?
Of course performance is the biggest issue here.

The server in which this process would run is a dual P4@2GHz with 4GB RAM. Windows 2000 Advanced Server.

Plase don't flame for wanting my problem solved by JOS readers. I *will* do my homework and measure. I just want to see if some can suggest anything else =)

thanks!

.NET Developer
Tuesday, April 06, 2004

'architectural cleaness' - huh?

If it were me I'd just do separate SQL*Loader scripts and be done with it.  There's no sense in processing 10GB twice when once will do.

Doug
Tuesday, April 06, 2004

You are *so* right. I feel the same,

One of the requirements was that the system must specify its own file format so future data sources can be integrated in the process.

But I could bypass the mapping for these three cases altoghether.

thanxs

.NET Developer
Tuesday, April 06, 2004

A good idea regardless of implementation would be if you could process the file in a one-pass/streaming manner (i.e. one program/function reads from the file, another munges the data, then another feeds the database, all at once). The IO overhead of reading in 10GB is probably going to dwarf whatever processing you need to do.

Personally I'd use perl or python for this sort of thing.

Dan Maas
Tuesday, April 06, 2004

I've never dealt with a 10 gig text file, but I'd say that it makes a lot more sense to process the data once and only once. Then, just make the best of the three current formats the new standard and you're golden.

JWA
Tuesday, April 06, 2004

"One of the requirements was that the system must specify its own file format so future data sources can be integrated in the process."

Why design a fourth format when you already have three. Write the three legacy->sql-loader converters and whichever one can be imported the fastest becomes the standard. Or you could just make the sql-loader format the standard -- which on second thought seems to make the most sense.

dmooney
Wednesday, April 07, 2004

Design a standard XML schema for the data.  Convert all the text files to XML.  Write a program to parse the XML files and create a standard format for SQL*Loader.

Management will love it.

;-)

Buzzword, schmuzword
Wednesday, April 07, 2004

Beleive it or not, *THAT* was JUST what management proposed.

I'm not kidding.

.NET Developer
Wednesday, April 07, 2004

The book "Data Munging in Perl' is very good.

http://www.manning.com/cross/

I use Ruby now, but the principles remained the same.

Ged Byrne
Wednesday, April 07, 2004

I'd love to know how large the 10GB file gets when you start throwing all those redundant tags around.

Any bets?

MR
Wednesday, April 07, 2004

Obviously 42GB. ;-)

I agree with the consensus - write the three loaders - choose one to be the standard. Choosing which one will rest on a few issues:
- if one format loads markedly faster than the others, it's a clear choice
- if these three formats are from three different organizations, will there be political pressure as to who ends up being "the standard"? (As a techie I wouldn't let this sway my choice, but be aware of it)
- are any of the formats "cleaner" or easier to read/parse?
etc.

Philo

Philo
Wednesday, April 07, 2004

Especially for Fixed width format file, have you considered using SAS? I think SAS not only makes it easy to process these kind of text files but I have found it to be amazingly fast!


JD

JD
Wednesday, April 07, 2004

Use common sense here and go with three SQL*Loader control files. If the data was different, then create three separate load_* tables or one that is a superset of all 3.

Seriously, you'll be done with the control files in 15 minutes - that tools saves SO much time. The database load table becomes your "unifying format". Why duplicate that in a file?

AEB
Wednesday, April 07, 2004

One more thing I should add about XML...

The bigger the file gets, the increase in memory and processor speed required is EXPONENTIAL.

We got 300MB EDI files and we processed them in Biztalk 2002, converting them to XML before loading them into SQL Server. We needed to buy the fastest XEON with 2 gigs of ram to get in under 5 hours for one file. Of course, the data had complex hierarchies and involved many issues, but this "common XML" approach caused more pain then benefit.

AEB
Wednesday, April 07, 2004

Not sure if this is an option for you, but can you use DTS to put the data into SQL Server, then migrate the data into Oracle by way of a linked server?  I know SQL Server's DTS import (especially Bulk Loads) scream for data imports, and 10GB is a rather large import...

Ray Beckett
Wednesday, April 07, 2004

Thanks everyone. I will try to convince the team to bypass the intermediate 'standard' file.

I will have to justify it, but that will be quite easy with the first tests.

.NET Developer
Thursday, April 08, 2004

"The bigger the file gets, the increase in memory and processor speed required is EXPONENTIAL."

Only if you try to recreate the hierarchical file structure in memory, like DOM parsers do.

But you could also use a lightweight forward-only text reader with just enough knowledge of XML to validate incoming elements, separate names from contents and attributes, and so on. .NET provides such a reader in the XmlReader class (and the corresponding XmlWriter).

Such a reader shouldn't create a significantly higher system load than a plain text reader.

Chris Nahr
Friday, April 09, 2004

*  Recent Topics

*  Fog Creek Home