Fog Creek Software
Discussion Board

Manipulating Very Large Text Files


I'm having to deal with some extremely large (millions of lines) text files.  In most cases these are data dumps from a third party that I am bulk inserting into a SQL database. 

The problem I have is that if the bulk insert fails because of an error on row 460,392, I need a way to examine that data, and either send the details back to the source of the data, or just fix it myself.

There are all kinds of great tools and utilities on unix that I can use to manipulate text files of any size.  I know that there are ports of these tools for windows.  However, I'm curious if anyone is familiar with Windows native tools that really help with this kind of work.

Jason
Monday, March 17, 2003

I've had success using TextPad ( http://www.textpad.com ) for editing ~50 MB files.  It handled files of this size with ease, so I imagine it could go much larger.  They claim ( http://www.textpad.com/about/specifications.html ) that the 32-bit version can handle "file sizes up to the limits of virtual memory".

Mike McNertney
Monday, March 17, 2003

http://www.freedownloadscenter.com/Utilities/Text_Editors_Q-Z/UltraEdit_32.html

Just me (Sir to you)
Monday, March 17, 2003

perl from
www.activestate.com
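
For the problem in the original post - pulling out row 460,392 so you can see what's wrong with it - something along these lines should do. It's only a sketch; the file name and row number are just whatever you pass on the command line:

#!/usr/bin/perl
# Print the suspect line (plus one line of context either side) from a
# huge file without ever holding more than one line in memory.
# One-liner equivalent:  perl -ne 'print if $. == 460392' dump.txt
use strict;
use warnings;

my ($file, $target) = @ARGV;                  # e.g.  dump.txt 460392
die "usage: $0 file line-number\n" unless defined $target;

open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>) {
    print "$.: $line" if abs($. - $target) <= 1;   # $. is the current line number
    last if $. > $target;                          # stop once we're past it
}
close $fh;

Run it as, say, "perl showline.pl dump.txt 460392" (the script name is just whatever you saved it as).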

Perl
Monday, March 17, 2003

Hi, GNU Emacs for WinNT should do it:
I work with 30 MB files on Windows without any problem.

Artist
Monday, March 17, 2003

For commercial software, try SlickEdit:
www.slickedit.com
It claims to have no limits on file size except the resources of the computer. I have used it for 10 MB files on a regular basis.

The cost is steep, though: 300 USD the last time I checked.

A Software Build Guy
Monday, March 17, 2003

http://www.slickedit.com should link.

A Software Build Guy
Monday, March 17, 2003

I can second SlickEdit - I've used it to edit files that are hundreds of megs in size, and it doesn't have any problems with them.  And it's available for your favorite OS... Windows/*nix/etc.

GiorgioG
Monday, March 17, 2003

There are any number of editors that can deal with large file sizes. Was this what you were actually looking for, or do you want something to do automated analysis?

Chris Tavares
Monday, March 17, 2003

I know sed is available for Windows.  Sed can edit the file by searching for a particular pattern, etc.
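
For that same sort of pattern-based fix, the thread's favourite (Perl) works too. Here's a rough sketch that patches only the bad row and writes a corrected copy - the file names, row number, and substitution are just placeholders:

#!/usr/bin/perl
# Sed-style fix of a single row: stream the dump through, patch the one
# bad line, and write everything to a new file.
use strict;
use warnings;

# All three values below are placeholders.
my ($in, $out, $target) = ('dump.txt', 'dump.fixed.txt', 460_392);

open my $src, '<', $in  or die "Can't read $in: $!";
open my $dst, '>', $out or die "Can't write $out: $!";

while (my $line = <$src>) {
    $line =~ s{N/A}{NULL} if $. == $target;   # whatever the real fix is
    print {$dst} $line;
}

close $src;
close $dst or die "Error writing $out: $!";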

Mike
Monday, March 17, 2003

I'm currently doing something similar, and Perl has worked perfectly.  (I'm using the ActiveState version.)

Lee
Monday, March 17, 2003

Try WordPad, which comes with Windows.
It may take some time because it is not intended for plain text, but it should work.
Just try it.

Boris Yankov
Monday, March 17, 2003

Our company uses V.  I routinely view large data files (greater than 500 MB) and it handles them with no problem.  It "chunks" the data into pieces, then loads it into memory, so you're never overloading your system.  http://www.fileviewer.com/

nathan
Monday, March 17, 2003

Re-reading your original post, I realize you want a pure MS answer.  You could write a script using Windows Script Host, the RegExp object, and the FileSystemObject in VBScript to do a line-by-line scan of the file, a la Perl or sed.  This can work - I have done it for smaller files - but it requires access to WSH 5.6 and VBScript 5.6, and a careful reading of the MSDN Library entry for scripting under the Platform API.  I much prefer ActiveState Perl in an automation case.
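
As a sketch of that line-by-line scan - in Perl rather than VBScript, since that's the tool I'd reach for anyway - something like this, where the record layout and file name are only assumptions:

#!/usr/bin/perl
# Line-by-line scan of a dump, printing any row that doesn't match the
# expected record layout. The layout below (12 pipe-delimited fields) is
# only an example - substitute whatever the real format is.
use strict;
use warnings;

my $file   = shift @ARGV or die "usage: $0 dumpfile\n";
my $record = qr/^(?:[^|]*\|){11}[^|]*$/;      # 12 pipe-delimited fields

open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>) {
    chomp $line;
    print "suspect row $.: $line\n" unless $line =~ $record;
}
close $fh;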

A Software Build Guy
Monday, March 17, 2003

I second UltraEdit32...

200 MB+ Oracle data files for the loader, and UltraEdit32 was the only app I found that would let me VIEW and SAVE.

codem0nkey
Monday, March 17, 2003

I have been using MultiEdit for a number of years and have been quite happy with its performance. I have opened 450 MB files with no problem. I use BULK INSERT to load these files into a SQL Server 2000 database. It costs about $139, though. http://www.multiedit.com

Himanshu Nath
Monday, March 17, 2003

I'd second Perl.

David Cross' Data Munging with Perl is an excellent book explaining how to get the job done.

http://www.manning.com/cross/

Alternatively, any scripting language with good stream support: Python, PHP, or ASP (using the FileSystemObject).
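
In that data-munging spirit, a first pass that just tallies field counts per row will usually point straight at whatever the bulk insert is choking on. A rough sketch in Perl - the tab delimiter is an assumption:

#!/usr/bin/perl
# Tally how many fields each row has; an unusual count is usually the
# row that breaks the bulk insert. Assumes a tab-delimited dump named
# on the command line (or fed on STDIN).
use strict;
use warnings;

my %tally;   # field count => [ rows with that count, first line seen ]

while (my $line = <>) {
    chomp $line;
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    my $n = scalar @fields;
    $tally{$n}[0]++;
    $tally{$n}[1] = $. unless defined $tally{$n}[1];
}

for my $n (sort { $a <=> $b } keys %tally) {
    printf "%2d fields: %8d rows (first at line %d)\n",
        $n, $tally{$n}[0], $tally{$n}[1];
}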

Ged Byrne
Tuesday, March 18, 2003


Seriously, Perl.

Matt H.
Tuesday, March 18, 2003

Another vote for TextPad.

I used it last year for doing exactly what you are doing: looking at a text file I was inserting to find out why a bulk insert/DTS package failed. It handles file sizes in excess of 500 MB (post office data), so finding text on line 5 million is no problem.

This doesn't solve the real problem, which is that your error handler needs improving, but for one-off jobs (like writing exception handling?) it's fine.

Justin
Tuesday, March 18, 2003


Thanks for all the suggestions, guys.

Justin - I am using SQL's BULK INSERT command to do these inserts.  What kind of error handling are you thinking of?

Jason
Tuesday, March 18, 2003
