Fog Creek Software
Discussion Board




Most efficient file i/o for reads

I'm tasked with writing a utility which will read in a series of large plain-text files (avg. size 100 mb) and perform simple calcs (add, subtract, multiply, divide) on the contents.

I'd like to use perl, but I'm worried about performance. Perl v5.8.0, running on a linux i686 with 2 gb of memory, is available to me, but is that the best choice for this?

Chill
Wednesday, August 25, 2004

.. and what kind of read times do you get on your disks?

i like i
Wednesday, August 25, 2004

Well, all the Python freaks are going to tell you to use Python and the .Net'ters will tell you to turn the thing into a web app.  etc. etc.

If you're really worried about performance, though, you'll need to write it in something low-level like C.

On the other hand, 100mb isn't THAT big, you should be able to handle simple calculations with reasonable performance on the machine you specified. Why not create a 100mb file full of numbers and use Perl to extract and add them and see how it works? Shouldn't take more than a couple of minutes to get everything working, and as a result you'll have a real benchmark and not anonymous opinions from the Joel on Software forum.
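A minimal sketch of that kind of throwaway benchmark, written in Python rather than Perl purely for illustration. The file name, the whitespace-separated layout, and the helper names write_sample and sum_file are assumptions, not anything from the original question.

```python
import os
import random
import tempfile
import time


def write_sample(path, n_lines, per_line=10, seed=0):
    """Write n_lines of whitespace-separated decimal numbers to path."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n_lines):
            f.write(" ".join(f"{rng.uniform(0, 1000):.3f}" for _ in range(per_line)))
            f.write("\n")


def sum_file(path):
    """Stream the file line by line and add up every number in it."""
    total = 0.0
    count = 0
    with open(path) as f:
        for line in f:
            for field in line.split():
                total += float(field)
                count += 1
    return total, count


if __name__ == "__main__":
    # Hypothetical scratch file; raise n_lines to get closer to 100 MB.
    path = os.path.join(tempfile.gettempdir(), "numbers_sample.txt")
    write_sample(path, n_lines=100_000)
    start = time.perf_counter()
    total, count = sum_file(path)
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(path) / 1e6
    print(f"{count} numbers, {size_mb:.1f} MB, sum={total:.3f}, {elapsed:.2f} s")
    os.remove(path)
```

The file is read one line at a time, so memory use stays flat however large the input gets. The timing covers both the disk read and the text-to-number conversion.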

Using a language / technology you're comfortable with should more than make up for the performance loss in your case.

  -tim

a2800276
Wednesday, August 25, 2004

How about some Perl / shell scripting tying together awk et al.  The Unix utilities are already in C and would be fast.

Idot
Wednesday, August 25, 2004

Since you're not changing the contents of the file (or at least you didn't say you were) why not just memory map the file and go from there (assuming you were planning on loading the whole thing into memory first).

Otherwise you can't get much faster than using your OS's read() call -- under the hood, all your favorite programming languages use it too.
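For reference, a rough sketch of the memory-mapping approach using Python's standard mmap module. It assumes the file holds whitespace-separated numbers; the function name sum_numbers_mmap is made up for this example.

```python
import mmap
import os
import tempfile


def sum_numbers_mmap(path):
    """Sum whitespace-separated numbers in a file via a read-only memory map."""
    total = 0.0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # readline() returns b"" once the end of the mapping is reached.
            for line in iter(mm.readline, b""):
                for field in line.split():
                    total += float(field)
    return total


if __name__ == "__main__":
    # Tiny throwaway input file, just to show the call.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("1.5 2.5\n3 4\n")
    print(sum_numbers_mmap(tmp.name))
    os.remove(tmp.name)
```

The OS pages the file in on demand, so nothing is copied into a user-space buffer up front. Whether that beats plain buffered reads for a sequential pass depends on the system, so it is worth timing both.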

FS Cache
Wednesday, August 25, 2004

Jeez, there are so many ways that you can do this, I'd recommend that you pick a language and framework that you are familiar with and try it out.

Your familiarity with the language will give you a measurable advantage, especially since it's impossible to optimise before you've written any code.


Wednesday, August 25, 2004

==>If you're really worried about performance, though, you'll need to write it in something low-level like C.

Nah -- C is for wimps. <grin>

If you're *really* worried about performance, you'll go right to assembly language for your platform, you'll skip the OS file handling APIs and drop right to the BIOS (or equivalent) to read the disk sectors directly.

Now *that's* performance.

<joke>

Sgt. Sausage
Wednesday, August 25, 2004

Premature optimization is the root of all evil.
-- C. A. R. Hoare

Do a quick proof of concept in Perl.

If it's fast enough, use Perl.
If it's not - then look at the alternatives.

RocketJeff
Wednesday, August 25, 2004

I'm willing to bet Perl is fast enough for your tasks...

In my experience the bottleneck for things like this is often the ASCII->binary conversion (scanf() or similar).
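One way to check that claim on your own data is to time a raw read separately from a read-plus-parse pass. This is a sketch in Python, not a rigorous benchmark; the sample file layout and the function names are assumptions.

```python
import os
import tempfile
import time


def time_read_only(path, chunk_size=1 << 20):
    """Time reading the file in binary chunks without parsing anything."""
    start = time.perf_counter()
    n_bytes = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            n_bytes += len(chunk)
    return time.perf_counter() - start, n_bytes


def time_read_and_parse(path):
    """Time reading the file and converting every field to a float."""
    start = time.perf_counter()
    total = 0.0
    with open(path, "rb") as f:
        for line in f:
            for field in line.split():
                total += float(field)
    return time.perf_counter() - start, total


if __name__ == "__main__":
    # Hypothetical scratch file with three numbers per line.
    path = os.path.join(tempfile.gettempdir(), "parse_cost_sample.txt")
    with open(path, "w") as f:
        for i in range(200_000):
            f.write(f"{i}.25 {i}.5 {i}.75\n")
    read_s, n_bytes = time_read_only(path)
    parse_s, total = time_read_and_parse(path)
    print(f"read only:    {read_s:.3f} s for {n_bytes} bytes")
    print(f"read + parse: {parse_s:.3f} s (sum={total})")
    os.remove(path)
```

A freshly written file will usually still be in the OS cache, so the read-only number here reflects cached reads rather than cold disk access. The gap between the two timings is a rough measure of the conversion cost.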

Dan Maas
Wednesday, August 25, 2004

How long will it take you to write the perl code? A couple of hours at most?

Give it a try, and see how slow it is. Since you're working on a text file, I'm willing to bet you won't be able to get too much better regardless of what language you use.

If you were using a *binary* file, though, I'd suggest going with C. True story: I was building a device driver, and as part of debugging I was getting out about 15 mb of binary formatted log files, and I needed to turn them into something I could read. I sat down with perl (4.something at the time) and wrote a program that read it, but it looked like it would take an hour or so to crunch through the logs. Tried Python as well (v1.2 I think)... just as slow.

Rewrote in C: The C program was done in 30 seconds.

The difference here is that perl and python are already very good at handling text, but the binary bit-twiddling I needed to do was very inefficient. In the binary file case, C rules as a bit twiddling language.
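To give an idea of what that kind of bit twiddling looks like from a scripting language, here is a small Python sketch using the standard struct module. The record layout (a 32-bit timestamp, a 16-bit event code, and 16 bits of flags) and the meaning of the flag bits are entirely hypothetical; the actual log format is not described in this thread.

```python
import struct

# Hypothetical record layout (an assumption, not the poster's format):
# little-endian uint32 timestamp, uint16 event code, uint16 flags.
RECORD = struct.Struct("<IHH")


def decode_log(data):
    """Decode packed records into (timestamp, event, error_bit, channel) tuples."""
    out = []
    # iter_unpack requires len(data) to be a multiple of RECORD.size.
    for timestamp, event, flags in RECORD.iter_unpack(data):
        error_bit = bool(flags & 0x8000)  # top bit marks an error (assumed)
        channel = flags & 0x000F          # low 4 bits hold a channel number (assumed)
        out.append((timestamp, event, error_bit, channel))
    return out


if __name__ == "__main__":
    sample = RECORD.pack(1000, 7, 0x8003) + RECORD.pack(1001, 9, 0x0002)
    for rec in decode_log(sample):
        print(rec)
```

Every record still costs a tuple allocation plus interpreted mask operations, which is the kind of per-item overhead a C loop over a struct avoids.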

Chris Tavares
Wednesday, August 25, 2004
