Fog Creek Software
Discussion Board




Algorithm for evaluating a WinZip compression ratio


Problem :
-------------

Platform : Windows XP/NT

You've got 15,000 text files (EDI files) in one folder.

These 15,000 files represent a total of 800 MB.

Is there a way/technique/algorithm which can compute
how much WinZip will be able to compress them without
actually compressing all 15,000 files to check the size of the ZIP file?

~ CycleBurner ~
Friday, March 26, 2004

Platform : NYS Lottery
You've got 54 numbered balls to choose from.

These 54 numbers represent a $17 million prize if you can correctly predict which 6 numbers will be drawn.

Is there a way/technique/algorithm which can compute
which numbers will come out before the drawing?

;-)
Friday, March 26, 2004

Statistically, and over all possible sets of data, your compressed data will be slightly larger than the original files.  This is guaranteed by information theory.

Anything else is just guesswork. ;)

Flamebait Sr.
Friday, March 26, 2004

Statistically, you'd expect positive compression if they're text files.  ;)

schmoe
Friday, March 26, 2004

Your best bet is probably just to compress a representative sample of them and extrapolate the ratio.  Keep in mind that large files are likely to compress better than small ones, so a simple ratio isn't quite good enough, depending on the accuracy you want.
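
Something like this, say in Python (an untested sketch; the folder path and sample size are made up, and zipfile's deflate should be close to WinZip's default method but not byte-identical):

import os, random, zipfile

folder = r"C:\edi_files"   # hypothetical location of the 15,000 files
files = [os.path.join(folder, f) for f in os.listdir(folder)]
sample = random.sample(files, min(200, len(files)))

orig_bytes = sum(os.path.getsize(f) for f in sample)
with zipfile.ZipFile("sample.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for f in sample:
        zf.write(f, os.path.basename(f))
zipped_bytes = os.path.getsize("sample.zip")

ratio = zipped_bytes / float(orig_bytes)
total_bytes = sum(os.path.getsize(f) for f in files)
print("estimated archive size: about %d bytes" % int(total_bytes * ratio))

Extrapolating by total bytes rather than by a per-file average at least keeps the big files from being underweighted.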

If you really wanted to get tricky, you could read up on zip's data modeling, which would let you write something that analyzes each file and tells you roughly how predictable the input stream is - although that's a decent part of the way to writing your own zip, and probably not worth the effort.
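
If you just want a cheap per-file "predictability" number without writing a compressor, an order-0 (byte-frequency) entropy estimate is the usual first stab. Rough sketch in Python (the file name is made up); note that it only sees symbol frequencies, not the repeated phrases deflate also exploits, so on repetitive text zip will usually do better than this number suggests:

import math

def entropy_bits_per_byte(data):
    # order-0 entropy: bits per byte a pure symbol coder would need
    counts = {}
    for b in data:
        counts[b] = counts.get(b, 0) + 1
    n = float(len(data))
    return -sum((c / n) * math.log(c / n, 2) for c in counts.values())

with open("some_edi_file.txt", "rb") as fh:   # hypothetical file name
    data = fh.read()
bits = entropy_bits_per_byte(data)
print("order-0 estimate: about %.0f%% of original size" % (100.0 * bits / 8))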

schmoe
Friday, March 26, 2004

Randomly pick 100 of them, zip them, and then multiply the result by 150?  You could get fancy: find the average and standard deviation of the compression ratios in the group, and work out a probability curve for how big the whole thing will be.
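
A rough sketch of the fancy version (Python; the folder path is invented, and per-file deflate via zlib ignores zip's per-entry header overhead, so treat it as ballpark only):

import os, random, zlib, math

folder = r"C:\edi_files"   # hypothetical path
files = [os.path.join(folder, f) for f in os.listdir(folder)]
sample = random.sample(files, min(100, len(files)))

ratios = []
for path in sample:
    data = open(path, "rb").read()
    ratios.append(len(zlib.compress(data, 6)) / float(len(data)))

mean = sum(ratios) / len(ratios)
var = sum((r - mean) ** 2 for r in ratios) / (len(ratios) - 1)
stderr = math.sqrt(var) / math.sqrt(len(ratios))
print("mean ratio %.3f, standard error %.3f" % (mean, stderr))
# crude interval: multiply the total input size by (mean +/- 2 * stderr)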

Keith Wright
Friday, March 26, 2004

Check the options of the Winzip command line add-in available in WinZip 8.1 or later. It may be possible to do a dry-run zip with -v or -@list. The Wzzip add-in has to be downloaded separately. Free to licensed users.

It's not going to save you any time though, since it has to go through all the same motions as doing the zip anyway.

There's also source code available for gzip (GNU zip). You could examine it to your heart's content and even alter it to suit your needs, i.e. do all the calculations but not write the compressed data. Again, this won't save you much time to speak of. And the LZ compression method used by gzip may not be the same as the one used by WinZip.
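
For what it's worth, you can get the "do the work but don't write the output" effect without touching the source at all - e.g. in Python, deflate every file with zlib and just count the output bytes. A sketch (the path is made up, and WinZip will add on the order of a hundred bytes of header/directory overhead per entry on top of this):

import os, zlib

folder = r"C:\edi_files"   # hypothetical path
total_in = total_out = 0
for name in os.listdir(folder):
    path = os.path.join(folder, name)
    co = zlib.compressobj(6)          # deflate, default-ish level
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            total_out += len(co.compress(chunk))   # count, don't store
    total_out += len(co.flush())
    total_in += os.path.getsize(path)

print("deflate estimate: %d bytes in, about %d bytes out" % (total_in, total_out))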

old_timer
Friday, March 26, 2004

Does .zip reuse tokens between files in the same archive? If so, the compression is going to be massive, since you're going to have tons of duplication.

As a wild guess, I'd say 75% compression, maybe more.

Philo

Philo
Friday, March 26, 2004

"Does .zip reuse tokens between files in the same archive?"

I think that depends on the compression algorithm, and what you describe is called "solid" compression, as implemented by RAR or JAR. WinZip has something similar in its latest versions but I haven't tested it extensively. The regular zip algorithm does not do this.

Chris Nahr
Friday, March 26, 2004

No, Zip doesn't reuse tokens between files. As a matter of fact, you can compress each file with a separate compression scheme. Zip files are really nothing more than archives.

Anon-y-mous Cow-ard
Friday, March 26, 2004

Some impressions on this; correct me if I'm wrong...

I'd argue, first of all, that for 800 MB your bottleneck is NOT the compression algorithm itself, but rather disk I/O. Having said that, even reading everything in and running the compression algorithm, but sending the output to /dev/null (or NUL in Win32, I think), would be fairly fast. This, IIRC, is what WinRAR does when you ask it to "estimate" compression.

That being your baseline, I'd sort the files by type. In Win32, that would probably (naively) map to extensions. If you're dealing with 100% text files, for example, then you should be able to make a good "guess" based on a relatively small sample (say 10-50 MB).

For every other sort of file, it depends. For example, normal Zip compression doesn't do ANY good with compressed files, so expect almost no gains from image formats (gif, jpg, png) and compressed audio/video (mp3, ogg, aac).

Everything that is neither pure, compressible text (which can often be brought down to about 10-15% of its original size) nor already-compressed binary data falls somewhere in the middle :)

What I'm trying to say is: look at the frequency spread of file types and extrapolate by compressing a small sample of each type (you can certainly skip the incompressible types in that test).
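
In code, that idea looks roughly like this (Python sketch; the path and the skip-list of already-compressed extensions are just examples):

import os, random, zlib
from collections import defaultdict

folder = r"C:\mixed_files"   # hypothetical path
by_ext = defaultdict(list)
for name in os.listdir(folder):
    ext = os.path.splitext(name)[1].lower()
    by_ext[ext].append(os.path.join(folder, name))

ALREADY_COMPRESSED = {".zip", ".gz", ".jpg", ".gif", ".png", ".mp3"}
estimate = 0.0
for ext, paths in by_ext.items():
    total = sum(os.path.getsize(p) for p in paths)
    if ext in ALREADY_COMPRESSED:
        estimate += total              # assume essentially no gain
        continue
    sample = random.sample(paths, min(20, len(paths)))
    s_in = sum(os.path.getsize(p) for p in sample)
    s_out = sum(len(zlib.compress(open(p, "rb").read(), 6)) for p in sample)
    estimate += total * s_out / float(s_in)

print("estimated compressed size: about %d bytes" % int(estimate))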

deja vu
Friday, March 26, 2004

Correct.  The rationale is, first of all, that there may be no symbol-frequency correlation between different files in the archive, and second, that a client may want to extract a subset of the files in the archive (or add files to it) without requiring the zip implementation to read through the entire archive.

In circumstances where these assumptions are incorrect, like JAR files, "solid" compression can give huge gains.

schmoe
Friday, March 26, 2004

Bah. I noticed that they're all text :) Blame my lack of sleep.

deja vu
Friday, March 26, 2004

Solid 7-Zip and/or tar + bzip2 will probably give you significantly better compression than WinZip can - and as a plus, they're both free.

The best way to evaluate is to just compress. 800 MB is not that much.

Ori Berger
Friday, March 26, 2004

Just for the edification of the audience, the magic token "EDI" means something very, very important in this context. It means that there will be significant repeating tokens between files.

EDI looks like this:
ISA*00*          *00*          *ZZ*180135355      *ZZ*610304941      *030719*0111*U*00303*000056880*0*P*>~
GS*IN*180135355*610304941*20030719*0111*631941739*X*004010~
ST*810*56880~
BIG*20030626*21834*20030618*0335990***PR~
NTE*CMT*DOCID 56880~
CUR*BY*USD~
REF*BM*S71588~
REF*PK*001~
REF*BM*S71588~
N1*BY*CATCO*92*410681545~
N3*2785 LONG LAKE ROAD~
N4*ST PAUL*MN*55113*USA~
N1*SU*GUNITE CORPORATION*ZZ*180135355~
N3*302 PEOPLES AVE~
N4*ROCKFORD*IL*61104*USA~
N1*BT*CATCO*92*410681545~
N3*12525 DUPONT AVE SOUTH~
N4*BURNSVILLE*MN*55337*USA~
N1*RE*GUNITE CORPORATION*ZZ*180135355~
N3*PO BOX 97519~
N4*CHICAGO*IL*60678*USA~
N1*ST*CATCO*92*03~
N3*12525 DUPONT AVE SOUTH~
N4*BURNSVILLE*MN*55337*USA~
N1*SF*GUNITE CORPORATION*ZZ*180135355~
N3*302 PEOPLES AVE~
N4*ROCKFORD*IL*61104*USA~
ITD**3*1*20030726*30*20030726*30~
DTM*011*20030626~
FOB*PP~
IT1*001*30*EA*4750**BP*AS1140*VP*AS-1140*UP*787459043498~
PID*F****ASA W CL CLEVIS ASSY~
IT1*002*8*EA*6194**BP*2997D*VP*2997-D*UP*787459001672~
PID*F****MACHINED BRAKE DRUM~
[snip]

*Immense* repetition, both within a file and between files.

Philo

Philo
Friday, March 26, 2004

Just compress them. It takes less time than reading the answers to this thread.

Stephen Jones
Saturday, March 27, 2004

Here's something that irritates me:

You have one file. The file contains just one character, "0". Its file size is exactly one byte. If you compress this file, it will be 115 bytes (because of zip headers and compression overhead).

So far, so good.

Now, make 1000 copies of this file. All the files will have identical data. It's obvious, from a human user's perspective, that this data is extremely compressible. But most of today's compression tools don't do cross-document compression. So compressing my 1000 one-byte files will produce a 118,000-byte archive. That's just dumb.
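
You can see the overhead for yourself with a few lines of Python (exact sizes will vary a little by library version, but the shape of the result won't):

import gzip, io, tarfile, zipfile

names = ["file%04d.txt" % i for i in range(1000)]
payload = b"0"                      # 1000 identical one-byte "files"

# per-entry compression, zip-style: the overhead repeats 1000 times
zbuf = io.BytesIO()
with zipfile.ZipFile(zbuf, "w", zipfile.ZIP_DEFLATED) as zf:
    for n in names:
        zf.writestr(n, payload)

# "solid" style: concatenate first (tar), then compress once
tbuf = io.BytesIO()
with tarfile.open(fileobj=tbuf, mode="w") as tf:
    for n in names:
        info = tarfile.TarInfo(n)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
solid = gzip.compress(tbuf.getvalue())

print("zip, per entry:  %d bytes" % len(zbuf.getvalue()))
print("tar+gzip, solid: %d bytes" % len(solid))

The per-entry archive lands in that same ~100 KB ballpark; the solid one comes out at a small fraction of that.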

Benji Smith
Monday, March 29, 2004

Benji, if you use your scheme, then to open any one of the 1,000+ files you will have to open ALL of them. You cross-index the files, so you can't know what's in any one file without opening all of them.

That's what "solid archive" means - all the files are treated as one. It is possible; as mentioned before, RAR does it. The tradeoffs are as mentioned, but they may be worth it for 1,000 small files where compressed size is more important than compression/decompression speed.

www.MarkTAW.com
Monday, March 29, 2004

No, you wouldn't have to open all the files. With the appropriate compression algorithm (i.e., not a moving-window type of algorithm), you'd just have to open the cross-file library of common patterns.
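
For the curious, zlib already supports exactly that idea via preset dictionaries. This isn't how .zip works, and the dictionary below is just a toy built from the EDI sample earlier in the thread, but it shows that a shared library of common patterns plus one file's compressed bytes is enough to get that one file back without touching any other file:

import zlib

# a shared "library of common patterns" - here, an address block that
# repeats across the EDI files quoted above (toy example)
shared = b"N1*BT*CATCO*92*410681545~N3*12525 DUPONT AVE SOUTH~N4*BURNSVILLE*MN*55337*USA~"

def pack(data):
    # raw deflate with a preset dictionary
    co = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=shared)
    return co.compress(data) + co.flush()

def unpack(blob):
    dec = zlib.decompressobj(-15, zdict=shared)
    return dec.decompress(blob) + dec.flush()

msg = b"N1*ST*CATCO*92*03~N3*12525 DUPONT AVE SOUTH~N4*BURNSVILLE*MN*55337*USA~"
blob = pack(msg)
assert unpack(blob) == msg
print("%d bytes -> %d bytes with the shared dictionary" % (len(msg), len(blob)))

(zdict needs Python 3.3+; without the dictionary, a message this short barely shrinks at all.)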

Benji Smith
Wednesday, March 31, 2004
