Fog Creek Software
Discussion Board




Large amount of images in a teeny-weeny directory

Ok, I've searched around on Google and couldn't find anything. Perhaps my terminology is not correct? In any case, I shall ask the wisdom of this forum.

My problem: We have a website running PHP, Apache, and MySQL on Linux. We have thousands of products with multiple images for each product, with the product->image name mapping stored in the database. All the images are in a single directory, which seems to greatly hurt performance.

We were thinking of using a hashing function to map a product's image name to a hash-named directory (kind of how Microsoft IE does its caching?) and store the image in there. This way, we can spread the images over multiple directories.

Has anyone had any experience with this kind of setup?  Anything to look out for? Or does anyone know of a better solution?

Thanks.

cyang
Wednesday, November 12, 2003

You may use the first 3 or 4 characters of the filename's MD5 as a directory name. Pretty much guarantees an even spread.
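
A quick way to check the "even spread" claim is to hash a batch of names and count how many land in each bucket. This is only a sketch: the names img1.jpg, img2.jpg, ... and the helper bucket_counts are made up for illustration, and it buckets on a single hex digit (16 buckets) to keep the output short.

```shell
#!/bin/sh
# bucket_counts N: hash the hypothetical names img1.jpg .. imgN.jpg and
# print one "count bucket" line per first hex digit of the MD5.
bucket_counts() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        # printf '%s' avoids a trailing newline, so only the name is hashed
        printf '%s' "img$i.jpg" | md5sum | cut -c1
        i=$((i + 1))
    done | sort | uniq -c
}

bucket_counts 200
```

If the counts come out roughly equal, the prefix is spreading the names well enough for this purpose.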

Klodd the Insensitive
Wednesday, November 12, 2003

Use this to reshuffle the files

------------------------------------------------------------
#!/bin/sh
for imgfile in *
do
        # first 3 hex digits of the MD5 of the file name
        subdir=`printf '%s' "$imgfile"|md5sum|cut -c1-3`
        mv "$imgfile" "$subdir"
done
------------------------------------------------------------

And then, in PHP, use the md5() function and extract the first 3 chars to locate a file's directory based on its name.

If you want 65536 different directories instead of 4096, use 4 instead of 3. Or 2 for 256 directories.
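
To make the prefix length a parameter, something like the following might do. It is a sketch, and md5_subdir and dir_count are hypothetical helper names, not part of the scripts above. Whatever computes the directory on the PHP side has to hash exactly the same bytes, which is why the name is fed to md5sum without a trailing newline.

```shell
#!/bin/sh
# md5_subdir NAME [CHARS]: print the first CHARS hex digits (default 3)
# of the MD5 of NAME. Hypothetical helper for illustration.
md5_subdir() {
    printf '%s' "$1" | md5sum | cut -c1-"${2:-3}"
}

# dir_count CHARS: number of possible directories, i.e. 16^CHARS.
dir_count() {
    count=1
    i=0
    while [ "$i" -lt "$1" ]; do
        count=$((count * 16))
        i=$((i + 1))
    done
    printf '%s\n' "$count"
}
```

For example, `md5_subdir smiley.jpg 4` gives a 4-character directory name, and `dir_count 4` prints 65536.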

Klodd the Insensitive
Wednesday, November 12, 2003

Another easy way to do this is to store filename.jpg in

f/i/l/e/n/a/m/e/filename.jpg

It doesn't spread evenly, but it's very easy to locate the file you want.
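
A rough sketch of building that path in plain sh, assuming the directories come from the part of the name before the last dot. The function name char_path is made up, and names with multi-byte characters may split differently depending on the locale.

```shell
#!/bin/sh
# char_path NAME: print the nested path for NAME, one directory per
# character of the stem (the part before the last dot).
char_path() {
    name=$1
    stem=${name%.*}               # "filename.jpg" -> "filename"
    path=
    while [ -n "$stem" ]; do
        rest=${stem#?}            # stem without its first character
        first=${stem%"$rest"}     # the first character on its own
        path="$path$first/"
        stem=$rest
    done
    printf '%s\n' "$path$name"
}

char_path filename.jpg   # prints f/i/l/e/n/a/m/e/filename.jpg
```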

Leonardo Herrera
Wednesday, November 12, 2003

Then create a script called md5loc thusly:

#!/bin/sh
subdir=`printf '%s' "$1"|md5sum|cut -c1-3`
echo "$subdir/$1"

Then look at the file by, say,

ls -l `md5loc smiley.jpg`

Klodd the Insensitive
Wednesday, November 12, 2003

In my original shuffling script, I forgot to create the new directory:

------------------------------------------------------------
#!/bin/sh
for imgfile in *
do
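        # skip anything that is not a regular file (e.g. existing directories)
        [ -f "$imgfile" ] || continue
        # first 3 hex digits of the MD5 of the file name
        subdir=`printf '%s' "$imgfile"|md5sum|cut -c1-3`
        mkdir -p "$subdir"
        mv "$imgfile" "$subdir"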
done
------------------------------------------------------------

Klodd the Insensitive
Wednesday, November 12, 2003

Thanks for the suggestions.  Wow, this forum sure is responsive.

I'm thinking of using Leonardo's suggestion, even though it's not as mathematically elegant as Klodd's, simply because it will let us manipulate the images manually without having to apply the hash before drilling down through the directories.

cyang
Wednesday, November 12, 2003

You could still only use the first 3 to 4 characters if you wanted to limit the annoyingness of having to go into a million directories per file.  So it would be:

/f/i/l/e/filename

etc.
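
Capping the depth could look roughly like this. It is a sketch only: prefix_path and its optional depth argument are hypothetical, and the path it prints is relative to wherever the images live.

```shell
#!/bin/sh
# prefix_path NAME [DEPTH]: like one-directory-per-character, but stop
# after DEPTH characters (default 4). Short names just get fewer levels.
prefix_path() {
    name=$1
    depth=${2:-4}
    stem=${name%.*}               # part of the name before the last dot
    path=
    i=0
    while [ "$i" -lt "$depth" ] && [ -n "$stem" ]; do
        rest=${stem#?}            # stem without its first character
        first=${stem%"$rest"}     # the first character on its own
        path="$path$first/"
        stem=$rest
        i=$((i + 1))
    done
    printf '%s\n' "$path$name"
}

prefix_path filename.jpg 4   # prints f/i/l/e/filename.jpg
```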

Master of the Obvious
Wednesday, November 12, 2003

This time I really am confused.

I can see how spreading files across drives might help performance, but how does spreading them across _directories_ help?

HeWhoMustBeConfused
Wednesday, November 12, 2003

> how does spreading them across _directories_ help?

Somebody correct me if I'm wrong, but I think it's due to a "Shlemiel the painter's algorithm" (see http://www.joelonsoftware.com/articles/fog0000000319.html).
Basically some layer somewhere, when opening a file, goes through the entries in a directory one by one until it finds the file with the requested filename. Put another way, there is no index on the filename in the structure containing all the files for a directory.

This makes opening a file with a specific name in a directory containing thousands of files extremely slow.
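
As a toy model of that linear scan (not how any particular filesystem is actually implemented), here is a lookup that walks a list of names and reports how many entries it had to compare. The function linear_lookup is invented for illustration.

```shell
#!/bin/sh
# linear_lookup TARGET ENTRY...: compare TARGET against each ENTRY in
# order and print how many comparisons it took to find it.
# Returns non-zero (and prints nothing) if TARGET is not in the list.
linear_lookup() {
    target=$1
    shift
    steps=0
    for entry in "$@"; do
        steps=$((steps + 1))
        if [ "$entry" = "$target" ]; then
            printf '%s\n' "$steps"
            return 0
        fi
    done
    return 1
}

linear_lookup c.jpg a.jpg b.jpg c.jpg   # prints 3
```

The cost grows with the number of entries ahead of the target, so splitting one directory into many smaller ones shortens every scan.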

Yves
Thursday, November 13, 2003

Some filesystems just don't handle big directories well. I've tried several kinds of directories, and opening or deleting a file is just sloooow (throw a couple million files in a dir and try opening "the first file in that directory," whatever that means).

Leonardo Herrera
Thursday, November 13, 2003

What's the file system?

Stephen Jones
Sunday, November 16, 2003
