Fog Creek Software
Discussion Board




Unique file ids for Windows

I need to generate a unique 64-bit integer id for a Windows file. The id needs to stay the same over time, unless the file is deleted and another with the same name is created. It needs to work under both NTFS & FAT. In Unix I could do it with a combination of dev and inode, but with Windows I'm stuck. The best suggestion I've seen is to use a hash of the full path name, and possibly the creation time as well, but that's not absolutely guaranteed to be unique. Does anyone have any bright ideas? Thanks.
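For reference, the Unix dev+inode scheme mentioned above can be sketched in Python; the 16/48 bit split between device and inode below is an arbitrary illustrative choice, not a standard encoding:

```python
import os

def unix_file_id(path):
    # dev + inode together identify a file uniquely on one Unix machine;
    # pack them into a single 64-bit integer (the bit split is arbitrary)
    st = os.stat(path)
    return ((st.st_dev & 0xFFFF) << 48) | (st.st_ino & 0xFFFFFFFFFFFF)
```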

ajs
Monday, May 05, 2003

The algorithm that immediately comes to my mind is "enumerate the entire disk.  Sort the resulting list by creation time.  The id number of a file is equal to its ordinal position on the list.  Every time a new file is created, add it to the end of the list." 

However, I know so little about the problem it is hard to say whether that's a reasonable algorithm or not.  It certainly guarantees uniqueness.

Can you describe the reason you need a unique 64 bit file id in a bit more detail?

Eric

Eric Lippert
Tuesday, May 06, 2003

Eric

I'm pottering around with an NT program to serve files to a machine running Plan 9. As part of the 9p2000 protocol used between client and server, the server is required to supply a 13-byte data structure called a qid to the client - the qid represents the server's unique id for the file, and comprises a 1 byte mode, an 8-byte path (the bit I'm having trouble with) and a 4-byte version. I don't think enumerating all the files would be feasible for me - I want to serve up network devices containing large amounts of files, and possibly being updated by other people.
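For concreteness, the 13-byte qid goes on the wire in 9P2000 as type[1] version[4] path[8], little-endian (the 1-byte field is usually called the type rather than the mode). A minimal packing sketch:

```python
import struct

def pack_qid(qtype, version, path):
    # 9P2000 qid wire format: type[1] version[4] path[8], little-endian;
    # '<' disables padding, so the result is exactly 13 bytes
    return struct.pack('<BIQ', qtype, version, path)
```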

ajs
Tuesday, May 06, 2003

What constitutes a unique file?  If the file content is changed, is it considered the same file?  If the file name is changed, is it considered the same file? If the file is moved to a different directory, is it considered the same file?

I'm not too familiar with Unix but I'm guessing dev and inode are specific to a given machine?  Why can't you simulate this by assigning a unique ID to a file when it first enters your system?  You would save this unique ID and the necessary information to identify the file (for example, the full path) and check for previously added files at each use.

Somebody
Tuesday, May 06, 2003

Doesn't the date the file is created stay with it? That's fairly unique. Combine it with something like a checksum and you should have a fairly foolproof way of generating a unique ID.

www.marktaw.com
Tuesday, May 06, 2003

Uh... I just realized that that's basically what you said wasn't foolproof. What other data can you capture?

I believe Real Player was accused of sending some data back to the mothership that was some sort of unique identifier for that installation of windows. The amount of free space on the disk at the moment of creation?

Something's telling me you should look into how PGP generates its encryption keys. Each of those is unique.

www.marktaw.com
Tuesday, May 06, 2003

Would MD5 work?  MD5 generates a 128-bit checksum but it may provide a starting point.
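MD5's 128-bit digest would have to be reduced to 64 bits somehow; one sketch folds the two halves together (plain truncation to the first 8 bytes would do just as well, and neither makes the result guaranteed-unique):

```python
import hashlib

def md5_64(data):
    # fold the 128-bit MD5 digest down to 64 bits by XOR-ing its halves
    d = hashlib.md5(data).digest()
    return int.from_bytes(d[:8], 'big') ^ int.from_bytes(d[8:], 'big')
```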

Matt Foley
Tuesday, May 06, 2003

Thanks for the replies so far. Actually, I imagine it's theoretically impossible to derive a unique 64-bit identifier for a file from information, such as path name plus creation date, which may be longer than 64 bits, and I was rather hoping that someone would come up with a way to derive a unique file identifier from one of Microsoft's many APIs du hundred nanoseconds - eg just call uebiUNSIGNEDEIGHTBYTEINTEGER GetTheUniqueFileIdentifierFromTheFileNameConcatenatedWithTheNameOfYourSecondCousinTwiceRemovedAndBobsYourUncle(LPGazillionParameters* ListOfAtMostAGazillionParametersOrPerhapsMoreThanAGazillionSoHardToTellTheseDaysIsntItQuestionMark)

ajs
Tuesday, May 06, 2003

What am I missing?

I think I'm with 'somebody'.  Why isn't a plain counter (sequence of first entry into system - which could just as well also serve as your local filename) adequate?

John Aitken
Tuesday, May 06, 2003

NTFS5 gives each file/directory a unique GUID, but that is 128 bits long, and is used by Distributed Link Tracking. As far as I know NTFS has always had extended attributes, so you could implement your own GUID scheme in only 64 bits and add it to all the files as well.
I do not know about FAT though. I do not think it has such a scheme, but OS/2 had an implementation for extended attributes on a FAT volume (storing them in a separate system file, I believe).

Just me (Sir to you)
Tuesday, May 06, 2003

Windows can give a unique file identifier using the function GetFileInformationByHandle() (See http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/getfileinformationbyhandle.asp)

Example: (hFile is a handle to a file)

BY_HANDLE_FILE_INFORMATION FileInfo;
unsigned __int64 inode;
DWORD volume;

GetFileInformationByHandle(hFile, &FileInfo);
/* nFileIndexHigh is a 32-bit DWORD, so it must be widened to 64 bits
   before the shift - without the cast, (nFileIndexHigh << 32) is wrong */
inode = ((unsigned __int64)FileInfo.nFileIndexHigh << 32) | FileInfo.nFileIndexLow;
volume = FileInfo.dwVolumeSerialNumber;

According to Microsoft: "The identifier (low and high parts) and the volume serial number uniquely identify a file on a single computer. To determine whether two open handles represent the same file, combine this identifier and the volume serial number for each file and compare them."

Moez
Tuesday, May 06, 2003

I would just keep a list of file paths, adding new files at the end when necessary, and using the index of a file as the id. This means that if you delete a file and then recreate it, the same id is used, but that doesn't matter for your purposes, I think. You just need to map path names to unique ids. The drawback is that you have to store the list in a file so it is persistent.

Frederik Slijkerman
Tuesday, May 06, 2003

Thinking about it more, since you are programming the gateway and nobody except the gateway would have any use for these 8-byte ids, just keep a persisted fullPath -> 8-byte GUID cached dictionary around in your application. Generate fresh 8-byte GUIDs on the fly for newly discovered files. GC old entries off-line or remove them at directory listing time.
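A sketch of that persisted dictionary, assuming a JSON file for persistence (the file name and format are illustrative; a production version would keep an explicit high-water counter so ids are never reused after old entries are GC'd):

```python
import json, os

class QidMap:
    # persisted fullPath -> 64-bit id dictionary; ids come from a
    # counter, so they are unique by construction
    def __init__(self, store='qidmap.json'):
        self.store = store
        self.map = {}
        if os.path.exists(store):
            with open(store) as f:
                self.map = json.load(f)

    def qid_for(self, path):
        if path not in self.map:
            self.map[path] = len(self.map) + 1  # naive counter; see note above
            with open(self.store, 'w') as f:    # persist each new assignment
                json.dump(self.map, f)
        return self.map[path]
```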

Just me (Sir to you)
Tuesday, May 06, 2003


How about an "RBA" - relative byte address... the offset in bytes from the start of the disk to where the file physically starts?

Joe AA
Tuesday, May 06, 2003

Wouldn't that be ruined by defragmentation?

Just me (Sir to you)
Tuesday, May 06, 2003

Yes, it would be ruined by defrag, including by NT's ongoing auto-defrag. Besides, a logical disk need not correspond to a physical disk, and NT doesn't easily let you get so close to the hardware anyway. I still don't see what I'm missing. Presumably the entire virtual path, including the 'friendly filename', is maintained on the Plan 9 system, which communicates with your server entirely by means of the corresponding qid, right?

Doesn't this make your local (server side) filenames and directory structure entirely arbitrary - so long as you can locate a file again by the qid you returned at the time of the first save?  So, I ask again, what's wrong with a simple counter maintained and incremented by the server (hell, or elsewhere for that matter)?

Must be losin' it.

John Aitken
Tuesday, May 06, 2003

A counter can be a way of implementing the GUID part, but if the server is multithreaded you would need to synchronize counter access.
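The synchronization is only a few lines; a sketch of a lock-protected sequence:

```python
import threading

class IdCounter:
    # lock-protected 64-bit sequence for a multithreaded server
    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            n = self._next
            self._next += 1
            return n
```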

Just me (Sir to you)
Tuesday, May 06, 2003

what's wrong with the windows id specified above?

you can also use the creation time; on NTFS it's a FILETIME structure, which has 100-nanosecond granularity (on FAT, the last-write time has only 2-second granularity).

but since you haven't specified what determines uniqueness, a hash of the full filepath may be sufficient too.

mb
Tuesday, May 06, 2003

One problem with using the GetFileInformationByHandle method (mentioned a few entries earlier) is that the nFileIndex* values returned by the method are only valid (and unchanging) while the file is opened, and different values may be obtained the next time that the file is opened. Also, volume serial number values are not guaranteed to be unique - some methods of formatting disks do not write a serial number, and cloned drives may have identical serial numbers.

To the original poster: The responses to your similar question in the "comp.os.plan9" newsgroup seem to suggest that using a hash of the full path plus the file creation time is a reasonable approach (with only a very small chance of 2 different files having the same hash value).
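That hash-of-path-plus-creation-time approach might look like the following sketch (os.path.getctime returns creation time on Windows; SHA-1 and the 8-byte truncation are arbitrary choices here):

```python
import hashlib, os, struct

def path_ctime_id(path):
    # 64-bit id from full path + creation time; not guaranteed unique,
    # but collisions are very unlikely for realistic file counts
    ctime = os.path.getctime(path)
    data = os.path.abspath(path).encode() + struct.pack('<d', ctime)
    return int.from_bytes(hashlib.sha1(data).digest()[:8], 'big')
```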

Philip Dickerson
Tuesday, May 06, 2003

this is overkill, but how about creating a database table with fields: file_id (p-key), path, file_name, created_date, modified_date, md5_hash.  Have a nightly process check the filesystem for updates and update the table as necessary.  Then use the p-key in the database as the unique file identifier.

Steve H
Tuesday, May 06, 2003

Thanks for the replies. Philip, I agree that hashing is a reasonable solution. It just goes against the grain to use something which isn't absolutely guaranteed to be unique, when key collisions, however unlikely, can cause problems and I've got no means of resolving them.

ajs
Tuesday, May 06, 2003

I would actually work out the odds of a collision if I were you.  Suppose n is the number of files and m is the number of unique hashes, and n is small compared to m.

I seem to recall that an approximation for the odds of collision within the first n is n^2/m.

If that's correct then the odds of a hash collision in the first million files is (2^20)^2 / 2^64 = 1 in 2^24, or about 1 in 16 million, which is just slightly worse than the odds of winning a standard choose-6-from-49 lottery.

Note that someone wins such a lottery pretty much every day -- if you distribute your code to sixteen million people who each have a million files, you're going to get collisions all the time.

(If I were you I'd double check that math rather than relying on my flaky memory!)
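A quick check of that arithmetic (the usual birthday approximation is n^2/(2m), so n^2/m overstates the probability by a factor of two, which doesn't change the conclusion):

```python
# birthday approximation: P(collision among n items in m slots) ~ n**2 / (2*m)
n = 10**6        # a million files
m = 2**64        # 64-bit hash space
p = n**2 / (2 * m)
one_in = 1 / p   # roughly 1 in 32 million
```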

Eric

Eric Lippert
Tuesday, May 06, 2003

Ah, now we see why GUIDs are 128 bits instead of 64...

Frederik Slijkerman
Tuesday, May 06, 2003

I might consider setting aside a fraction of the total bit size, say 16 bits or so, and using it to encode a creation date in the form of a year and day of the year. Don't pass this through the hash; just append it to a smaller 48-bit hash value you get from the file name or contents or whatever other method.

This gets you into a "Y2K" issue but you can easily encode, say, 256 unique year ids which should be enough for your app lifetime and you could even cycle back to 0 after 255. 

The idea here is that you're not likely to see millions of new files per day even if you're dealing with millions and millions of files over the long haul.
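A sketch of that split, with one packing assumption: an 8-bit year plus a 9-bit day-of-year would need 17 bits, so this version cycles the year every 128 years to fit the 16-bit field:

```python
import hashlib, os, time

def dated_hash_id(path):
    # 16-bit date field (7-bit year cycle + 9-bit day-of-year)
    # prepended to a 48-bit hash of the full path
    t = time.localtime(os.path.getctime(path))
    date16 = ((t.tm_year % 128) << 9) | t.tm_yday
    h48 = int.from_bytes(
        hashlib.sha1(os.path.abspath(path).encode()).digest()[:6], 'big')
    return (date16 << 48) | h48
```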

In the end, the exact method should depend upon how the app is likely to be used in terms of how often files will be added, how many users there could be, how many total files there might be, etc.  But hashing-and-praying is almost always a good enough approach, assuming a decently large hash size. 

George McBay
Tuesday, May 06, 2003

"what's wrong with the windows id specified above?"

It only exists on NTFS5 and higher, and is twice the length of the requested size. The original question asked for something that would work across all NT filesystems, including FAT.

Just me (Sir to you)
Wednesday, May 07, 2003

Hmm

The id needs to follow the file.  It cannot be computed by hashing the file or anything, since the contents of the file can change over time.

I assume it need only be machine-unique.  I'll assume it need not follow renamed/copied/moved files or whatnot.

My approach would be to have a sequence inside my plan9 fileserver, and assign to each file as they are accessed.

Where to store the uid depends on how it is used:

* If you want to take a file and look up its uid, then a good place to store that uid is with the file (attributes for NTFS, a hidden file in the directory for FAT?).

* If you want to take a uid and look up the file, you'll need some kind of central index.  Since the uid is issued in ordinal order, a simple 'index file' with the offset into a names-file for each ordinal might suffice.  Depends upon whether you expect to be deleting files or not, or how many files you are dealing with.  Maybe a little database ought to be embedded?  (Firebird?)

An interesting problem :-)

Nice
Wednesday, May 07, 2003

NTFS supports multiple streams per file. You could arbitrarily generate a qid (it could be a sequential number). You associate the qid with a file by storing it in an alternate stream attached to that file. As long as you don't generate the same qid twice it is guaranteed to be unique and you don't have to rely on a hash function.

However it's not easy to find a particular file knowing only its qid -- but that would have been a limitation of a hash function anyway (given the hash, it's hard to determine which file it came from).

I know, it won't work for FAT.

Nate Silva
Friday, May 09, 2003
