Fog Creek Software
Discussion Board

If I ran Gmail, I would ....

1. Compression - the benefits on text emails would be an enormous boost to cutting down actual usage. Report the uncompressed size to users.

2. Keep single copies of emails.
Emails, especially those sent out to multiple users, and those that contain multiple attachments, really only need to be saved once.  This too should cut down on actual disk use.

3. Keep single copies of individual files
I know people will use it for sharing binary files, so I will only keep one copy of these too. It might take a bit more effort to figure out that two files both called Warez.mp3 and both at 10MB are different, but my engineers will get there.
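To make suggestions 2 and 3 concrete, here's a toy sketch in Python of single-instance storage keyed by a content hash. The class and method names are made up for illustration, and this is in no way a claim about Google's actual design:

```python
import hashlib

class SingleInstanceStore:
    """Toy single-instance store: each unique blob (an email body or
    attachment) is kept once, keyed by its SHA-256 digest; mailboxes
    hold only references to those digests."""

    def __init__(self):
        self.blobs = {}        # digest -> bytes (stored once)
        self.mailboxes = {}    # user -> list of digests

    def deliver(self, user: str, blob: bytes) -> str:
        digest = hashlib.sha256(blob).hexdigest()
        # Two files both called Warez.mp3 and both 10MB get different
        # digests if their bytes differ; identical bytes are stored once.
        self.blobs.setdefault(digest, blob)
        self.mailboxes.setdefault(user, []).append(digest)
        return digest

    def read(self, user: str, digest: str) -> bytes:
        return self.blobs[digest]

store = SingleInstanceStore()
d1 = store.deliver("alice", b"quarterly report attached")
d2 = store.deliver("bob", b"quarterly report attached")
assert d1 == d2 and len(store.blobs) == 1  # one copy, two references
```

Deleting a message then just drops a reference; the blob itself can be reclaimed once nobody points at it.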

Any more suggestions out there? How would you run this show, from both a business and a technical perspective?

Tuesday, April 6, 2004

This article will help you understand what the Google folks are trying to achieve.

It's a must-read for all techies interested in developing a new computing platform using obscure technologies!


Tuesday, April 6, 2004

You could checksum the files to determine equivalence, in addition to comparing size, name, etc.
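In Python that check might look like this (just a sketch; SHA-1 via hashlib is a stand-in for whatever checksum you'd actually pick):

```python
import hashlib

def same_contents(a: bytes, b: bytes) -> bool:
    """Cheap test first: files of different sizes can't be equal.
    Then compare a cryptographic checksum of the actual bytes."""
    if len(a) != len(b):
        return False
    return hashlib.sha1(a).digest() == hashlib.sha1(b).digest()

# Two same-sized files called Warez.mp3 with different bytes come out different.
assert not same_contents(b"warez" * 1000, b"music" * 1000)
assert same_contents(b"warez" * 1000, b"warez" * 1000)
```

Name and size alone would give false positives; hashing the actual bytes settles it, at the cost of reading each file once.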

Tuesday, April 6, 2004

It might be interesting to look at how the economics of compression work out. Yes, you save disk space, but at potentially significant cost to CPU cycles. If superior searchability of your e-mail archive is a big goal (which I'd take to be a given in this case), then displaying search results could be an especially big CPU load, because you'd have to decompress potentially dozens of files, I think, just to show the hits in context. Does Google compress HTML today? I have no idea.

Anyway, I don't know nearly enough about Google's infrastructure to begin weighing the cost of disk space vs. CPU time, but I'm quite confident that their engineers modeled the scalability implications and cost model for this service before they announced anything publicly.

As for determining whether two equally-sized files are equivalent or not, I'd think that well-established hashing algorithms would do the job nicely.

John C.
Tuesday, April 6, 2004

John C, I agree with you. What you gain in compression, you lose in CPU cycles.

I wonder if Google has ever published how they compress their archives, caches, indexes, etc.

I really doubt that even at this time, they keep their data uncompressed.

And if compression is at the kernel level of their heavily customised Linux distro, I think it would be on the fly and imperceptible to most users.

I would argue that they would hit HD limits before they maxed out their CPUs.

Tuesday, April 6, 2004

As a historical note, this paper by Brin and Page, which appears to be from about 1997, describes an early form of Google:

It reads: "The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression."

Of course, 1997 was ages ago. It's amusing to note that "the total size of the repository is about 53GB", when it's undoubtedly in the petabyte range today.
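For what it's worth, the tradeoff in that quote is easy to poke at with Python's standard zlib and bz2 modules. This is only an illustration: the ratios depend entirely on the input, so a small repeated sample won't match the 3-to-1 / 4-to-1 figures from their repository.

```python
import bz2
import zlib

# A stand-in for repetitive HTML; real crawl data would behave differently.
sample = (b"<html><body><p>Search results for widgets, "
          b"page after page of similar markup.</p></body></html>") * 500

deflated = zlib.compress(sample)   # zlib: faster, the choice Brin and Page made
bzipped = bz2.compress(sample)     # bzip2: slower, often a tighter ratio

print(f"original {len(sample)} bytes, zlib {len(deflated)}, bzip2 {len(bzipped)}")

# Both round-trip losslessly.
assert zlib.decompress(deflated) == sample
assert bz2.decompress(bzipped) == sample
```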

John C.
Tuesday, April 6, 2004

"It might be interesting to look at how the economics of compression work out. Yes, you save disk space, but at potentially significant cost to CPU cycles."

As the link from JD points out, Google has both a lot of disk space and a lot of CPUs.  They most likely have the cycles to burn.

"then displaying search results could be an especially big CPU load, because you'd have to decompress potentially dozens of files, I think, just to show the hits in context."

Pulling the files off the disk is probably an order of magnitude slower than decompressing them in memory.  Unless the CPUs are very busy, this probably doesn't have much of an effect.  You also have to remember that by compressing the file, you have less data to retrieve off the disk.

Almost Anonymous
Tuesday, April 6, 2004

>What you gain in compression, you lose in CPU cycles.

Not always true... it depends on when you are called to do the decompression. Remember, to be able to see your email you need to log in first, so the decompression can occur at that point. There is no need to decompress everything, only the mail for users who are actively logged in. This also means that to save disk space you have to auto-logout users after some period of inactivity, and I am sure Google will present that as a security precaution.
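A sketch of that decompress-on-read idea in Python (pure speculation about Gmail's internals; zlib stands in for whatever codec they might actually use):

```python
import zlib

class Mailbox:
    """Mail is stored compressed; a message is only inflated when
    the owner actually opens it, e.g. after logging in."""

    def __init__(self):
        self._messages = []  # list of zlib-compressed bodies

    def deliver(self, body: str) -> None:
        self._messages.append(zlib.compress(body.encode()))

    def read(self, i: int) -> str:
        # CPU is spent only here, per message actually viewed;
        # idle users' mail stays compressed on disk.
        return zlib.decompress(self._messages[i]).decode()

box = Mailbox()
box.deliver("Hello from 2004!")
assert box.read(0) == "Hello from 2004!"
```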

But all that aside, I do not think that I, for one, would like to open an email account where email I delete is not really deleted and could be sold or correlated for profit without my knowledge.  And getting 1 GB of space for email without these privacy intrusions is pretty easy... see

I think Google has now become just the right sort of company to be part of Microsoft.  So much for the "Do no evil" pledge.

Code Monkey
Tuesday, April 6, 2004

"Google has 100,000 servers," it said in JD's link.

100,000 servers? I mean, sh*t.
It feels like yesterday that they said they had 1,000 servers and everyone was impressed.

(Sorry.. just had to say it.)

Eric Debois
Tuesday, April 6, 2004

Here's a paper that they published on their 'Google File System'.  It's highly optimised towards their use, but it makes interesting reading.

Tuesday, April 6, 2004

Code Monkey your post makes sense.

Even then, you only need to unzip the email *when read* because presumably that is when the ads would be displayed as well.

Tuesday, April 6, 2004

MR wrote:

>Even then, you only need to unzip the email *when read* because presumably that is when the ads would be displayed as well.

Well, doing a contextual analysis of which ads to show in real time as the email is unzipped would perhaps be too much even for Google to pull off, but it does make sense if Google is smart enough to extract keywords from your email and link them to the ads separately.

Also, I did not quite understand why they would insert ad text inside your email... that would plainly take up too much space, unless they are thinking of offering POP3 or IMAP access (IMAP would be better) and hence assume that users might access email without a browser.

If and when it happens, the only way I would use GMail without worrying about privacy issues is as a destination for newsletters I subscribe to.

Code Monkey
Tuesday, April 6, 2004

"2. Keep single copies of emails.
3. Keep single copies of individual files"

Microsoft Exchange does the same thing, though of course only within an organization.

Tuesday, April 6, 2004

I would run a Jabber server using the same user authentication database as GMail, and provide chat gateways to all the major messaging services (MSN, Yahoo, AIM).  This way yahoo and MSN will have more to think about than threatening google's core business - search.  Go get 'em, google!

Seun Osewa
Tuesday, April 6, 2004

Besides, I have a little bit of a rant on this topic on my weblog.  I forgot, however, to specifically mention:
- That Google doesn't really need to be a 'portal'.  They just need to clone the most popular stuff that portals provide: e-mail and instant messaging.
- In criticising Google's decision to provide 1GB of space from the first day, I didn't mention what I would have done: announce 1GB of space in the April Fool's e-mail; then, when everybody feels Gmail is a joke and the service is ready, announce the 'real' Gmail with 50MB-200MB of space, open to everybody (no vaporware).  Then double the provided space every 6 months until it's 1GB!
- Implement restrictions on message size and on what proportion of e-mail can be pictures, and _relax_ them over time instead of introducing them when people start using Gmail to replace online storage services (they need text to display their ads on!)


Seun Osewa
Tuesday, April 6, 2004

I updated the article to include these two paragraphs:
"Google is wrong to drop the Google Directory, based on the Open Directory Project, from the home page without offering a replacement. Google's 'similar pages' feature doesn't work as accurately as finding other links in the same Open Directory category. It has always been nice to be able to conduct a search and go straight to the Directory category that is most closely related to the search.
"In the same vein, Google is absolutely wrong to put Froogle on the home page. There's no such thing as "objective product search" (aka Froogle). Besides, Froogle is, at the time of this posting, still marked "BETA". Or what do you think?"

Seun Osewa
Wednesday, April 7, 2004

They probably realised that not too many people were using the Dmoz directory.

I would postulate that most people who maintain the Dmoz directory use Google anyway to find sites related to their branch. I know I would.

Wednesday, April 7, 2004

Could you give me an example of a search where "related items" returns something meaningful?  From my experience it just returns similar pages on the same site or... em... junk.  There are too many dimensions in which pages could be related, and directories are yet another way to sort information into categories.  How can they hope to eliminate the human element?  Sure, the Open Directory can be slow... but the contribution is significant.  I would rather get more people to know how to use it!  I really miss "related categories".

Seun Osewa
Wednesday, April 7, 2004

Code Monkey, you've either misinterpreted what you've read about gmail, or you haven't actually read anything about it except fourth- and fifth-hand board postings.

-The ads are NOT going into emails themselves; instead, they show up in the gmail interface, along with 'related links' that are not ads.
-Deleted email IS deleted. It's just that with 1GB, you don't need to delete nearly as many emails. And I fail to see what any of this has to do with privacy.

Wednesday, April 7, 2004

If I ran Gmail... everyone would get a address.  I have a sneaking suspicion they are going to launch with addresses, which would significantly slow the adoption rate.

Regarding storage, I don't believe the space savings could be justified against system performance, complexity and maintenance.  IMHO, gmail will store all mail, duplicate or not.

Google is much better off spending time on the spam problem itself rather than working on efficiently storing spam.  If they do a great job, I hope they provide a blacklist feed to ISPs and those that run their own mail servers.

Pete Jenkins
Friday, April 9, 2004
