Fog Creek Software
Discussion Board

Caching Theory

Ok, in my spare time, I've started writing an application that allows uploading and categorizing of images,  but with images able to exist in multiple categories.  I'm Using java/Tomcat, everything is pretty basic.  Well, I feel sorta wrong if I don't at least try to implement some caching.    So HOW exactly should I implement my caching?  I could do it EJB style, where I always do a "SELECT id FROM table WHERE ...", then for each ID in the resultset, call a load method on my image objects (which first would check the cache before pulling the individual record from the db).  Of course, this seems almost like a waste, unless I know for sure the majority of the records are already cached.  Is there  a right and wrong way when it comes to caching?  Am I wrong in thinking that the fact that i'm retrieving BLOBs should change my caching strategy?  Thanks in advance. 

Thursday, September 02, 2004

off the top of my head...

I'm no expert, but my understanding is that caching really boosts performance under a few circumstances:

*when multiple users are accessing the same set of relatively static objects.

*the app is pulling db records across a network link

*retrieving the records involves expensive SQL logic

*you have lots and lots of ram

My bet is that caching won't get you much.  Client-side performance is gonna be limited by their bandwidth, not the responsiveness of the DB.

Caching is just another optimization, and should be last on your list of things to worry about.  Get the app working and monitor how people are actually using it.  Then you can tweak it.

Thursday, September 02, 2004

Yet another perspective ...

A cache is what changes a high-volume memory storage from very_low_cost / low_performance to low_cost / almost_high_performance. The technique uses an associative memory (cache memory) which is low-volume / high-performance / high-cost. The ensemble makes a hierarchical memory.

For example: the cache memory on your HDD or your layers of caches holding CPU instructions (RAM is the high volume storage in this case)

In an application, your low-performance high-volume is the disk, while the RAM is the high-performance low-volume high-price cache memory. A map of key-to-data would normally suffice (things get complicated when dealing with changing data in distributed systems)

If the DB already caches this information, no need to cache it twice - performance would only decrease due to the added overhead. Make sure though, the DB is not caching some other type of object which has little or nothing to do with your caching needs - e.g. records vs. queries/cursors.

Thursday, September 02, 2004

And something else ... when the cache and new data needs to be cached, obviously, some of the cached data has to be dropped. Common algorithms to decide what data is uncached are:
- last recently used (LRU)
- least frequently used (LFU)
- ... whatever gives you best performance - ie the best successful (data in the cache) hits /  total hits.

Thursday, September 02, 2004

You might also look into creditte theory.  The basic idea is that 90% of the time 50% of your memory isn't being used.  So you put it into a banQ structure that you can draw on for those times when you need 110% of available cache.


Thursday, September 02, 2004

Measure your bottlenecks first, of course. Chances are, you're not looking to handle anywhere near the load that would require caching DB requests. Remember, you'll almost certainly run out of bandwidth long before your database starts to become a bottleneck. Consider: a P133 will saturate a T1 with static pages. I bet a 2ghz proc can do the same with dynamic code.

That said, I'll assume you want it to be optimized just because. Just like I would 8-}

Normally, I'd say don't bother caching database records, because most apps won't be pulling enough similar data out often enough to gain. However, I think it'd be worth it for you to try (and profile!) because you're using BLOBs, meaning each row you get from the database is going to be in the tens of kilobytes, instead of the hundreds of bytes, as would normally be expected.

As far as "the majority of records are already cached", well, look at your data: how big is your collection? how much ram do you have? and how is popularity distributed? I.e. do most of your requests return the same 20 images, or is any one image as likely to be desired as any?

Snotnose - banQ structure? My google search is turning up a lot of banks; do you have a pointer to a doc on it?

Mike Swieton
Thursday, September 02, 2004

The banQ structure is heavily fortified with encryption and such, and is used to hold your extra cache when you don't need it.....

Thursday, September 02, 2004

I agree with Mike, you should definitely look at your usage before deciding what you need to do Vince.  The chances are that either your load isn't sufficient for caching, and/or the requests for images aren't such that caching really helps.  You could of course use complex calculations to determine if its worth putting an image in the cache but chances are, its unnecessary.  If you have, for example a million images, and between the union of different users actual queried sets there is no overlap (and their sets very small at that, like 50 images each, and therefor not heavily viewed), caching won't solve much in your case.

If you find that a certain small set of images get loaded constantl (and when i say constantly i mean to the point where its silly for them to be retreived from the database over and over) then you may want to implement some sort of caching.

Thursday, September 02, 2004

Dang.  Years ago I read a spoof of cache memory, called credit memory.  I just googled for that spoof without much luck, but actually found some links the original poster may find useful.

Thursday, September 02, 2004

>> Consider: a P133 will saturate a T1 with static pages.

Can we retire this age old saying yet? Maybe upgrade it with a T2, T3, or some OC variant? I mean, who uses T1's anymore?

anon-y-mous cow-ard
Thursday, September 02, 2004

Snotnose:  Here's a link to the C-RAM joke.

(search for C-Ram)

Thursday, September 02, 2004

*  Recent Topics

*  Fog Creek Home