Fog Creek Software
Discussion Board




Caching "large" amounts of data.

Morning folks. I'm currently in the process of writing a web application/site for a travel agency so they can accept reservations online. There's data on about 50,000 hotels in their database, across 12,000 cities. My question is: am I better off caching all the data at application startup, or querying the database server on every request? I figure I could store all the city structures in a hashtable and have an array of hotel structures for each city. Since most people search by city & state, that's what I'd use to generate the hash. The alternative is to query the database on every request. I'm not convinced (yet!?) that the latter will be faster than the hashtable. The data itself will rarely change, so that isn't a concern. Does anyone have any past experience with a similar scenario? Otherwise, I plan on giving both options a try... Thanks in advance!
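
Something along these lines is what I have in mind (just a sketch; Hotel and the LoadAllHotels() data-access call are made-up names for illustration):

// requires using System.Collections;
static Hashtable hotelsByCity = new Hashtable();       // key: "city|state", value: ArrayList of Hotel

static void LoadHotelCache()                            // called once at application startup
{
    foreach (Hotel h in LoadAllHotels())                // hypothetical call returning all 50,000 rows
    {
        string key = (h.City + "|" + h.State).ToLower();
        ArrayList list = (ArrayList)hotelsByCity[key];
        if (list == null)
        {
            list = new ArrayList();
            hotelsByCity[key] = list;
        }
        list.Add(h);
    }
}

static ArrayList FindHotels(string city, string state)
{
    return (ArrayList)hotelsByCity[(city + "|" + state).ToLower()];
}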

GiorgioG
Thursday, May 20, 2004

I'd go with the database server.  It may not be faster but unless this is expedia I doubt it will matter.  Using the database will certainly be more maintainable.  With enough memory the data will end up cached in the server memory anyway.  Updates, which will happen, will be possible without having to reconcile an in-memory copy.

Doug
Thursday, May 20, 2004

If your data rarely changes and most people query by City/State, why bother caching/querying at all?

Instead, why not just generate the site, with a page per State/City? Then you can have a trivial navigation scheme to get to the correct page.
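
A one-off generation step could be something like this (a rough sketch; GetCities(), GetHotelsFor() and RenderCityPage() are hypothetical helpers over your data and templates, and outputDir is wherever the static site lives):

// requires using System.IO;
foreach (City c in GetCities())                         // hypothetical: one entry per city
{
    string html = RenderCityPage(c, GetHotelsFor(c));   // hypothetical template rendering
    string file = Path.Combine(outputDir, c.State + "_" + c.Name + ".html");
    using (StreamWriter w = new StreamWriter(file))
    {
        w.Write(html);                                  // plain static page, served as-is
    }
}

Re-run it whenever the hotel data changes, and the web server never touches the database at request time.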

Just a thought...

Steve Jones (UK)
Thursday, May 20, 2004

Where do your performance tests indicate the largest bottleneck?  Network lag or database access times?

If you cut the database access time from .5 second to .05 seconds, but your network lag is 5 seconds, no one is going to notice the improvement.

Caching can be added later (given a good design). Don't spend too much time on premature optimization.

Caffeinated
Thursday, May 20, 2004

That's an interesting thought, Steve. That would definitely minimize the number of dynamic pages on the site. But if at some point they want to add some dynamic/random links to the page (as an example), it could cause problems. Alternatively, I could generate 'include' HTML files for each city and load them up dynamically based on the city/state. I'm not sure how much speed we'd gain by doing it this way versus completely dynamic. As far as traffic/volume goes, the plan is to scale this up (or out) in the long run to handle 500-750 concurrent users.

GiorgioG
Thursday, May 20, 2004

Caffeinated,

I have experience using the database method (on a PHP site, not caching anything) while fixing up another client's site. It was slow, though granted, their database was poorly designed and the database server was running on the web server. The platform for this project is ASP.NET 1.1/Win2K3. Initially, while traffic is low, the database will reside on the web server as well; as volume increases it will be moved to a dedicated server. So network bottlenecks are not an issue (yet).

GiorgioG
Thursday, May 20, 2004

You probably shouldn't blow off Caffeinated's suggestions. We don't know the tools, requirements, or performance profile involved in your project, so it looks like a textbook case of premature optimization.

If you use good test cases, and believe your engineering and tools are reasonable, you can defer the decision. If not, then okay, run the benchmarks and see.

Anyway, that's just my opinion.

Tayssir John Gabbour
Thursday, May 20, 2004

> Caching can be added later (given a good design). Don't spend too much time on premature optimization.

This is really good advice from Caffeinated. So often I have seen people mucking about, putting tricks into their code to solve supposed application performance problems, when really they should have been worrying about developer performance.

Not to say that you can't consider and plan for potential performance problems, just don't waste time up front before you know you're going to have a problem. Put some time in your schedule to review it when the application works.

Tony Edgecombe
Thursday, May 20, 2004

As most people have said, query the DB directly. If your DB is not too big (which seems to be the case), put enough RAM on the server so the DB fits in memory.

Anonymouche
Thursday, May 20, 2004

I implemented a cache, with record-level locking, when I needed frequent (100s per second) fast reads and the more occasional write (where a write might lock the table). It *was* an effort to implement synchronization between the database and the RAM cache (and it was only possible at all because all DB updates were made by the process which implemented the RAM cache). Therefore, I would agree with Doug.

On the other hand, if the database is "never" updated, there's no doubt that it can be faster to read data from your own in-process RAM than to read it from an external database. The penalty is that you need to write/maintain extra code, because you need code to read from your cache *and* code to read from the database (in order to populate the cache).
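
The read path boils down to something like this (a sketch only; LoadHotelsFromDb() is a stand-in for whatever data-access call populates the cache):

// requires using System.Collections;
static Hashtable cache = new Hashtable();
static readonly object cacheLock = new object();

static ArrayList GetHotels(string cityKey)
{
    lock (cacheLock)
    {
        ArrayList hotels = (ArrayList)cache[cityKey];
        if (hotels == null)                        // miss: read from the database...
        {
            hotels = LoadHotelsFromDb(cityKey);    // hypothetical data-access call
            cache[cityKey] = hotels;               // ...and populate the cache for the next reader
        }
        return hotels;
    }
}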

Christopher Wells
Thursday, May 20, 2004

> if the database is "never" updated

... means "you can restart the process which implements cache-on-startup, after the database is updated".

Another disadvantage is that the process will take longer to start up, because you have to wait while it sucks all the data out of the database into its cache.

Christopher Wells
Thursday, May 20, 2004

Chris,

The database will be updated once a month at most. Downtime can be scheduled to reload the updated cache if necessary. Thanks to all! Keep any other thoughts coming ;-)

GiorgioG
Thursday, May 20, 2004

> the process will take longer to startup

... which can affect availability, i.e. the Mean Time To Recover after a failure.

Christopher Wells
Thursday, May 20, 2004

One way around long startup is a lower-priority process populating the cache, so it ideally happens while the user isn't doing anything.

This might even help with recovering from a network problem, at least for the data in the local caches. One can justify such a cache in terms of reliability, even forgetting performance. It should be written to persistent local storage.
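
On the poster's ASP.NET platform that could be as simple as a background thread at reduced priority (a sketch; PopulateCache() is a hypothetical routine that loads the data):

// requires using System.Threading;
Thread warmup = new Thread(new ThreadStart(PopulateCache));  // PopulateCache() is hypothetical
warmup.Priority = ThreadPriority.BelowNormal;                // stay out of the way of real requests
warmup.IsBackground = true;                                  // don't hold up process shutdown
warmup.Start();                                              // until it finishes, requests fall back to the DB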

Tayssir John Gabbour
Thursday, May 20, 2004

Not a DB expert, but why not cache the database queries and have an expiration date on that cached data?

So, something like this: web server --> middleware (cache + database abstraction) --> database server

The middleware keeps a check on the cache and perhaps generates static pages from often-requested dynamic pages. The web server and database are dumb.
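
On ASP.NET 1.1 the cache half of that middleware is largely built in; something like this (a sketch, where RunQuery() is a hypothetical data-access helper):

// requires using System, System.Data, System.Web and System.Web.Caching;
static DataSet GetCachedQuery(string sql)
{
    DataSet ds = (DataSet)HttpRuntime.Cache[sql];      // keyed on the query text
    if (ds == null)
    {
        ds = RunQuery(sql);                            // hypothetical: actually hit the database
        HttpRuntime.Cache.Insert(sql, ds, null,
            DateTime.Now.AddMinutes(30),               // the expiration date on the cached data
            Cache.NoSlidingExpiration);
    }
    return ds;
}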

For an example, look at how Slashdot is implemented: http://slashdot.org/faq/tech.shtml

blaZiT
Thursday, May 20, 2004

blaZiT,

I do something quite similar to that on some of my larger websites.  I have a caching layer which caches the result of any HTML generation I choose.  The cache can last forever or expire at a set interval.  Whenever something is changed a date is updated and that is compared with the cache creation date.  So, if the cache is old, it's ignored and overwritten. 

The cache for a particular item is populated when it's first displayed. If I make template changes, I just delete the entire cache and it's regenerated automatically as visitors visit the pages. Also, the cache is parameterized -- so I can have different cached results depending on user queries.

The result is that most of the site is rendered from static pages most of the time BUT is still kept fresh and up to date from the database.  Our site would be completely unusable without it.

I recently added an option to make the cached pages contain scripts which can be executed before they are displayed. This is used, for example, to display all the dates in the visitor's timezone and do other neat tricks.

I also have query caching which caches the results of common queries over a set interval (every 5 minutes is often good enough). The same rules apply: the cache is created on first use and ignored/overwritten when the interval expires. I don't use it very often because whenever possible I prefer to cache HTML content. However, it is used for internal processing.

All this caching wraps normal access methods.  I can disable all the caching (which I do for debugging purposes) and the site operates normally.  I avoided premature optimization -- I create the pages or queries as normal and if performance is a concern then I wrap them in the caching.
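
Stripped way down, the wrapper looks something like this (GetLastChanged() and RenderFragment() stand in for whatever produces the change date and the HTML on a given site):

// requires using System and System.Collections;
class CachedFragment
{
    public string Html;
    public DateTime Created;
}

static Hashtable htmlCache = new Hashtable();

static string GetFragment(string key)
{
    CachedFragment f = (CachedFragment)htmlCache[key];
    if (f == null || f.Created < GetLastChanged(key))   // missing or older than the last change: rebuild
    {
        f = new CachedFragment();
        f.Html = RenderFragment(key);                    // hypothetical HTML generation
        f.Created = DateTime.Now;
        htmlCache[key] = f;                              // overwrite the stale entry
    }
    return f.Html;
}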

Almost Anonymous
Thursday, May 20, 2004

Yeah, first of all, your database isn't very big. Assuming you have a decent machine, 50k rows should be a piece of cake to churn through. I'm not sure what platform you're using, but I'd recommend using a sort of "factory pattern". The nice thing about the factory pattern is you don't *have* to cache right away. The factory can just grab objects from the database as needed. If later on you decide to implement caching, you can store objects *as you retrieve them* in a hashtable, and every time you request an object or group of objects from the factory, check and see if any of them exist in the hashtable. If they do, just return the reference from the hashtable. Hope this helps.
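
Roughly like this (Hotel and LoadHotelFromDb() are made-up names for whatever your objects and data access look like):

// requires using System.Collections;
class HotelFactory
{
    private Hashtable cache = new Hashtable();

    public Hotel GetHotel(int hotelId)
    {
        Hotel h = (Hotel)cache[hotelId];     // have we already retrieved it?
        if (h == null)
        {
            h = LoadHotelFromDb(hotelId);    // hypothetical data-access call
            cache[hotelId] = h;              // store it as we retrieve it
        }
        return h;
    }
}

Callers never know (or care) whether the object came from the cache or the database.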

vince
Thursday, May 20, 2004

> Yeah, first of all, your database isn't very big. Assuming you have a decent machine, 50k rows should be a piece of cake to churn through.

Unless a search engine spider brings it to its knees by spawning a crapload of threads/requests ;-)

GiorgioG
Thursday, May 20, 2004

Do what Yahoo do and stick Squid in front of your dynamic site:

"For busy sites a reverse proxy like Squid can be used to boost performance dramatically. You can also use SquidGuard redirector to redirect different domain names to different apache instances or different machines altogether."

http://www.jimohalloran.com/archives/000377.html

Matthew Lock
Thursday, May 20, 2004
