Fog Creek Software
Discussion Board




How do web sites go so fast?

I'm a fairly senior-level guy who builds web-based ecommerce applications for a very large company.  However, the sites I build are not as fast as Yahoo or Google and I'm wondering how they do it.  I'm also pursuing a master's degree, so I thought this would be a good idea for a paper.

For example, I've built a reporting tool that fills a function similar to Yahoo's.  It's similar in look and feel, but different in that we don't have as much common data to share.  Our apps will show which products have shipped *for you*, which invoices are coming due *for you*, etc.  Yahoo has a slight advantage in that they can show non-personal stuff like a stock quote.  (Of course, which stocks to display is still personal.)  My tool takes 5-10 seconds to do its thing, mostly because of the DB queries.  Yahoo's seems almost instant.

Another example is Google.  If my customers asked me to pick out 100 records from a billion, I'd say the tool would time out.  Google does it in 0.015 seconds.

Does anybody know what the overall software architecture is like for some of these amazing sites?  Can you provide some URLs?  I'd like to compare them and write a paper for school.

Feeling a bit shy
Thursday, June 20, 2002


Look into Beowulf clustering and Occam's Razor.

Throw in one of my favorite Rand quotes: "Contradictions don't exist.  If you think you have a contradiction, check your assumptions ..."

In other words, check your architecture and your toolset.  I'm willing to bet Google goes "to the metal" (flat-file databases in assembler and C/C++) to get things fast.

Otherwise, let me think on it ...

Matt H.
Thursday, June 20, 2002

Any decent DBMS (SQL Server, DB2, Oracle) that has been tuned for your simple query should give you the result instantly.  I'd say there's something not right with your set-up.
Google etc. would not use such standard DBMS tech, since they do not need all the functionality (e.g. transactions).  A dedicated system can be optimised way beyond a general one, and a standard DBMS would probably eat too much into their margin anyway (just guessing, but if you need a 1,000-processor Oracle licence when you only need 3% of the functionality, it just might become cheaper to roll your own).

Just me (Sir to you)
Thursday, June 20, 2002

I doubt you have 10k+ linux servers clustered so comparing your application to Google might be a bit unfair.

Google's Chief Operations Engineer on their setup:
http://www.technetcast.com/tnc_play_stream.html?stream_id=420

I know that many large (and fast) websites use caching and load balancing to offset the time it takes to search.

Ian Stallings
Thursday, June 20, 2002


It sounds like you're making too many DB queries ... any request should be expressible as one big SELECT statement that you can plug into a stored procedure.  In other words, it should be blazingly fast.
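For instance, instead of querying invoices and then looping in code to fetch each related order (the classic N+1 pattern), one join gets it all in a single round trip.  A rough sketch, with invented table and column names:

select i.invoice_no, i.due_date, o.order_no, o.ship_date
from invoices i                -- invented names, for illustration only
inner join orders o
  on o.order_no = i.order_no
where i.customer_id = :cust_id
  and i.status = 'OPEN';

One statement, one round trip, and the optimizer gets to see the whole problem at once.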

Contradictions don't exist.  Check your architecture!

Access doesn't scale, even for a single user, past a certain number of records.  I'd call it the "time when INSERT gets ridiculous."


regards,

Matt H.
Thursday, June 20, 2002

Here's an interesting article on some of the difficulties of running a large e-commerce site and the strategies used to make it run quickly and effectively.  Don't be fooled by the Perl-centric nature; these are general issues, and while implementations differ, the motivations are the same.  (Not that being Perl-centric is a bad thing :-)

http://www.perl.com/pub/a/2001/10/17/etoys.html

Chris Winters
Thursday, June 20, 2002

Basically, they do two things: first, optimize the hell out of their queries to get maximum performance on one machine, and second, throw more iron at the problem until it goes away.

Undoubtedly Google and Yahoo have developed their own custom database software, as the commercial ones aren't optimized for text search, which is a bit of a black art anyway.  Reinventing this wheel is probably not the recommended path for the other 99% of applications, though.

In my experience, a poorly written SQL query can be 100 or more times slower than an optimized one.  I don't see any earthly reason why a system you've described needs to take 5-10 seconds for a *single* query.  Where's all the time going?  I assume you've done all the standard things like look at the execution plan for your queries, added indexes on the fields that need them, reordered joins, considered denormalizing your database, cached intermediate results, and so on?
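If not, in Oracle that check is quick.  A sketch only; the index, table, and column names are invented, and PLAN_TABLE must exist (e.g. created via utlxplan.sql):

create index inv_cust_idx on invoices (customer_id, status);

explain plan for
  select invoice_no, due_date
  from invoices
  where customer_id = :cust_id and status = 'OPEN';

-- crude way to read the plan; a full table scan on a big table
-- for a query like this is where your 5-10 seconds are going
select operation, options, object_name from plan_table;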

Alyosha`
Thursday, June 20, 2002

You're actually asking a big question, and it's not just a matter of any single issue.  It's also not restricted to web sites.

To take a specific example, searching for 100 items in 1B records would take even a low-end relational DB server much less than one second, as long as you set things up correctly.  Making thousands of these queries a second is simply a matter of parallel servers, each with the same data, or a tree of servers, each with a slice of the data.  Google uses thousands of processors to do this.

If your "tool" times out under these circumstances, you have a serious architectural issue.  If you've written your own storage/searching system, then look into basic searching/sorting algorithms, as you have a serious problem there.  If you're using an off-the-shelf DB server, then you're using it incorrectly.

James Montebello
Thursday, June 20, 2002

As many have mentioned, optimizing your queries is an important step. That includes putting DB indexes in the right places, writing well-structured queries, data caching, and making sure that you're not running too many queries on a page.

Optimizing DB connections is important too. If you're not using connection pools, or if your connection pooling doesn't scale well, you'll take a big performance hit on individual page views because creating new connections is expensive.

Beth Linker
Thursday, June 20, 2002

"I'm a fairly senior level guy who builds web-based ecommerce applications for a very large company."

"I'd like to compare them and write a paper for school."

So, the senior-level guy at a very large company is more concerned about writing a paper for school than fixing his broken applications so they don't take 15 seconds?  If you're doing e-commerce applications and they take 15 seconds, you should apologize to your employers and pick a new major for school.  I've written page-long queries that had to do complex joins on million+ rows, and it's never taken 15 seconds.  Maybe it did the first couple times, but then I thought about it and fixed it in a couple minutes.  People like you give computer science grads and students a bad name.

Vincent Marquez
Thursday, June 20, 2002


(There is no need to be harsh)

I used to have a very similar problem.  I was gathering statistics for a fairly small amount of data (1.8 million records), but these stats were not friendly: lots and lots of lookups, grouping, counts, distinctions, etc., and in the end, the most we were able to optimize didn't get below 20 seconds.  Add the fact that we received about 100,000 visitors during the day and it was simply awful.

What I did was more practical than scientific.  I set up a process to gather this data and pregenerated the reports.  Once a minute was enough, because the data was pretty static.  In the end, there was a complete system running in the background that did the same for a lot of the consolidated reports.
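If you're on Oracle, the whole thing can live inside the database with DBMS_JOB.  A sketch, with an invented summary table, assuming one-minute staleness is acceptable:

create or replace procedure refresh_report_cache as
begin
  delete from report_cache;  -- report_cache is an invented pregenerated table
  insert into report_cache (customer_id, open_invoices, total_due)
    select customer_id, count(*), sum(amount)
    from invoices
    where status = 'OPEN'
    group by customer_id;
  commit;
end;
/

declare
  jobno binary_integer;
begin
  -- re-run the refresh every minute (1/1440 of a day)
  dbms_job.submit(jobno, 'refresh_report_cache;', sysdate, 'sysdate + 1/1440');
  commit;
end;
/

The report pages then select from report_cache instead of hammering the live tables.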

Leonardo Herrera
Thursday, June 20, 2002

If you are in a fairly big company, find your best DBA (ask around), go cry on his/her shoulder, and let them help you optimize, teaching you a few lessons in the process.

tk
Thursday, June 20, 2002

Most of the answers to your query *presume* that the slow part of the transaction is the database lookup. That is indeed the most likely source of the problem, but you really need to prove that one way or the other *before* starting to tune your application.

It would be a real shame to tune the DB and then discover that the delays were in the network stack ...

Matthew Webber
Thursday, June 20, 2002

What tools are you using to index your data?

As mentioned, any database that allows keyword indexing will run fast.  I mean, you can try to "scan" some text documents, but virtually all of the decent commercial database engines today have some kind of "text" indexing routines.  That means that each word in the text is indexed just like an individual field in a table.  This indexing takes a good deal of processing.  Once the indexing is done, performance is just fine.

I can't imagine what you are doing wrong, but even a cheap desktop database can perform a search on indexed data in thousandths of a second.  If the data is not indexed, well, then of course it is slow.

If things are indexed...then it is as we say a walk in the park.

You don't search text data without an index on that data ... do you?
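In Oracle, for example, interMedia Text (called Oracle Text in 9i) gives you word-level indexing in two statements.  A sketch, assuming the Text option is installed, with invented names:

-- build a word index over a free-text column
create index doc_text_idx on documents (body)
  indextype is ctxsys.context;

-- indexed keyword search instead of a full scan
select doc_id
from documents
where contains(body, 'shipment') > 0;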


Albert D. Kallal
Edmonton, Alberta Canada
kallal@msn.com

Albert D. Kallal
Thursday, June 20, 2002

> In other words, check you architecture and your toolset.
> I'm willing to bet Google goes "to the metal" (Flat File
> Databases in Assembler and C/C++) to get things fast.

> Otherwise, let me think on it ...

Actually, last I knew, Google was largely written in Python (not your "to the metal" kind of tool).  They structure their database across several (hundreds of!) machines and perform parallel queries across all the cheap boxes.  There's no magical formula for making something go fast.  I'm not sure it even warrants a paper.  It's just a matter of properly identifying the "hot spot" in the application and devising ways around it, via either hardware or software.

-james

James Wann
Thursday, June 20, 2002

I appreciate everybody who wrote in (with perhaps the exception of Vincent :-) ).

I understand that queries can be tuned, but oftentimes it's the schema that needs tuning.  Like Leonardo, I've built denormalized tables optimized for reporting and updated them every few minutes.  The trouble with that solution is that you introduce the possibility that other tools that hit the standard tables will report different results.  (Huge companies have different tools that do the same thing.)

Also, my company assures our customers the data is "up to the second".  Extra tables can violate this goal.  Changing the standard tables is typically beyond the scope of my powers.  That's another thing about giant companies: you don't have "jack of all trades" authority.  You read the tables, and you can't just create indexes as you need them.  As a matter of fact, it can be easier to get a bill through Congress than it is to get a schema change in place.

I suppose I already knew the answers to the questions I asked.  Queries are slow if the schema isn't built for reporting.  Change the schema or deal with the slow query. 

As for our hardware, we run Oracle databases on two parallel 64-CPU Sun E10Ks.  We're a huge company with huge traffic.  Anybody who assumed that I'm an idiot using MS Access, or that you're a genius because your cheesy blog site is running well, should think again.

On a final note, we have proven that it's the queries that are slow.  The timing code is in there.

Final note: As I reread this I worry that I took too much of a "tone".  I didn't mean it.  I was just trying to convey the constraints that I work with.

Feeling a bit shy
Thursday, June 20, 2002

Two final notes?  I need to proofread.

Feeling a bit shy
Thursday, June 20, 2002

You know, if your company isn't going to give you the authority to design a fast system, they should start promising the customers "up to 15 seconds ago" data.  How do web sites go so fast?  The companies that build them don't have a process in place that actively stymies improvement.  *There's* the answer you were looking for.

Alyosha`
Friday, June 21, 2002

ok, i'll be nicer this time. 

1.  You keep saying queries, which gets me to wonder whether you actually took one of the queries and tested it directly, or if you're just assuming that it's the queries because your program slows down when it tries to access data.  One of the biggest mistakes I see smart programmers making is doing multiple queries and then dealing with the data programmatically.  Almost anything you can do in your code, you can do in your SQL, and it will be much faster.

2.  If it does happen to be the actual query (or queries), you might be doing something wrong.  A lot of times big joins can return geometrically large intermediate results that then get filtered out by the "where clause", so you wouldn't really know that it happened.

3.  Let's say your queries and program are good, and it's the fault of the dumbass DBA's architecture that you're forced to use.  Write a stored procedure!  I don't think it matters how screwed up the structure is; stored procedures will solve your problems.  I'm guessing Oracle allows you to write them in C, C++, maybe even Java.

Vincent Marquez
Friday, June 21, 2002

The only person I see doing work on DB refactoring is Scott Ambler.  I think there is a lot more inertia and friction in refactoring a DB, but during development the DB and the code should be refactored equally often.  Then the "optimal" DB from development should be compared to the production DB and the mappings/porting done.

Karel
Friday, June 21, 2002

You have probably covered all this, but when you say "we have proven that it's the queries that are slow", do you mean that a single query running on an otherwise idle system is slow, or that a query in the production environment returns slowly?
In the former case, you really will need to consider a seriously different approach.  In the latter, if it's your queries that are overloading the box, you might want to try throttled superset queries at regular intervals to build on-the-fly report temp tables, and have your real queries go against those.  It might reduce your overall latency from the 15 seconds you have now.  Of course this would only apply in very specific circumstances, and in most cases might actually make things worse, but who knows :-)

To be honest, if you are fairly senior, as you say, and you are reporting a significant problem with a system that is obviously core business and it is just ignored, maybe it is time to start looking for new horizons.

Also, from the look of your setup it seems to me that the design has bought into "scale up".  That never works, since there are always going to be problems when you hit the biggest-box ceiling.  A "scale out" design forces you to deal with the real scalability issues at an early stage, but maybe that should be another thread :-)

Just me (Sir to you)
Friday, June 21, 2002

I saw a paper a couple of years ago about Google's original technical architecture.  Two key points I remembered were (a) it scales across zillions of machines, and (b) it doesn't use a relational database.  Why would they?  The things RDBMSs are good for aren't relevant to what they're doing.

On the other hand, a properly configured Oracle DB for a query-intensive application should be serving somewhere in the high 90-percents of hits from memory-cached data, using memory-cached precompiled queries, so response time for simple selects shouldn't be a big issue unless your database is really humongous like Google's.  Are you sure your database is your bottleneck?  You should think about converting your most frequently used queries into stored procedures; otherwise parsing the SQL could be eating up most of your response time.  And look carefully at your cache hit rates.
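The classic starting point is the buffer cache hit ratio out of v$sysstat (this is the standard DBA formula, nothing exotic):

select 1 - (pr.value / (bg.value + cg.value)) as cache_hit_ratio
from v$sysstat pr, v$sysstat bg, v$sysstat cg
where pr.name = 'physical reads'
  and bg.name = 'db block gets'
  and cg.name = 'consistent gets';

For a query-intensive app, anything much below the high 90-percents means you're going to disk too often.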

If I can find the paper again I'll post a link to it.

Alan Little
Friday, June 21, 2002

ok, this is how they originally did it

http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

and, a couple of years on, this is the 21st-century architecture they have migrated to

http://www.google.com/technology/pigeonrank.html

Alan Little
Friday, June 21, 2002

My past experience is that performance is in a bunch of little things.  We did an application where we were doing huge DB queries and reporting the results over SSL. Site was originally slower than molasses.  So we started analyzing everything.

Find a monitoring tool that will analyze the time needed to return each part of your page.  We found that we weren't managing our images right over SSL, so they weren't caching.  One image always timed out, and IE and NN delayed drawing the page until the timeout finished.  Fixing that was a big improvement.

Another one was one of those Oooops! moments.  One query was taking forever.  We were getting pounded by 1000s of similar queries due to a marketing push.  Our 4 machines were each doing 90%+ CPU utilization.  Did a little digging. No index on one crucial table.  We went from 90%+ to about 6%.

Start chipping away and test, test, test.

Bob

Bob Crosley
Friday, June 21, 2002

Ok, since a handful of you seem to be interested in the slow system I talked about I'll tell you more about it.

As I mentioned in the first post, it's a Yahoo-style portal.  The business problem it solves is that it makes data easier to find for our customers.  Over the years, my company has developed at least 20 tools to help people find what's going on with their orders, invoices, account balance, shipment status, etc.

So you can imagine a little portlet that shows invoices and when they are due.  Another will show order status (has it been built, shipped, etc).  Another will show standard configurations that you like to use when you order.  (You get the idea.)

The first version (before I entered the project) took 240 seconds to load the page.  Optimizing the queries took that down to around 30 seconds.  I did some of the optimizing myself.  For things that needed more advanced PL/SQL expertise or for things that I just didn't have time for I pulled in an Oracle expert to help me.

The other big win was making the application multi-threaded.  The first version (again, I didn't write it) would run the queries sequentially.  However, most of the data doesn't depend on earlier queries, so now the bulk of the data gathering happens in parallel.  That brought the overall response time of the tool down from the sum of all the queries (plus page assembly < 1 sec) to the "weakest link" query (plus page assembly).

Now our average users get a response time of around 7-8 seconds.  Our larger (and more important) customers get a response time of around 15 seconds.  I've moved on to my next project (where I work, this speed and scalability improvement was considered a big win, so they moved me to a more important project), but they are still looking into the "weakest link" portlets in a gradual effort to keep making the tool faster.

That experience had me wondering how somebody like Yahoo does it so much better.  As I mentioned, they have an advantage in that the data isn't quite as personal.  We all get the same AP news, right?  However, the page layout, stock quote choices, etc. must all be stored just for me.

Maybe the answer is as simple as, "They optimized their data for queries while I'm stuck hitting an Oracle 11i schema directly."  Still, it seemed like a good idea to do a little research.  I could finish a homework assignment and bring my lessons learned back to work.

Thanks again for the great links on Google and eToys.  The only other massive-traffic site I've found that explains how they do it is Microsoft.  I'd prefer not to use them because my company doesn't use Microsoft products (except on the desktop) and their "how we do it" stuff reads like a .NET advertisement.

Does anybody have a third huge site that explains what their architecture is like?

Thanks in advance.

Feeling a bit shy
Friday, June 21, 2002

Not exactly right on target, but well ...

http://sourceforge.net/docman/?group_id=1

http://www.informationweek.com/story/IWK20011207S0003

http://www.fastsearch.com/


http://www.intel.com/eBusiness/casestudies/success_stories.htm

Just me (Sir to you)
Friday, June 21, 2002

http://www.paulgraham.com/ is an
interesting website mostly about Lisp, but also
has some content on writing powerful web applications.

I found it particularly interesting to read the article about ViaWeb at http://www.paulgraham.com/lib/paulgraham/sec.txt, and a mention, without much detail, of what Yahoo do at http://www.paulgraham.com/faq.html in the question "What database did you use in Viaweb?", where the answer is:

"While we were doing Viaweb, we took a good deal of heat from pseudo-technical people like VCs and industry analysts for not using a database-- and for using cheap Intel boxes running FreeBSD as servers. But when we were getting bought by Yahoo, we found that they also just stored everything in files-- and all their servers were also cheap Intel boxes running FreeBSD.
"

Which does seem to answer at least some of the
questions about how Yahoo do things.

(If you're stuck with Oracle, go mad on query optimisation.  Investigate creating denormalised tables optimised for queries, populated by triggers on the fully normalised schema, or use materialised views.  EXPLAIN PLAN is your friend; you never want to see full table scans, and almost everything should be getting satisfied from indexes.)
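A rough sketch of the trigger idea, with invented table and column names; it assumes a summary row already exists for each customer:

create or replace trigger trg_invoice_summary
after insert on invoices
for each row
begin
  -- maintain the denormalised rollup as rows arrive
  update invoice_summary
     set open_count = open_count + 1,
         total_due  = total_due + :new.amount
   where customer_id = :new.customer_id;
end;
/

Reports read invoice_summary, which stays "up to the second" because the trigger fires in the same transaction as the insert.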

Michael Stevens
Friday, June 21, 2002

"you never want to see full table scans, almost everything
should be getting satisfied from indexes"

Wrong.  This sounds like the advice from a DBA that knows nothing but witch doctor magic.

*Some* full table scans will locate the appropriate record much faster than going through an index.  Of course this depends on the size of the table; scans do get worse as it grows.  Small tables usually don't benefit from indexes, and indexes can actually degrade their performance.

If you really want speed, then you have to do away with the DBMS.  After all, it's just another program between your real process and your data.

In one situation, I've seen a relatively simple and fast (less than an hour; this was batch, by the way) sequential-merge-based, full-function business process turned into an RDBMS-based, partial-function process that ran for 18 hours.  This was after "query optimization".  The first attempt ran for 80+ hours.

But the amazing thing is that the contractor company actually got paid for this kludge.

If you really need speed, leave the folk wisdom at home.

Joe AA.
Saturday, June 22, 2002

FYI, another time that indexes can hurt performance is when you need to look at or return a significant portion of the records in the table.

A rule of thumb is that if you're returning more than half of the table, it's better not to use the index.  Of course, your mileage may vary.

feeling a bit shy
Saturday, June 22, 2002

I think this is a bit nitpicky.  Yes, of course indexes don't give you much of an advantage on (a) small tables and (b) queries extracting a substantial fraction of the records in the DB.  Doesn't everyone know that?  But in the usual case (the selection of a few records from a very large dataset) they give an enormous advantage.

The real point is to avoid the usual optimization stupidities: first, don't optimize a purely imaginary problem that doesn't even exist to begin with, and then second, don't forget to measure your optimizations to see if they actually improved performance.

Alyosha`
Saturday, June 22, 2002

Friends,

I've been googling for hours, but I haven't found anything as nice as this: http://www.perl.com/pub/a/2001/10/17/etoys.html .

The Google MP3s are great and I'm going to feature them in my paper.  However, I really wish I had a third company to use.

Does anybody know of an article along the same lines as the etoys link above?

feeling a bit shy
Saturday, June 22, 2002

Now that we have a few more details, all that speculation about what is being done is certainly reduced (note, I said a *few* more details).

You still don't mention whether these searches are just on particular fields, or whether any type of "text" data is being searched (however, it seems quite clear you have a good grasp of the use of indexes, etc.).  You also did not mention a typical number of records being searched.  Also, the typical number of conditions is important.  Further, are those conditions on child tables?

One thing that SQL is poor at is doing conditions on child tables.  In fact, it really is bad!

For example, say we have a car lot, and we want to find out which salesmen sold both a blue and a red car.  For simplicity we would have a salesman table and a cars-sold table.

The query to return the above would be:

select SalesRep, Make, Model, Color
from tblSalesReps
inner join tblCarsSold
  on tblSalesReps.salesrepID = tblCarsSold.salesrepID
where color = 'red' or color = 'blue'

Oops, the above is wrong!  We need some type of sub query!

It's wrong because it returns any sales rep who sold a red OR a blue car, while I am looking for a sales rep who sold both a red AND a blue car.

select SalesRep, Make, Model, Color
from tblSalesReps
inner join tblCarsSold
  on tblSalesReps.salesrepID = tblCarsSold.salesrepID
where color = 'red'
and tblSalesReps.salesrepID in
  (select salesrepID from tblCarsSold where color = 'blue')

The above now is correct.  Man, let's hope we don't need 3 conditions, as this can get real messy!

However, notice our sub query.  There is a big mistake in it from a performance point of view ... can you spot it?  Our sub query returns every single salesman that sold a blue car, and THEN checks for a match on our salesman id.  To speed up the above, we have to restrict the sub query to *just* the current salesman.  Hence we should have written:

select SalesRep, Make, Model, Color
from tblSalesReps
inner join tblCarsSold
  on tblSalesReps.salesrepID = tblCarsSold.salesrepID
where color = 'red'
and tblSalesReps.salesrepID in
  (select salesrepID from tblCarsSold
   where color = 'blue'
   and salesrepID = tblSalesReps.salesrepID)

Of course, if we need 3 conditions ... then things start to get totally out of hand (that is one area where post-relational (multi-valued) databases are *way* simpler).

Thus, there still might be a lot that can be done for those conditions, and a lot more speed to be had.  As you mentioned, you have made huge gains in performance already.  It seems to me that if the tables are not that large and the conditions can be indexed, it should take but a blink of an eye.  On the other hand, perhaps your files are just very large, or you have a lot of conditions.  Have all the queries been optimized as well as possible?
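By the way, for this kind of "sold both colors" question there is also a group-by formulation that avoids the nesting entirely.  This is an alternative to the sub query above, not a rewrite of it, and it scales to N colors just by changing the count:

select SalesRep
from tblSalesReps
inner join tblCarsSold
  on tblSalesReps.salesrepID = tblCarsSold.salesrepID
where color in ('red', 'blue')
group by tblSalesReps.salesrepID, SalesRep
having count(distinct color) = 2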


Albert D. Kallal
Edmonton, Alberta Canada
Kallal@msn.com

Albert D. Kallal
Sunday, June 23, 2002

Some things I have learned the hard way:

* Oracle is fast at getting a record from a table based on an indexed key.
--- index foreign keys, especially if they are in constraints

* It is slow at aggregate functions (GROUP BY).
--- it can often be faster to do the aggregation outside the DB in procedural code

* It is slow at nested joins (WHERE ... IN (SELECT ... FROM)).
--- un-nest these to a flat join (see the sketch after this list)

* It is incredibly slow at anything to do with text manipulation (SUBSTR, concatenation, etc.).
--- just don't do it; do it in procedural code outside your DB
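For the nested-join point, the un-nesting looks like this (a sketch with invented names):

-- nested form; older optimizers handle this badly
select order_no from orders
where customer_id in
  (select customer_id from customers where region = 'WEST');

-- flattened to a plain join
select o.order_no
from orders o, customers c
where o.customer_id = c.customer_id
  and c.region = 'WEST';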

These can trip up people (like myself) who have upsized from the desktop, where things like string manipulation are standard programming techniques.

Lots of luck

Les Chandra
Monday, June 24, 2002

"Maybe the answer is as simple as, "They optimized their data for queries while I'm stuck hitting an Oracle 11i schema directly."

You probably already know this but,

Check that your SQL queries do not use function
calls in their WHERE clauses, which is a very efficient way of
getting full table scans. Example:

Consider a table that has an index like this:

CREATE INDEX FOO_X1 ON FOO(SOMETHING);

SELECT F.SOMETHING FROM FOO F, BAR B WHERE
F.ID=B.ID AND
SUBSTR(F.SOMETHING,1,10)=B.SOMETHING_ELSE;

The SUBSTR will disable index use and result in a full table scan.
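If you can't rewrite the query, one possible fix is an Oracle 8i function-based index, which indexes the expression itself.  A sketch; note these need the cost-based optimizer, QUERY_REWRITE_ENABLED set, and the QUERY REWRITE privilege:

-- index the expression so the SUBSTR predicate can use it
CREATE INDEX FOO_X2 ON FOO (SUBSTR(SOMETHING, 1, 10));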

You might want to check into the possibility of using materialized views.  If you have complicated joins, this will allow for a simpler query once the materialized view is created.  I don't know how often your data changes, and it may not be a good solution, since refreshing the materialized views can be costly.
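Roughly like this (invented names; the NEXT clause controls how stale the data is allowed to get):

CREATE MATERIALIZED VIEW INVOICE_SUMMARY_MV
  REFRESH COMPLETE
  START WITH SYSDATE NEXT SYSDATE + 1/24  -- full refresh every hour
AS
SELECT CUSTOMER_ID, COUNT(*) AS OPEN_COUNT, SUM(AMOUNT) AS TOTAL_DUE
FROM INVOICES
WHERE STATUS = 'OPEN'
GROUP BY CUSTOMER_ID;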

http://www.oracle.com/oramag/oracle/99-Sep/index.html?59bob.html

Hope it helps.

Patrik
Tuesday, June 25, 2002

Vaguely related to this:

If you're doing a lot of "summarizing," you might want to look into something like Redbrick.

http://www-3.ibm.com/software/data/informix/redbrick/

I used it back before IBM bought it, and it's disturbing exactly how fast it can create reports based on aggregate data; we're talking an order of magnitude or more faster than any b-tree system I could come up with.  They use some sort of bitmap indexing scheme instead of b-trees; the whole thing is kind of the analysis complement to a transaction system.

I believe SQL Server (I can hear the laughter now) and Oracle do have something like this now, but it's not quite as fast.
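In Oracle, at least, a bitmap index is a one-liner; they pay off on low-cardinality columns in read-mostly reporting tables (invented names here):

-- bitmap rather than b-tree; cheap to AND/OR across many conditions
CREATE BITMAP INDEX SALES_REGION_BIX ON SALES (REGION);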

Jason McCullough
Wednesday, June 26, 2002
