Fog Creek Software
Discussion Board




Hardcore Data Mining?

I've started to delve into the world of reporting on a grand scale, and I was wondering if anyone had some good advice or books on this topic.
I need to reduce the execution time of my (very complex: self joins, outer joins on NULLs, etc.) SQL queries. I've started to run reports on 10 million plus rows, and on a P4 with a gig of RAM, it's taking about 20 minutes to return the data I want. Am I pushing the limits of SQL? Are there books or resources that can help me get the 20 second return time I'm hoping for? Any advice would be appreciated.

Vincent Marquez
Monday, November 11, 2002

You should read all the forum topics before posting. :D

http://discuss.fogcreek.com/joelonsoftware/default.asp?cmd=show&ixPost=17639&ixReplies=5

Jason McCullough
Monday, November 11, 2002

1) Make sure you have the right indexes (see the sketch after this list).
2) Then make sure your queries are as efficient as they can be.
3) If one query takes too long, consider using temporary tables, etc.
4) Don't sweat the small stuff; concentrate on the biggies.
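
A minimal sketch of point 1, assuming a hypothetical report that filters and joins on a CustomerID and an OrderDate column (all names here are invented for illustration):

    -- A composite index covering the columns the report
    -- filters and joins on, so table scans can become index seeks.
    CREATE INDEX IX_Orders_CustomerID_OrderDate
        ON Orders (CustomerID, OrderDate)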

Alberto
Monday, November 11, 2002

Oh, and use any query analysers you can lay your hands on to work out the bottlenecks. I've been using SQL Server/VB, which provides some excellent tools for this sort of stuff. I'm sure some DBA types could point you in the right direction if you tell us what technology you're using.
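
In SQL Server, for example, these standard session options will show you the plan, the reads, and the timings (the first option shows the plan without actually running the query):

    -- Option 1: show the estimated plan as text; the query is not executed.
    SET SHOWPLAN_TEXT ON
    GO
    -- Option 2: execute the query and report I/O and timing statistics.
    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    GO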

Alberto
Monday, November 11, 2002

I've indexed everything OK, and I'm pretty sure that my queries are solid (hopefully I can find some way to optimize them, though). I've looked at following the "execution path" that SQL Query Analyzer shows, but I'm not quite sure how to take that and use it to improve my queries. Most of the books and docs I've seen seem very basic, or more related to drag-and-drop query-builder type stuff.

If anyone has some tips or good books on this type of thing, I'm all ears. :-)

Vincent Marquez
Monday, November 11, 2002

"The Guru's Guide to SQL Server" and "Transact SQL Programming" books are both very good.

For the Query Analyzer, look at the parts of the execution path where it spends the largest percentage of its time. Then look up what those terms mean, and whether there is a way to make them go faster. With a little reading it will be easy to see where you need a more selective query, or an index, or to remove a function, or to break the query up into separate queries that use temporary tables.
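
"Remove a function" usually means rewriting a predicate so an index can actually be used. A hypothetical example (table and column names invented):

    -- Slow: the function on the column defeats any index on OrderDate.
    SELECT * FROM Orders WHERE YEAR(OrderDate) = 2002

    -- Faster: an equivalent range predicate lets the index do the work.
    SELECT * FROM Orders
    WHERE OrderDate >= '2002-01-01' AND OrderDate < '2003-01-01'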

It's hard to say how much faster your 20-minute queries could be. It really depends on your hardware, what your schema looks like, and what you are trying to do with it. However, in my experience, 20 minutes is a very long time for even extensive munging on 10m rows. On databases of a similar size, I've always been able to tune queries down to less than a few minutes (at which point I stop).

Also, performance gains from tuning are quite jumpy. Sometimes you'll cut the run time to a tiny fraction of what it was by splitting one honker query with lots of joins into several smaller ones with temporary tables. Other times you can spend hours making dozens of small improvements and only yield a 20 or 30% improvement.
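
The temp-table split looks roughly like this (a sketch with invented names): materialize the small filtered set first, then join it to the big tables.

    -- Stage the filtered rows once...
    SELECT OrderID
    INTO #RecentOrders
    FROM Orders
    WHERE OrderDate >= '2002-01-01'

    -- ...then join the much smaller temp table to the rest.
    SELECT d.ProductID, SUM(d.Quantity) AS Units
    FROM #RecentOrders r
    JOIN OrderDetails d ON d.OrderID = r.OrderID
    GROUP BY d.ProductID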

Matt Sponer
Monday, November 11, 2002

More RAM, a drive array, and don't touch a cursor with a ten-foot pole.

You don't specify whether this is also being used as a transaction processing DB, so I'll assume not. Going to 2 gigs of memory might help, especially if you are going to be rerunning these queries. In that case, turn your queries into stored procs. After they run, a lot of your database will be in RAM and subsequent query times will improve. There is a tool called Coefficient (http://www.idisoft.com/) that can help you see whether stored procedures are recompiling or not.
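
Wrapping a report query in a proc is simple enough; a hypothetical sketch (names invented), which also gives the server a compiled plan to cache and reuse:

    CREATE PROCEDURE usp_SalesReport
        @StartDate DATETIME,
        @EndDate   DATETIME
    AS
        -- The plan compiled here is cached and reused on later runs.
        SELECT CustomerID, SUM(Total) AS Sales
        FROM Orders
        WHERE OrderDate >= @StartDate AND OrderDate < @EndDate
        GROUP BY CustomerID
    GO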

Another option for looking at the overall performance of SQL Server is Spotlight (http://www.quest.com/spotlight-portal/). It will tell you how hard RAM, disks, etc. are being hit. Both Coefficient and Spotlight have trial versions.

Probably a bigger bottleneck is the hard drive; I assume you only have one. Turn on perfmon, and make sure you have the drive performance counters enabled: run "diskperf -y" at the command prompt (reboot required).
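
The standard disk counters are the ones to watch (a starting point, not an exhaustive list):

    PhysicalDisk \ % Disk Time
    PhysicalDisk \ Avg. Disk Queue Length
    PhysicalDisk \ Disk Reads/sec, Disk Writes/sec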

If you have drive I/O issues, get an array of drives; better yet, more than one array. Locate the huge tables on separate arrays. I'm talking separate controllers for each array as well. Put the log file on a separate drive on a separate controller.
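
In SQL Server you'd spread the tables with filegroups; a rough sketch, with the database name, file path, and table entirely hypothetical:

    -- Add a filegroup whose file lives on the second array...
    ALTER DATABASE Reports ADD FILEGROUP BigTables
    ALTER DATABASE Reports
        ADD FILE (NAME = 'BigTables1', FILENAME = 'E:\Data\BigTables1.ndf')
        TO FILEGROUP BigTables

    -- ...and create the huge table on that filegroup.
    CREATE TABLE OrderDetails (
        OrderID   INT,
        ProductID INT,
        Quantity  INT
    ) ON BigTables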

If you can handle dirty reads, use NOLOCK on your queries. Very useful if there are transactions going on (which I assumed earlier there weren't).
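
NOLOCK is just a table hint, e.g. (hypothetical table):

    -- Reads without taking shared locks; may see uncommitted data.
    SELECT CustomerID, Total
    FROM Orders WITH (NOLOCK)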

These are costly options, but speed WILL improve.

ryan ware
Monday, November 11, 2002

"Hardcore data mining"

You want to get data on who is appearing on, or visiting, which porn site, exactly?


Tuesday, November 12, 2002
