Fog Creek Software
Discussion Board




Optimizing a query

I have a moderately complex query in an SP. The main table currently has 100k rows and will grow to millions. This query is the central select for a busy web application, so I want it to be as efficient as possible.

The resultset will always be 15 records (one page in a grid), and the records of interest will generally be within a thousand (one user's recent records).

Currently I'm using a subselect (a derived table) to "prefilter" the records:

SELECT Documents.DocumentID,
  Documents.docTimestamp,
  luDocumentType.DocumentType,
  Documents.DocumentNumber,
  CASE
      WHEN Documents.DocumentTypeID=2
                THEN Buyer.CompanyName
      WHEN Documents.DocumentTypeID=1
                THEN Supplier.CompanyName
  END AS OtherParty,
  AckStatusID,
  AckID,
  AckTimeStamp,
  Documents.ProcessedTS
FROM (SELECT * FROM Documents
          WHERE docTimestamp>@TargetDate) AS Documents
  INNER JOIN luDocumentType  -- this join was missing from the posted query; key assumed
        ON Documents.DocumentTypeID=luDocumentType.DocumentTypeID
  INNER JOIN Companies AS Supplier
        ON Documents.SupplierID=Supplier.CompanyID
  INNER JOIN Companies AS Buyer
        ON Documents.BuyerID=Buyer.CompanyID
WHERE ((Documents.BuyerID=@CompanyID
    AND Documents.DocumentTypeID=1)
    OR (Documents.SupplierID=@CompanyID
    AND Documents.DocumentTypeID=2))
    AND Documents.docTimestamp > @TargetDate
    AND (Documents.DocumentTypeID=@DocTypeID
    OR @DocTypeID=-1)


However the use of the "prefilter" needs to be a little more complex, and I'm wondering if I might be better off creating a temp table and using that instead.

Any thoughts on pros/cons each way? Or is it a matter of try them both and profile them?

Philo

Philo
Thursday, September 11, 2003

I think the usual answer is "try them both and profile them" if you're worried about performance.  :)

You might ask on microsoft.public.sqlserver.programming.  Some sharp cookies there.

I'm no optimization expert, but if you go the temp table route I would at least experiment with breaking this select out into two separate subprocs, one for buyer and one for supplier.  You can then choose one or the other from your main stored procedure, and populate a temp table there using INSERT...EXEC calling the subproc.  Having separate query plans for buyer and supplier could make a difference.
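
A rough sketch of the INSERT...EXEC approach Matt describes. The subproc names, the @Direction parameter, and the #Results column list here are all hypothetical; the temp table's columns must match the subprocs' SELECT lists exactly.

```sql
-- Hypothetical sketch: populate a temp table from one of two subprocs,
-- so buyer and supplier each get their own cached query plan.
CREATE TABLE #Results (
    DocumentID   int,
    docTimestamp datetime,
    OtherParty   varchar(100)
)

IF @Direction = 'buyer'
    INSERT INTO #Results
    EXEC dbo.GetBuyerDocuments @CompanyID, @TargetDate
ELSE
    INSERT INTO #Results
    EXEC dbo.GetSupplierDocuments @CompanyID, @TargetDate

SELECT * FROM #Results
```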

Matt Conrad
Thursday, September 11, 2003

Problem is that buyer/supplier is based on the type of document, and a query can return both types.

So if the record is an invoice in the inbox, [OtherParty] is the Supplier. If it's a purchase order in the inbox, [OtherParty] is the Buyer.

Tricky construct, but it's served me well so far.

Philo

Philo
Thursday, September 11, 2003

Ah, I see.  I was reading that as always filtering on one @DocTypeID or the other.  My error.  If you are usually pulling down both types I wouldn't bother with playing with subprocs.

I've got to get back to work, but I'm curious about one thing.  When you're talking about your subselect, you mean that whole statement you posted, and not just the (select * from docs where ts > @targetdate), right?

Matt Conrad
Thursday, September 11, 2003

Depending on the distribution of the data, the subselect "WHERE docTimestamp>@TargetDate" may not be an optimal query, and the "OR (Documents.SupplierID=@CompanyID AND Documents.DocumentTypeID=2)" can also cause less than optimal performance (because of the "OR").

As already stated, you need to test any proposed queries with your actual data (preferably with something closer to the anticipated volume of data, if possible), but I would suggest you will probably get better performance by splitting the query into two separate queries UNIONed together (and without the "prefilter" - let SQL Server do its own optimization/pre-selecting), such as:

SELECT Documents.DocumentID,
  Documents.docTimestamp,
  luDocumentType.DocumentType,
  Documents.DocumentNumber,
  Supplier.CompanyName AS OtherParty,
  AckStatusID,
  AckID,
  AckTimeStamp,
  Documents.ProcessedTS
FROM Documents
  INNER JOIN Companies AS Supplier
        ON Documents.SupplierID=Supplier.CompanyID
  INNER JOIN luDocumentType  -- join added; absent from the posted query, key assumed
        ON Documents.DocumentTypeID=luDocumentType.DocumentTypeID
WHERE (Documents.BuyerID=@CompanyID
    AND Documents.DocumentTypeID=1)
    AND Documents.docTimestamp > @TargetDate
    AND (Documents.DocumentTypeID=@DocTypeID OR @DocTypeID=-1)
UNION ALL
(the same query for DocumentTypeID=2, matching the SupplierID and selecting Buyer.CompanyName AS OtherParty)

[I may not have the correct conditions in this re-written query, but it should give you the concepts.]
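
Spelled out from the CASE/WHERE logic in the original query, that second branch would look something like this (the luDocumentType join, absent from the thread's queries, is added on an assumed key):

```sql
-- Second UNION ALL branch, reconstructed from the original query:
-- DocumentTypeID = 2 matches on SupplierID, and the buyer is the
-- "other party" shown to the user.
SELECT Documents.DocumentID,
  Documents.docTimestamp,
  luDocumentType.DocumentType,
  Documents.DocumentNumber,
  Buyer.CompanyName AS OtherParty,
  AckStatusID,
  AckID,
  AckTimeStamp,
  Documents.ProcessedTS
FROM Documents
  INNER JOIN Companies AS Buyer
        ON Documents.BuyerID=Buyer.CompanyID
  INNER JOIN luDocumentType  -- assumed join key
        ON Documents.DocumentTypeID=luDocumentType.DocumentTypeID
WHERE (Documents.SupplierID=@CompanyID
    AND Documents.DocumentTypeID=2)
    AND Documents.docTimestamp > @TargetDate
    AND (Documents.DocumentTypeID=@DocTypeID OR @DocTypeID=-1)
```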

Philip Dickerson
Thursday, September 11, 2003

Hmmm - good call on the Union, I may try that.
Interesting thing about the subselect - I added it in the first place because of a 50k-record clustered index scan. Putting the subselect in cut every index scan to under 200 records, and reduced the query from 2-3 seconds to a few milliseconds.

I've done some more research, and found that the general consensus is that derived tables (what I'm doing) are generally faster than temp tables or table variables. So I guess I'll stick with it.

Philo

Philo
Thursday, September 11, 2003

It looks like you have a duplicate comparison on the timestamp: it appears both in the derived-table subquery and in the outer WHERE clause.  Remove the outer one and you'll save some cycles.  Also, I suspect you are losing some time to the OR conditions (although the optimizer may be smart enough to expand them).  I suggest timing it as it stands, then timing it again without the ORs.  You can remove the OR conditions by splitting it into multiple queries combined with a UNION.

current query
----------------------------
(the query as posted at the top of the thread)
----------------------------

I would split the two conditions in the WHERE clause (on DocumentTypeID) into two queries.  This has the added benefit of eliminating the CASE expression in the SELECT, which saves a bit of processing.

Lou
Thursday, September 11, 2003

Not that I mind reading through questions about TSQL, but the guys at www.sqlteam.com love to answer these types of questions.

Jason Watts
Thursday, September 11, 2003

I've been learning recently that no matter how complicated the query, there seems to be a way to do it using derived tables (or user-defined functions that return tables). Temp tables, unless kept specially separated from your normal data, tend to fill up the transaction log pretty fast; plus, on most RAID systems, reading from disk is faster than writing to it.

Justin K.
Thursday, September 11, 2003

A little tip: you may want to use UNION ALL instead of UNION if you know there will be no duplicate records.  I like

http://www.sql-server-performance.com/

for sql server performance tuning.
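
To illustrate the difference with a trivial pair of statements:

```sql
-- UNION removes duplicate rows (at the cost of an extra sort or
-- hash step); UNION ALL simply concatenates the two result sets.
SELECT 1 AS n
UNION
SELECT 1        -- returns one row: (1)

SELECT 1 AS n
UNION ALL
SELECT 1        -- returns two rows: (1), (1)
```

In Philo's case the two branches can never overlap (DocumentTypeID is 1 in one branch and 2 in the other), so UNION ALL is safe and skips the deduplication step.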

Will Portnoy
Thursday, September 11, 2003

Philo,

<quote>
However the use of the "prefilter" needs to be a little more complex, and I'm wondering if I might be better off creating a temp table and using that instead.
</quote>

Probably. But to be sure, add lots of data to the tables and try it for yourself.

Seeya

Matthew
Thursday, September 11, 2003

Looks like you have a star relationship around the document.

Anonymous
Thursday, September 11, 2003

Hi Philo,

Why not create a physical table called CompanyDocuments or something, which contains all the fields you need for your query, and use the scheduler to populate this on a periodic basis.

Essentially you'd just put the query you've already created (without the where clause) into a stored procedure, and use that to populate the table. Then create another procedure which just selects the top 15 or whatever, and includes your where clause.

This will be a lot faster than doing the subqueries and joins etc. inside your query.

If you need the results to be "up to the second" there are ways around this as well - you can create two tables, and have a function which returns the "current table name" to the population procedure each time it runs (it would also handle the toggling between the two). This way one table is being generated while the other is in use, and the cutover to the live table is seamless.
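
A minimal sketch of that toggle, assuming made-up names throughout (SnapshotControl, CompanyDocuments_A/_B); the repopulation step would need dynamic SQL since the target table name varies:

```sql
-- A one-row control table records which snapshot table is live.
CREATE TABLE dbo.SnapshotControl (LiveTable sysname NOT NULL)
INSERT INTO dbo.SnapshotControl VALUES ('CompanyDocuments_A')

-- Scheduled refresh job: rebuild the offline table, then flip the flag.
DECLARE @live sysname, @offline sysname
SELECT @live = LiveTable FROM dbo.SnapshotControl
SET @offline = CASE @live WHEN 'CompanyDocuments_A'
                          THEN 'CompanyDocuments_B'
                          ELSE 'CompanyDocuments_A' END
-- ...truncate and repopulate the @offline table here (dynamic SQL)...
UPDATE dbo.SnapshotControl SET LiveTable = @offline   -- the cutover
```

Readers consult SnapshotControl to pick the table, so the swap is a single-row update.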

I realise this is denormalisation, which is supposed to be naughty, but so long as you keep your source tables normalised I don't see the problem.

Cheers,
Rhys

Rhys Weekley
Thursday, September 11, 2003

You could also create a view that consists of the tables that you're inner-joining on, then select from the view.  I've found this to be faster in certain situations.

Be mindful that when you alter columns in the view's underlying tables you have to re-create (or refresh) the view, because the view's metadata is bound to column position rather than column name.

Wayne
Friday, September 12, 2003

Make sure and look at the query plan to see what the optimizer is actually transforming your query into.

Jason McCullough
Friday, September 12, 2003

Rhys- Actually that won't address the problem at all. The joins aren't the issue, the issue is that most of the time users are only looking at the top 5% of the data, but they also need to be able to see all of it.
In addition, keeping the CompanyDocuments table synchronized is one more thing that can go wrong.

Wayne - views aren't faster, and can't be faster: in an SP the query optimizer looks at the entire request and optimizes based on the entire request. With a view you're hiding part of the equation from the QO, thereby hobbling it. My guess is that in the cases where it was faster, your original query wasn't the best, and splitting it made you clear up some issues.

Jason - I'm watching the plan like a hawk; that's what got me onto the derived table in the first place (the clustered index scan went from 85,000 records to 1,100)

I think I'm done with it - I split the cases into a UNION query, using derived tables. The query takes 80 msec with anything I throw at it, so I'm happy with it for now.
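
For the record, a sketch of the shape Philo describes - one branch per document type, each with its own derived-table prefilter, combined with UNION ALL. This is a reconstruction, not his actual final query, and the column list is abbreviated relative to the one at the top of the thread:

```sql
-- Branch 1: incoming documents of type 1; the supplier is the other party.
SELECT d.DocumentID, d.docTimestamp, d.DocumentNumber,
       Supplier.CompanyName AS OtherParty
FROM (SELECT * FROM Documents
      WHERE docTimestamp > @TargetDate
        AND BuyerID = @CompanyID
        AND DocumentTypeID = 1) AS d
  INNER JOIN Companies AS Supplier
        ON d.SupplierID = Supplier.CompanyID
UNION ALL
-- Branch 2: type 2 documents; the buyer is the other party.
SELECT d.DocumentID, d.docTimestamp, d.DocumentNumber,
       Buyer.CompanyName AS OtherParty
FROM (SELECT * FROM Documents
      WHERE docTimestamp > @TargetDate
        AND SupplierID = @CompanyID
        AND DocumentTypeID = 2) AS d
  INNER JOIN Companies AS Buyer
        ON d.BuyerID = Buyer.CompanyID
```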

Thanks for all the help!
Philo

Philo
Friday, September 12, 2003

Sorry, I did not mention that you can index the view.  Here's a good article on it: 

http://www.sqlteam.com/item.asp?ItemID=1015

According to the article: "With SQL Server 2000 (Enterprise Edition), Microsoft has introduced the concept of Indexed Views, which can make your applications and queries run faster in the right circumstances"

The beneficiaries of indexed views are applications that join big tables which get selected against using the same keys all the time.  Your inner joins between the Documents and Companies tables may qualify.

Basically, you create a view and then add indexes to it.  The indexes obviously take up space, but depending on how often this set of data is queried, it may be worth the disk space.
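
A minimal sketch using the table names from the thread (the view and index names are made up). Indexed views require SCHEMABINDING and two-part object names, and the first index on a view must be unique and clustered:

```sql
-- Hypothetical indexed view over the Documents/Companies join
-- (SQL Server 2000 Enterprise Edition for automatic plan matching).
CREATE VIEW dbo.vwDocumentSupplier
WITH SCHEMABINDING
AS
SELECT d.DocumentID, d.docTimestamp, d.DocumentNumber,
       d.BuyerID, d.DocumentTypeID,
       c.CompanyName AS SupplierName
FROM dbo.Documents AS d
  INNER JOIN dbo.Companies AS c
        ON d.SupplierID = c.CompanyID
GO

-- This unique clustered index materializes the view's rows on disk.
CREATE UNIQUE CLUSTERED INDEX IX_vwDocumentSupplier
    ON dbo.vwDocumentSupplier (DocumentID)
```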

Wayne
Friday, September 12, 2003
