Fog Creek Software
Discussion Board





Joel on Software

SQL IN statement with > 255 members

SELECT id FROM m
WHERE id IN (1,2,3,...250)
OR id IN (251,252,...500)
OR id IN (501,502,...750)
OR id IN (751,752,...1000)
....

Creating and running a query like the one above is a standard "Java" way to run these giant IN SQL statements.  Each IN () must have fewer than 255 members (I used 250), and there should be fewer than 255 INs in total.  It runs fine with SQL*Plus and JDBC.
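The chunking itself is mechanical. Here is a rough Python sketch of the string construction (the function name is made up, the table and column come from the query above, and real code would use bind variables rather than literals):

```python
def build_chunked_in_query(ids, chunk_size=250):
    # Split the ids into groups of at most chunk_size, render each group
    # as its own IN () list, and OR the lists together.
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    predicates = ["id IN (%s)" % ", ".join(str(v) for v in chunk)
                  for chunk in chunks]
    return "SELECT id FROM m WHERE " + " OR ".join(predicates)
```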

When I run this in .NET C#, I'm having a terrible time with it -- 200 members complete in 2 seconds, 500 in 10 seconds, and then 1000 chokes the server completely!

Can someone offer a pointer as to which FM to read?  I tried Google, but found little of value.

Here is the code that runs the giant query above:

using (DataAccessor objDataAccessor = new DataAccessor())
{
    objDataAccessor.DBCommandType = DataAccessor.SQL_COMMAND_TYPE;
    objDataAccessor.Execute(finalSqlQuery, ValidationList.LOAN_VALIDATION_TABLE, listObj);
}

public void Execute(string pi_strQueryName, string pi_strTableName, DataSet pio_objDataSet)
{
    OracleDataAdapter objOracleDataAdapter = CreateAdapter(pi_strQueryName, null);
    objOracleDataAdapter.Fill(pio_objDataSet, pi_strTableName);
    if (l_objDbTransaction == null)
    {
        this.Close();
    }
}

Thanks in advance,

Greg

Greg
Tuesday, January 11, 2005

Can you explain why you have to have all these IN clauses in the first place?  It seems pretty hacky.  There's no other predicate clause to get the records you want?

?
Tuesday, January 11, 2005

Basically, I'm validating a huge list of user-entered keys/ids from an Excel spreadsheet.  Limit is 15,000 keys.

So, I'm basically doing

"select id from M where id in (user entered ids)"

And then finding the ids that "fall out" using a simple hashtable.
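The "fall out" step, as a minimal Python sketch (a set in place of the hashtable; the function name is invented):

```python
def find_invalid_ids(user_ids, returned_ids):
    # Ids the user entered that the SELECT did not return are invalid.
    returned = set(returned_ids)
    return [i for i in user_ids if i not in returned]
```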

I hope that answers the question.

Greg

Greg
Tuesday, January 11, 2005

I'm not sure how to answer your original question, but you might get better performance if you dump the IDs from Excel into a temp table and do:

select ID from M where ID in (select ID from mytemp)

I'm no query optimization expert but it seems like that'd be a little more efficient.
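Something like this, sketched in Python with sqlite3 standing in for Oracle (table names as above; purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE m (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO m (id) VALUES (?)",
                 [(i,) for i in range(1, 101)])

# Dump the user-entered ids into a temp table...
conn.execute("CREATE TEMP TABLE mytemp (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO mytemp (id) VALUES (?)",
                 [(i,) for i in (5, 50, 500)])  # 500 does not exist in m

# ...and let the engine resolve the IN against the table.
valid = [row[0] for row in
         conn.execute("SELECT id FROM m WHERE id IN (SELECT id FROM mytemp)")]
```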

?
Wednesday, January 12, 2005

Some SQL Engines don't optimize this well.

SELECT <whatever>
FROM <wherever>
WHERE <something> IN (1,2,3, ... , 255)

Will get translated, when compiled, to:

SELECT <whatever>
FROM <wherever>
WHERE <something> = 1 OR
<something> = 2 OR
<something> = 3 OR
<something> = 4 OR
<something> = 5 OR
...
OR <something> = 253
OR <something> = 254
OR <something> = 255


It basically busts it out into 255 individual OR operations.

The data access engine will then have to do the index seek (if indexed) or table scan (if not indexed) once for each of the ORs. You'll see performance go down. The run time of such a query in SQL engines that do this will be proportional to the number of elements in the IN () list. The more elements in the list, the longer it will run.

In systems that do this, I find it's usually much quicker (by orders of magnitude) to create a temporary table to hold the IN() list and join to that table, rather than use an IN () list.

I'd try putting your spreadsheet in a table first, and then going the JOIN route. Look at OUTER JOINs if you want the ones that "fall out".
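A sketch of the OUTER JOIN variant, again with sqlite3 standing in for the real engine (table names invented): rows where the right side comes back NULL are exactly the keys that "fall out".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE m (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO m (id) VALUES (?)",
                 [(i,) for i in range(1, 101)])

# The spreadsheet ids go into their own table first.
conn.execute("CREATE TEMP TABLE keys (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO keys (id) VALUES (?)", [(7,), (70,), (700,)])

# LEFT OUTER JOIN: where m.id is NULL, the key does not exist in m.
missing = [row[0] for row in conn.execute(
    "SELECT k.id FROM keys k LEFT JOIN m ON m.id = k.id WHERE m.id IS NULL")]
```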

Sgt. Sausage
Wednesday, January 12, 2005

If I read your initial comment correctly, it works fine and all that you've changed is the front end. So we don't need to look at the SQL itself. I seem to recall (without the time to do any research right now) that there are several different data providers available for Oracle - at least one each from Microsoft and Oracle, as well as at least one third party in the business. I'd look at whether switching data providers gives you any better results.

Mike Gunderloy
Wednesday, January 12, 2005

Thanks everyone for all the responses. Mike is correct -- the SQL is fine. 

I must say I'm a bit surprised by the comments offered.  Perhaps I did not explain the situation well. If I did 15,000 inserts and then a left join, that would take a lot longer, without providing any real benefit.  The only part that *should* grow proportionally is the SQL data retrieval, since I already have an index on my id column and, like I said before, the SQL*Plus version of this query works very well, in under 300 milliseconds.

A less elegant solution I have come up with since yesterday is a loop:

select... where id IN (1...250)

get the result, then do the next 250 ids in a series:

select... where id IN (251...500)

etc.

If I do this, 15000 ids come back in less than 5 seconds TOTAL.
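The loop is straightforward; a Python/sqlite3 sketch of the same idea (bind-variable placeholders instead of literals, function name invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE m (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO m (id) VALUES (?)",
                 [(i,) for i in range(1, 1001)])

def validate_in_chunks(conn, ids, chunk_size=250):
    # One small IN () query per chunk; merge the results client-side.
    found = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        placeholders = ", ".join("?" for _ in chunk)
        found.extend(row[0] for row in conn.execute(
            "SELECT id FROM m WHERE id IN (%s)" % placeholders, chunk))
    return found

found = validate_in_chunks(conn, list(range(1, 601)) + [9999])
```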


If I do this:

SELECT id FROM m
WHERE id IN (1,2,3,...250)
OR id IN (251,252,...500)
OR id IN (501,502,...750)
OR id IN (751,752,...1000)

for just 1000 ids, it chokes.  Apparently something other than a straight OR chain:

SELECT <whatever>
FROM <wherever>
WHERE <something> = 1 OR
<something> = 2 OR
<something> = 3 OR
<something> = 4 OR
<something> = 5 OR
...
OR <something> = 253
OR <something> = 254
OR <something> = 255

is going on here.

The loop solution above (selecting 250 at a time) seems to go against the conventional wisdom that every SQL statement carries some overhead -- thus my desire to implement the (admittedly much simpler and more elegant) original solution of getting everything in one shot.  Like I said, the original works very well in JDBC and SQL*Plus, but chokes .NET.

Mike, I'm afraid I can't mess with drivers either -- I work for a large bank and they would need something on the order of 6 months of testing to switch drivers. 

Maybe there is some re-sizing problem with the DataSet object I'm missing here?  Does anyone have experience simply retrieving large data sets from Oracle in .NET?  This is somewhat stupefying, since I have little knowledge of what's under the hood in .NET.  I have a feeling it is a driver problem, so I guess I'm as good as stuck with the loop solution...

Thanks again for all the responses,

Greg

Greg
Wednesday, January 12, 2005

Have you tried a different way of running through the result set? A DataSet is nice, but it can be slow for large data. On the other hand, a forward-only DataReader that grabs the returned ids into an integer array might work quicker.
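The difference is buffer-everything versus stream-as-you-go. A Python/sqlite3 analogy (fetchall() playing the DataSet, cursor iteration playing the forward-only reader):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE m (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO m (id) VALUES (?)",
                 [(i,) for i in range(1, 10001)])

# DataSet-style: materialize the whole result set in memory at once.
all_rows = conn.execute("SELECT id FROM m").fetchall()

# DataReader-style: walk the cursor forward-only, keeping just the ints.
ids = []
for (id_,) in conn.execute("SELECT id FROM m ORDER BY id"):
    ids.append(id_)
```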

This site gives some useful info: http://www.sitepoint.com/article/dataset-datareader

regards,
S

SK
Wednesday, January 12, 2005

I would definitely go with inserting these values into a temporary table (a true Oracle temporary table) and joining to it in the query.

IN lists are limited to 1,000 values in Oracle, I believe, and this method would allow you your full 15,000 values.

The methodology would be ...

i) bulk insert the 15,000 (or however many) values into a global temporary table.

ii) Call dbms_stats.set_table_stats() to tell the optimizer how many rows you just inserted

iii) Execute the query as a join.

Faster, scalable, probably easier to code.
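A sketch of steps i)-iii), with Python and sqlite3 as a stand-in (sqlite has no dbms_stats -- ANALYZE is the rough analogue -- and its TEMP table is not a true Oracle global temporary table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE m (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO m (id) VALUES (?)",
                 [(i,) for i in range(1, 15001)])

# i) bulk insert the values into a temporary table
conn.execute("CREATE TEMP TABLE gtt_ids (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO gtt_ids (id) VALUES (?)",
                 [(i,) for i in range(1, 15001)])

# ii) in Oracle: dbms_stats.set_table_stats() so the optimizer knows the
#     row count; sqlite's nearest equivalent is ANALYZE
conn.execute("ANALYZE")

# iii) execute the query as a join
valid = conn.execute(
    "SELECT m.id FROM m JOIN gtt_ids g ON g.id = m.id").fetchall()
```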

David Aldridge
Tuesday, January 18, 2005
