Fog Creek Software
Discussion Board




REfactoring large procedure

I have just been tasked with optimizing a stored proc in SQL Server 2000.  It's fairly big (for me), with about 1000 lines, using 2 concurrent cursors. 

How do I optimize this?  I'm not looking for SQL help, I'm looking for refactoring help.  Converting huge if/else -> switch/case is easy enough, but what about redesigning to eliminate the cursors?  How do I keep track of everything that's going on?  Do I write inputs on a white board?  print out the code, and lay it out on the floor to look at it?  What methods do you guys use to refactor something like this?

Also, I would like to make sure I don't change functionality (of course).  Unit tests would be good, but how do I utilize these in a proc? 

Any ideas will be much appreciated.

I'm currently running the proc w/ test data & tracing it in Profiler.  Then at least I can see where the problems are.

nathan
Tuesday, November 25, 2003

To start with you need to have a representative set of data on which to perform the procedure.  Try to create a data set for each potential case the procedure is currently expected to handle and a few that it doesn't handle (if any).

Now run the procedure on that data set and record the output.  That's your baseline.  All the changes you make should never alter that baseline return (aside from any currently unhandled cases).

I would suggest looking carefully at your code and seeing if it naturally breaks anywhere.  That's a place to start modularizing the code.  But it may be too early for that based on the way you describe the code.

The biggest thing you'll ever hear in refactoring is likely to be "never make multiple changes at the same time".  It is too tempting to be halfway into one change and start making another.  Before you know it you'll have broken the entire thing.  Start by cleaning up all your IFs and any poorly handled exceptions.  Then clarify with comments anywhere that is likely to give you trouble later or processes things strangely or for special cases.

Make sure you use clear variable names.  Try to work on a coherent chunk of code at any moment.  Remember you aren't trying to replace code full out, you're trying to remove the cruft and edit the code so it is clearer and more concise.  If that means replacing two cursors with one giant cursor with a correlated subquery, so be it, but that's a huge job which requires lots of testing.

Test often.  More than you think is necessary.

Lou
Tuesday, November 25, 2003

As for eliminating cursors, I can only suggest the obvious: use temp tables, whenever possible.

As for keeping track, I recommend some kind of source code control system, if you're not already using one, so you can restore a previous (working) version and include comments on what changes you made.

As for inputs and unit tests, you may want to have another table for the inputs and use whatever programming language you're using now to create the unit tests (this might be easier than trying to use SQL and whatever proprietary extensions your db vendor provides).

Full name
Tuesday, November 25, 2003

Along with the advice to use temp tables, also try and avoid the row-at-a-time (RAT) mentality (procedural) and try and think set-based.  Oftentimes I've been able to re-write RAT queries to set-based with huge gains in performance *and* readability.

We’ve been so conditioned to write code in a procedural fashion (with loops and such) that it carries over into the relational model’s set-theoretic fashion which disastrous results.  Personally, I’ve been working with DBMS products almost as long as I have been programming, so I have no problem making the conceptual switch.  Some I work with, however, still don’t ‘get’ set-based thinking even after several years.

MR
Tuesday, November 25, 2003

When working on a large lump of code, I usually make a TODO list. That's not very original, but it works. When you work on a large scale change, you almost always have ideas popping into your mind as you go and keeping your grip at that particular change you are trying to make can be quite difficult. That's what I use the todo list for. Everything I think I should think about later I write down. I a digitized format, because pieces of paper gets lost.
It perfectly alright for a todo list such as this grow to enormous proportions. Once a day or a few days you'll have to sort through the thing, maybe categorize it a little, decide on preferences and put some things on the not-to-do list.
Using this technique you can track your progress, remember each morning where you left off the evening before and have a better understanding of how much work you have left - for your and your manager's benefit.
It is very tempting to use Notepad for this task, but it's usually the wrong tool. The second you have a need to mark things, create hierarchies of todo's and do some general mingling, you need something better. I use wordpad for this.

Eli Golovinsky
Tuesday, November 25, 2003

With 1000 line procedures, there are probably code clumps for "init", "do it", and "cleanup". Try to refactor these code clumps into new procedures. You'll probably need a "wide" interface to pass many in and out parameters. Fortunately, once you've mapped these data relationships, you can find more common or redundant code.

runtime
Tuesday, November 25, 2003

Often SQL stored procedures are written with a "procedural" mindset, i.e. looping through a couple of cursors. The first question I would ask is can I re-implement this using a more relational aprroach. Perhaps the author used a cursor because there were some calculations too complex to be put into a SELECT clause easily, if so, put these into seperate functions. Good luck and have fun.

John Ridout
Wednesday, November 26, 2003

Check to see if the data returned from the 2 cursors are joinable, so you can get the data from a single cursor instead. That would give only a one time loop over the result set.

Patrik
Wednesday, November 26, 2003

Thanks for the help everyone.  It turns out that I can get rid of a *lot* of code.  I mapped out the steps the code took, and mapped it out on a white board.  I found that I could duplicate the functionality with about 50 - 100 lines of code.  The key was just looking at the data relationally, instead of procedurally.

You guys were a huge help.  Thanks again!

nathan
Wednesday, November 26, 2003

Yet another example of why lines of code are not a good measure of productivity, although they are useful for measuring cost.

NoName
Wednesday, November 26, 2003

Instead of cursors or temp tables think about using table variables (SQL Server supports them, not sure about other RDBMs).  They are "in memory" table structures and can be great cursor and temp table substitutes.

Other than that, I agree with others... build some unit tests so you can constantly monitor that it's working correctly.  You'll likely need a set of test data to work on too, just insert and delete the data as part of the unit test.

chris
Wednesday, November 26, 2003

*  Recent Topics

*  Fog Creek Home