Fog Creek Software
Discussion Board

Detecting duplicate code

Does anyone have any recommendations for, or experience with, a utility to detect duplicate source code? In this case I have a very large volume of T-SQL that I would like to scan for recurring blocks (from manual analysis I know this happens frequently, where sections of code are copy-pasted). Barring any sort of automatic utility, I will strip all non-logic whitespace (i.e. leading and trailing tabs and spaces) from a scripted output of all objects and then... I'm not even sure. Create an array of the CRC32 of each line of the 560,000-line SQL output file and then scan for duplicated integer sequences? Seems like an NP problem.
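
A rough sketch of the CRC32 idea in Python, for illustration only. The function name find_repeated_blocks and the window parameter are made up here, and the sketch assumes the scripted output has already been read into a list of lines. It hashes each stripped line and records where every run of `window` consecutive hashes starts.

```python
import zlib
from collections import defaultdict


def find_repeated_blocks(lines, window=3):
    """Return lists of 1-based line numbers where the same run of lines starts.

    Hypothetical helper: `window` is the number of consecutive non-blank
    lines that must match for a run to count as a duplicate.
    """
    # Strip leading/trailing whitespace and drop blank lines, but remember
    # the original line numbers so hits can be looked up in the file.
    numbered = [(n, line.strip()) for n, line in enumerate(lines, start=1)]
    numbered = [(n, text) for n, text in numbered if text]
    hashes = [zlib.crc32(text.encode("utf-8")) for _, text in numbered]

    # Index every run of `window` consecutive hashes by its starting line.
    seen = defaultdict(list)
    for i in range(len(hashes) - window + 1):
        seen[tuple(hashes[i:i + window])].append(numbered[i][0])
    return [starts for starts in seen.values() if len(starts) > 1]
```

For a fixed window this is one pass over the file with a dictionary, so the work grows roughly linearly with the number of lines. CRC32 values can collide, so any reported run should still be checked against the actual text.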


Dennis Forbes
Tuesday, June 15, 2004

Best bet is probably something like the repeated-substring compression used in a zip file.

Peter Ibbotson
Tuesday, June 15, 2004

Hmm, that is an interesting problem. You might try looking at the query plans and seeing if there are duplicates. For example, you can write a query two different ways which accomplishes the same thing. Hopefully (if your query optimizer is good) the query plans will be the same, so you can compare those (or bits and pieces).

I don't know which is better -- sounds like six of one, half a dozen of the other.

Captain McFly
Tuesday, June 15, 2004

That would be a tough one. I tried looking for refactoring tools and duplicate text searches, but didn't find much.

You will probably have to roll your own. 

Like you, I would probably remove the extra whitespace from each line. Dump those lines into a table with line #, proc name, line text and group by line text to get an idea of any duplicate lines. From that listing you would still have to manually look for identical blocks but at least it would narrow your search.

You could also probably take that further and create some more queries that tell you exactly how many lines are duplicated in each proc.
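
One way this could look, as a sketch only: the example below uses Python's sqlite3 module with an in-memory table instead of SQL Server, and the table name code_line, its columns, and the function duplicate_lines are all hypothetical. It loads the trimmed lines of each proc and groups by line text.

```python
import sqlite3


def duplicate_lines(procs):
    """procs maps proc name -> source text.

    Returns (line_text, occurrences, number_of_procs) for every line that
    appears more than once, most frequent first.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE code_line (proc_name TEXT, line_no INTEGER, line_text TEXT)"
    )
    for name, source in procs.items():
        # Trim each line and skip blanks before loading it into the table.
        rows = [
            (name, n, text.strip())
            for n, text in enumerate(source.splitlines(), start=1)
            if text.strip()
        ]
        db.executemany("INSERT INTO code_line VALUES (?, ?, ?)", rows)
    result = db.execute(
        """
        SELECT line_text, COUNT(*) AS occurrences,
               COUNT(DISTINCT proc_name) AS procs
        FROM code_line
        GROUP BY line_text
        HAVING COUNT(*) > 1
        ORDER BY occurrences DESC, line_text
        """
    ).fetchall()
    db.close()
    return result
```

The same GROUP BY / HAVING query should carry over to T-SQL with little change. Very common lines such as GO or BEGIN will top the list, so they probably need filtering out before the results are useful.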


Tuesday, June 15, 2004

It would be my guess that any duplicate code detecting program would have to take into account the syntax of the language and detect the duplicate code by examining certain language-specific blocks like functions, statements, modules, files etc.

Dave B.
Tuesday, June 15, 2004

Don't know if this can be done in T-SQL...

You would probably get better results if you first ran a code formatting routine (like the Source -> Format option in Eclipse).

Tuesday, June 15, 2004

Dennis, ESR wrote some code to do this to find comparisons between Sys V UNIX and Linux.

Hope that helps...

Andrew Hurst
Tuesday, June 15, 2004

See if there are any refactoring tools for your given language.  This is what they are built to do.

Tuesday, June 15, 2004

Here's a quick and super dirty way. Put the code into a db and sort it.

Tuesday, June 15, 2004

Pick up the July 2004 issue of Dr Dobb's. It has an article, "Detecting Source-Code Plagiarism" by Bob Zeidman which addresses this problem from a different point of view.

Eric Pearl
Tuesday, June 15, 2004

Funny you should mention this. I'm in the process right now of rolling my own code to do something similar.

My company has its own functional programming language (with XML syntax) for configuring our server software. My code looks for redundant bits of functional code and then creates (essentially) a function call for the redundant pieces of code. It also finds bits of code that are *SIMILAR* to other bits of code and prompts the user with a few different choices for how to abstract away the similarities.

By the end of the day, I'm hoping to have a tool that can reduce 181,000 lines of source code down to about 100,000 (that's my goal, anyhow).

My suggestion for building your tool would be to write a function that can create a canonical form for each query. Develop a well-defined order for all fields in the SELECT clause, and then sort all of the fields using that well-defined order. Then create a well-defined order for the elements of your WHERE clause.

Etc, etc, etc.

Depending on how complex your queries are, you might have a very difficult problem on your hands. (For example, how do you determine whether a particular LEFT JOIN produces the same output as a query that does all of its "joining" in the WHERE clause?)
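
To make the canonical-form idea concrete, here is a deliberately naive Python sketch. The function canonicalize is hypothetical, and it assumes flat queries of the shape SELECT ... FROM ... [WHERE ...] with no subqueries, no commas inside function calls, and case-insensitive comparison. Anything more complex would need a real SQL parser.

```python
import re


def canonicalize(query):
    """Return a canonical string for a simple SELECT query.

    Assumptions (not a general solution): single flat SELECT, column list
    split safely on commas, WHERE made only of AND-ed conditions.
    """
    # Collapse whitespace and fold case so layout differences disappear.
    text = " ".join(query.split()).upper()
    match = re.fullmatch(r"SELECT (.+?) FROM (.+?)(?: WHERE (.+))?", text)
    if match is None:
        return text  # not a shape this sketch understands; leave it alone
    columns, source, where = match.groups()

    # Well-defined order for the SELECT list: plain alphabetical.
    columns = ", ".join(sorted(c.strip() for c in columns.split(",")))
    result = f"SELECT {columns} FROM {source}"

    if where:
        # Reordering is only safe for a pure AND chain; OR and BETWEEN
        # would change meaning or break the split, so keep those as-is.
        if " OR " in where or " BETWEEN " in where:
            result += " WHERE " + where
        else:
            conditions = sorted(c.strip() for c in where.split(" AND "))
            result += " WHERE " + " AND ".join(conditions)
    return result
```

Two queries that differ only in layout, case, column order, or the order of AND-ed conditions then compare equal as strings. It does nothing for the LEFT JOIN versus WHERE-clause case.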

Benji Smith
Tuesday, June 15, 2004

How about removing whitespace, concatenating multi-line statements into single lines, importing into Excel, sorting, and then using the subtotals function to let you know what's the same?

Tuesday, June 15, 2004

The latest issue of Dr. Dobb's Journal (July 2004) has an article on this on page 57.

T. Norman
Tuesday, June 15, 2004

Here is a possible (untried) approach

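Read each code line into an array of structures as shown below:

/* One entry per source line read from the file. */
typedef struct {
    unsigned int line_number;  /* line the code was read from */
    char code_line[256];       /* text of that line */
} CodeStructure;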

line_number is the line you read the code from and code_line is the code. For better results, remove spaces from the code lines before storing them and convert them to uppercase.

Now sort the array based on code_line.

Remove all elements from the array which do not match anything else. Now you only have those lines which occur at least twice in your code.

Now take the line numbers of each of the two or more matches and scan for the next line number of each of them in the array. If the code is duplicated, you will find that the next line numbers of each of the earlier matches also form a "clump".

E.g. if lines 1 and 10 both contain "delete sometable", then lines 2 and 11 should also match. So you search the array for line number 2 and see whether the entry next to it has line number 11; if it does *and* the code_line for 2 and 11 matches, you have a code match based on two lines. To match more lines, you just have to increase the scan length.
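
The approach is described as untried, so here is one possible reading of it in Python rather than C, as a sketch only. The names normalize and find_clumps and the min_length parameter are hypothetical. Instead of physically sorting and pruning an array, it groups line numbers by normalized text, which gives the same "lines that match something else" sets, and then extends each pair of matches forward.

```python
from collections import defaultdict
from itertools import combinations


def normalize(line):
    # As suggested above: remove spaces and convert to uppercase.
    return "".join(line.split()).upper()


def find_clumps(lines, min_length=2):
    """Return (first_start, second_start, length) for duplicated runs.

    Line numbers are 1-based. Blank lines never match anything.
    """
    code = {n: normalize(text) for n, text in enumerate(lines, start=1)}

    # Group line numbers by normalized text; only groups with two or more
    # entries produce pairs below, so unmatched lines drop out naturally.
    groups = defaultdict(list)
    for n, text in code.items():
        if text:
            groups[text].append(n)

    def same(a, b):
        return a in code and b in code and code[a] != "" and code[a] == code[b]

    clumps = []
    for numbers in groups.values():
        for a, b in combinations(numbers, 2):
            if same(a - 1, b - 1):
                continue  # part of a run that starts on an earlier line
            # Extend the match line by line, without letting the first
            # run overlap the start of the second.
            length = 1
            while a + length < b and same(a + length, b + length):
                length += 1
            if length >= min_length:
                clumps.append((a, b, length))
    return sorted(clumps)
```

With the example above (lines 1 and 10, then 2 and 11), this reports a single clump starting at lines 1 and 10 with length 2. A line that occurs a great many times produces a quadratic number of pairs, so very common lines may need special handling on a large code base.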

Code Monkey
Tuesday, June 15, 2004

I was thinking of doing something like this too for the codebase I am working on.

I think your best bet is actually to use some sort of diff program and find the longest common subsequences (you can probably come up with a heuristic to ignore trivial changes like variable names).  This is my plan for when I have time.

I was planning to use python diffutils.  The advantage is that you don't have to diff line by line; you can diff token by token.

And you would use a combination of automatic diffing through sets of code, or just diffing individual sets that you know are likely to have a lot of duplication.
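
A minimal sketch of token-by-token diffing, assuming Python's standard difflib module is the kind of library meant here. The tokenizer regex, the function names, and the min_tokens threshold are all made up for illustration. Note that difflib's SequenceMatcher finds long matching blocks, which is not strictly a longest-common-subsequence algorithm.

```python
import difflib
import re

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(source):
    """Split source into tokens, ignoring whitespace and case."""
    return [t.upper() for t in TOKEN.findall(source)]


def common_runs(source_a, source_b, min_tokens=5):
    """Return token runs of at least `min_tokens` that appear in both sources."""
    a, b = tokenize(source_a), tokenize(source_b)
    # autojunk=False stops difflib from ignoring very frequent tokens
    # in long inputs, which would hide matches in repetitive code.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_tokens
    ]
```

To ignore trivial changes like variable names, one option would be to have tokenize replace every identifier with a placeholder token before matching, at the cost of more false positives.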

Tuesday, June 15, 2004

Just curious, not trolling. What's driving you to do this?

Programmers often want to "clean up" code. But IMHO the effort rarely pays. You'll end up spending weeks or months coding, debugging, testing, etc. and if everything goes well you'll end up with what you have today. We used to call that "polishing a turd".

On the other hand, if you need to make nontrivial changes anyway, why not redesign or at least refactor based on the requirements change?  I just don't see how a scrubbing of working code is cost effective.

Tom H
Tuesday, June 15, 2004

We have this tool running as part of our automated build:

Rhys Keepence
Tuesday, June 15, 2004

That is true, Tom.  It is usually not a good idea to mess with working code... but if you are refactoring due to a requirements change, then why not have some tool that can help you refactor?

I can think of a lot of cases in my current project where I will want to do this... I haven't done it yet because it is a working system, but the next cycle I am going to clean it up with the aid of a tool, because I know I will have to change that stuff a lot.

Tuesday, June 15, 2004

You'd be better asking on the refactoring mailing list, Dennis. There is a link from

Wednesday, June 16, 2004

"Just curious, not trolling. What's driving you to do this?"

It's more for example/training and metrics purposes than actually doing code changes - I know that this huge code base is partly huge because of extensively duplicated code, and I think there's a good lesson there in modularizing (because every fix has to be applied in multiple, or even dozens of, places).

Dennis Forbes
Wednesday, June 16, 2004
