Copied and pasted code is usually bad

But it can be hard to find, especially in a large project. So we wrote a utility, CPD (the Copy/Paste Detector), to find it for us. It's been through three major incarnations:

  • First we wrote it using a variant of Michael Wise's Greedy String Tiling algorithm (our variant is described here)
  • Then it was completely rewritten by Brian Ewins using the Burrows-Wheeler transform
  • Finally, it was rewritten by Steve Hawkins to use the Karp-Rabin string matching algorithm

Each rewrite made it much faster, and now it can process the JDK java.* packages in about 4 seconds (on my Linux workstation, at least).
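
To give a feel for why a Karp-Rabin style approach is fast, here is a minimal sketch of the general idea: hash every window of N tokens with a rolling hash, so each step costs constant time, and compare windows token by token only when their hashes collide. This is not CPD's actual code. The class name RollingHashSketch, the method findDuplicateWindows, and the BASE and MOD constants are all made up for illustration, and tokens are simplified to plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of CPD or PMD.
final class RollingHashSketch {
    private static final long BASE = 31L;            // assumed hash base
    private static final long MOD = 1_000_000_007L;  // assumed prime modulus

    /** Returns pairs {earlierStart, laterStart} of identical token windows. */
    public static List<int[]> findDuplicateWindows(List<String> tokens, int window) {
        List<int[]> matches = new ArrayList<>();
        int n = tokens.size();
        if (window <= 0 || n < window) {
            return matches;
        }

        // BASE^(window-1) mod MOD: the weight of the token that leaves the window.
        long highPower = 1;
        for (int i = 1; i < window; i++) {
            highPower = (highPower * BASE) % MOD;
        }

        Map<Long, List<Integer>> startsByHash = new HashMap<>();
        long hash = 0;
        for (int i = 0; i < n; i++) {
            if (i >= window) {
                // Slide the window: drop the contribution of the oldest token.
                long leaving = tokenValue(tokens.get(i - window));
                hash = (hash - leaving * highPower % MOD + MOD) % MOD;
            }
            // Shift the remaining tokens up one position and append the new one.
            hash = (hash * BASE + tokenValue(tokens.get(i))) % MOD;

            if (i >= window - 1) {
                int start = i - window + 1;
                List<Integer> earlier =
                        startsByHash.computeIfAbsent(hash, k -> new ArrayList<>());
                for (int prev : earlier) {
                    // Equal hashes can be collisions, so confirm token by token.
                    if (tokens.subList(prev, prev + window)
                            .equals(tokens.subList(start, start + window))) {
                        matches.add(new int[] {prev, start});
                    }
                }
                earlier.add(start);
            }
        }
        return matches;
    }

    // Maps a token to a non-negative value below MOD.
    private static long tokenValue(String token) {
        return Math.floorMod((long) token.hashCode(), MOD);
    }
}
```

The window size here plays roughly the same role as a minimum duplicate length: shorter repeats are never reported. A real detector would also need to tokenize source files and merge overlapping windows into longer matches, which this sketch leaves out.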

Here's a screenshot of CPD after running on the JDK java.lang package.

Note that CPD works with Java, C, C++, and PHP code.

If you have Java Web Start, you can run CPD by clicking here.

Here are the duplicates CPD found in the JDK 1.4 source code.

Here are the duplicates CPD found in the APACHE_2_0_BRANCH branch of Apache (just the httpd-2.0/server/ directory).

Andy Glover wrote an Ant task for CPD; here's how to use it:


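<target name="cpd">
    <!-- Register the CPD task; the PMD jar must be on Ant's classpath -->
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
    <!-- Report duplicates of at least 100 tokens to the given output file -->
    <cpd minimumTokenCount="100" outputFile="/home/tom/cpd.txt">
        <!-- Scan every Java file under this directory -->
        <fileset dir="/home/tom/tmp/ant">
            <include name="**/*.java"/>
        </fileset>
    </cpd>
</target>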

       

You can also get verbose output from this task by running Ant with the -v flag, for example: ant -v -f mybuildfile.xml cpd

There's also a JavaSpaces version available for splitting the CPD effort across a farm of machines. I usually post news on that here, and the releases are here. The JavaSpaces version is pretty much dead, though, since the current CPD code is fast enough to run on a single machine.

Suggestions? Comments? Post them here. Thanks!