Understanding "On Finding Duplication and Near-Duplication in Large Software Systems"

This paper introduces a program called "dup" that identifies instances of duplication or near-duplication in large software systems. Dup detects both exact matches and parameterized matches, in which sections of code are identical except for systematic substitutions of variable names and constants. The tool is text-based and line-based, and it ignores comments and white space. It processes code written in C and can be extended to other languages. The paper defines maximal parameterized matches and describes how dup adapts them to find interesting duplication. A postprocessor analyzes these matches further, providing statistics and plots and estimating the percentage of lines that could be eliminated if the code were rewritten.
Experiments on the X Window System and a production subsystem show that dup can locate a significant amount of duplication, with the longest match being 2585 non-commentary lines. The tool is efficient, running in time linear in the input size, and can handle systems with millions of lines of code. The paper also discusses the visualization of results and potential future work, including improving the handling of repetitive code and developing a graphical user interface.
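
To make the idea of a parameterized match concrete, the following Python sketch checks whether two equal-length sections of code are identical except for a consistent, one-to-one substitution of identifiers and constants. It is an illustration only, not dup's algorithm: the names `tokenize`, `is_parameter`, and `parameterized_match`, the simplified tokenizer, and the short keyword list are all assumptions made for this example, and the pairwise comparison shown here does not attempt to reproduce the efficiency reported for dup.

```python
import re

# Hypothetical, simplified tokenizer: identifiers, integer constants,
# or any single non-space character. White space is dropped.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

# Assumed (incomplete) keyword list; keywords must match exactly.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}


def tokenize(line):
    """Split one line of code into tokens, ignoring white space."""
    return TOKEN.findall(line)


def is_parameter(tok):
    """Treat non-keyword identifiers and integer constants as parameters."""
    looks_like_name = tok[0].isalpha() or tok[0] == "_"
    return (looks_like_name or tok.isdigit()) and tok not in KEYWORDS


def parameterized_match(section_a, section_b):
    """Return the parameter substitution if the sections match, else None.

    The sections match when every line has the same token structure and
    the parameters of section_a map one-to-one onto those of section_b.
    """
    if len(section_a) != len(section_b):
        return None
    forward, backward = {}, {}
    for line_a, line_b in zip(section_a, section_b):
        toks_a, toks_b = tokenize(line_a), tokenize(line_b)
        if len(toks_a) != len(toks_b):
            return None
        for a, b in zip(toks_a, toks_b):
            if is_parameter(a) and is_parameter(b):
                # The substitution must be consistent in both directions.
                if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
                    return None
            elif a != b:
                # Non-parameter tokens (operators, keywords) must be identical.
                return None
    return forward


first = ["x = y + 1;", "if (x > max) max = x;"]
second = ["a = b + 2;", "if (a > top) top = a;"]
print(parameterized_match(first, second))
```

In this example the two sections match under the substitution x to a, y to b, 1 to 2, and max to top. Changing the second line of `second` to `if (b > top) top = a;` breaks the match, because `x` would then have to stand for both `a` and `b`.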

On Finding Duplication and Near-Duplication in Large Software Systems

1995 | Brenda S. Baker