On Finding Duplication and Near-Duplication in Large Software Systems

1995 | Brenda S. Baker
This paper describes dup, a program that helps locate duplication and near-duplication in large software systems. Dup finds maximal sections of code that are either textually identical or identical except for a systematic, global substitution of parameter names, such as variable names and constants. It is text-based and line-based, and it ignores comments and whitespace. Further processing can find longer sections that were copied and then modified locally. For efficiency, dup relies on a parameterized suffix tree.

Dup has been applied to millions of lines of code from two large systems and found significant duplication in both. In the X Window System, it found 2487 matches of at least 30 lines, involving 19% of the code, and estimated that 12% of the code could be eliminated by rewriting. In a 1.1-million-line production system, it found 5550 matches involving 20% of the code, with 13% potentially eliminable. Dup is also fast: it processes a million lines in seven minutes on a 40 MHz R3000 processor.

These results show that duplication is common in large systems and is often created by copying and editing. Dup helps identify such duplication, which can then be replaced by procedures or loops. Applications include identifying code to replace with procedures, eliminating duplication during reengineering, redocumentation, and debugging. A postprocessor generates statistics, plots, and profiles to help manage and reengineer systems, and visualizations such as scatter plots help reveal duplication patterns. Dup's output can also point to code that could be rewritten using arrays and loops. Dup's parameterized approach is more effective at locating duplication than exact matching alone.
Dup can identify code that was copied and then modified, even when parameters were changed. Its efficiency and effectiveness make it a valuable tool for managing, maintaining, and reengineering large systems. Further research is needed to improve duplication detection and the factorization of repetitive code.
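
The idea of a parameterized match described above can be illustrated with a short Python sketch. This is not dup's implementation: dup uses a parameterized suffix tree and reports maximal matches, whereas the sketch below simply hashes fixed-size windows of lines. The function names (`significant_lines`, `canonical`, `p_match`, `find_p_matches`), the tiny keyword set, and the single-line C-style comment handling are all assumptions made for illustration.

```python
import re

# Assumptions for this sketch: C-like input, comments confined to one line,
# and a tiny keyword set. None of this is taken from dup itself.
COMMENT = re.compile(r"/\*.*?\*/|//.*")      # single-line C-style comments only
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")   # identifiers, integers, other symbols
PARAM = re.compile(r"[A-Za-z_]\w*|\d+")      # tokens eligible for substitution
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}


def significant_lines(lines):
    """Return (original_line_number, tokens) for every line that still has
    content once comments and whitespace are removed."""
    out = []
    for number, line in enumerate(lines, start=1):
        toks = TOKEN.findall(COMMENT.sub("", line))
        if toks:
            out.append((number, toks))
    return out


def canonical(token_lines):
    """Replace each parameter by the index of its first occurrence, so that
    sections differing only by a consistent renaming get the same form."""
    first_seen = {}
    result = []
    for toks in token_lines:
        row = []
        for tok in toks:
            if PARAM.fullmatch(tok) and tok not in KEYWORDS:
                row.append(("P", first_seen.setdefault(tok, len(first_seen))))
            else:
                row.append(("T", tok))
        result.append(tuple(row))
    return tuple(result)


def p_match(section_a, section_b):
    """True if the two sections are identical up to a consistent renaming."""
    form_a = canonical([toks for _, toks in significant_lines(section_a)])
    form_b = canonical([toks for _, toks in significant_lines(section_b)])
    return form_a == form_b


def find_p_matches(lines, min_lines=2):
    """Report (line_i, line_j) pairs where non-overlapping windows of
    min_lines significant lines starting there match up to renaming."""
    sig = significant_lines(lines)
    seen = {}
    matches = []
    for start in range(len(sig) - min_lines + 1):
        window = sig[start:start + min_lines]
        key = canonical([toks for _, toks in window])
        for earlier in seen.get(key, []):
            if earlier + min_lines <= start:   # skip overlapping windows
                matches.append((sig[earlier][0], window[0][0]))
        seen.setdefault(key, []).append(start)
    return matches


source = """\
x = a + 1;   // first copy
y = x * a;

total = 0;
u = b + 7;   /* second copy, renamed */
v = u * b;
""".splitlines()

print(find_p_matches(source, min_lines=2))   # prints [(1, 5)]
```

Lines 1-2 and 5-6 differ only by a consistent renaming (`x` to `u`, `a` to `b`, `y` to `v`, `1` to `7`), so the sketch reports them as a match. An inconsistent renaming, such as `x = a + x;` against `u = b + b;`, is rejected because the two canonical forms differ.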