Source Code Clone Detection Using Unsupervised Similarity Measures

This paper presents a comparative analysis of unsupervised similarity measures for source code clone detection. The goal is to evaluate current state-of-the-art techniques, identify their strengths and weaknesses, and guide software engineers in selecting appropriate methods for their specific use cases. The study evaluates a range of unsupervised similarity measures, including token comparison, code embeddings, function call comparison, and graph-based methods, among others, on a benchmark dataset of code fragments with varying degrees of similarity. The analysis focuses on practical applicability and efficiency. The results show that several measures could serve as valid tools for source code clone detection: some, such as Output Analysis, achieve high accuracy but are computationally expensive, while others, such as Jaccard, N-grams, Winnow, and RKR-GST, offer a reasonable balance of accuracy and execution time. The study also highlights the importance of unsupervised measures in software engineering tasks such as clone detection and code search and recommendation. It concludes that unsupervised similarity measures are essential for source code clone detection, that further research is needed to improve their effectiveness and efficiency, and it identifies promising directions for future research in source code similarity assessment.
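To make the token-based measures named above more concrete, the following is a minimal sketch of Jaccard similarity over token sets and over token n-grams. It is not the paper's implementation: the regex tokenizer, the function names (`tokenize`, `jaccard`, `ngram_jaccard`), and the default `n=3` are assumptions chosen for illustration only.

```python
import re


def tokenize(code: str) -> list[str]:
    # Assumed, language-agnostic tokenizer: identifiers/keywords, integer
    # literals, and any other single non-whitespace character.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def _jaccard_sets(a: set, b: set) -> float:
    # Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard(code_a: str, code_b: str) -> float:
    # Similarity of the sets of distinct tokens in each fragment.
    return _jaccard_sets(set(tokenize(code_a)), set(tokenize(code_b)))


def ngram_jaccard(code_a: str, code_b: str, n: int = 3) -> float:
    # Similarity of the sets of consecutive token n-grams, which keeps
    # some local ordering information that plain token sets discard.
    def ngrams(tokens: list[str]) -> set:
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    return _jaccard_sets(ngrams(tokenize(code_a)), ngrams(tokenize(code_b)))


if __name__ == "__main__":
    original = "int sum(int a, int b) { return a + b; }"
    renamed = "int add(int x, int y) { return x + y; }"
    print(jaccard(original, renamed))
    print(ngram_jaccard(original, renamed))
```

Because neither function needs training data or labels, a threshold on the score is enough to flag candidate clones, which is what makes measures of this kind cheap to run. The trade-off is visible in the example: renaming identifiers lowers the score even though the two fragments behave identically.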

6 Feb 2024 | Jorge Martinez-Gil