April 14–20, 2024, Lisbon, Portugal | Ling Jiang, Junwen An, Huihui Huang, Qiyi Tang, Sen Nie, Shi Wu, Yuqun Zhang
BinaryAI is a novel binary-to-source Software Composition Analysis (SCA) technique for identifying third-party libraries (TPLs) in binary files. Existing binary-to-source SCA techniques often rely on basic syntactic features that are redundant and lack robustness on large-scale TPL datasets, which leads to false positives and reduced recall. BinaryAI addresses these limitations with a two-phase binary source code matching approach built on function-level embeddings from a transformer-based model. In the first phase, the transformer model is trained to produce function-level embeddings, which are used to retrieve similar source functions for each binary function.
The second phase uses link-time locality and the function call graph to identify the exact source function among the retrieved top-k candidates, which improves the accuracy of function matching. Experimental results show that BinaryAI outperforms state-of-the-art models such as CodeCMR and existing binary-to-source SCA tools in both binary source code matching and downstream SCA tasks, achieving higher precision and recall.
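The two phases can be illustrated with a toy sketch. This is not BinaryAI's implementation: the embeddings are hand-written vectors rather than transformer outputs, all function and file names are hypothetical, and the second phase is reduced to a simple neighbour vote that stands in for link-time locality (the function call graph is not modelled). It assumes that binary functions are listed in address order and that functions from the same source file tend to sit next to each other.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, source_embeddings, k):
    """Phase 1 (sketch): rank source functions by similarity to one binary function."""
    ranked = sorted(
        source_embeddings,
        key=lambda name: cosine(query, source_embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


def refine_by_locality(candidates_per_function, source_file_of, window=1):
    """Phase 2 (simplified stand-in): pick, from each top-k list, the candidate
    whose source file is most common among the top-1 files of adjacent binary
    functions. Ties keep the embedding ranking."""
    top1_files = [source_file_of[c[0]] for c in candidates_per_function]
    matches = []
    for i, cands in enumerate(candidates_per_function):
        lo, hi = max(0, i - window), min(len(top1_files), i + window + 1)
        neighbour_files = Counter(top1_files[lo:i] + top1_files[i + 1:hi])
        # max() returns the first maximal item, so earlier-ranked candidates win ties.
        matches.append(max(cands, key=lambda c: neighbour_files[source_file_of[c]]))
    return matches


# Hypothetical source functions with toy embeddings and their source files.
source_embeddings = {
    "parse_header": [1.0, 0.0, 0.0],
    "write_chunk": [0.0, 1.0, 0.0],
    "checksum_a": [0.0, 0.0, 1.0],
    "checksum_b": [0.1, 0.0, 1.0],
}
source_file_of = {
    "parse_header": "liba/io.c",
    "write_chunk": "liba/io.c",
    "checksum_a": "liba/io.c",
    "checksum_b": "libb/crc.c",
}

# Toy embeddings of three binary functions, listed in address order.
binary_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.0, 1.0], [0.1, 0.9, 0.0]]

candidates = [top_k(q, source_embeddings, k=2) for q in binary_embeddings]
matches = refine_by_locality(candidates, source_file_of)
print(candidates)
print(matches)
```

In this toy data the middle binary function is slightly closer to `checksum_b` in embedding space, but both of its neighbours resolve to `liba/io.c`, so the refinement step selects `checksum_a` from the top-2 candidates instead.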