CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision

2024 | Hao Wang, Zeyu Gao, Chao Zhang, Zihan Sha, Mingyang Sun, Yuchen Zhou, Wenyu Zhu, Wenju Sun, Han Qiu, Xi Xiao
CLAP is a method that uses natural language supervision to learn transferable representations of binary code, in particular assembly code. It aligns binary code with natural-language explanations of that code to improve transfer learning, and a dataset engine automatically produces the large, diverse training set this requires: 195 million pairs of binary code and corresponding explanations, on which a prototype of CLAP is trained. Evaluated on a range of downstream binary-analysis tasks, CLAP performs strongly even without task-specific training and remains competitive with a fully supervised baseline; the pre-trained model and code are released for research purposes.

Binary code representation is essential to tasks such as binary code similarity detection, function prototype inference, malware classification, and reverse engineering, and representation learning has delivered strong performance on these tasks. Existing techniques embed binary code as continuous vectors using direct modeling of raw bytes, graph modeling, or sequence modeling, but they often lose critical information such as function-call parameters, strings, and variable names, and they transfer poorly, especially in few-shot and zero-shot scenarios. Large language models show extraordinary source-code understanding yet are notably less proficient with assembly code.

CLAP addresses these gaps by first pre-training an assembly encoder and then using a contrastive learning strategy to align it with a text encoder over the generated code-explanation pairs, yielding embeddings that better capture the semantics of binary code. The paper thus asks whether a single model can be trained for binary analysis whose knowledge transfers effectively to varied tasks with limited or no task-specific data. Evaluations on binary code similarity detection, crypto-related function identification, and protocol categorization show that CLAP outperforms existing state-of-the-art solutions even without further training, demonstrating excellent transferability.
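The alignment objective described above is, in essence, a CLIP-style symmetric contrastive loss between the two encoders. The sketch below is a minimal PyTorch illustration, not the paper's implementation: the encoders are random stand-ins (names like `encode_asm` and `encode_text` are hypothetical, whereas the real CLAP encoders are trained models), so only the loss computation and the zero-shot prompting pattern reflect the approach the summary describes.

```python
# Minimal sketch of CLIP-style contrastive alignment between an assembly
# encoder and a text encoder, plus the zero-shot prompting pattern.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(asm_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of asm_emb and row i of text_emb come from the same
    (assembly function, natural-language explanation) pair.
    """
    asm_emb = F.normalize(asm_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = asm_emb @ text_emb.t() / temperature    # pairwise cosine similarity
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)      # assembly -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)  # text -> assembly direction
    return (loss_a2t + loss_t2a) / 2

# --- stand-in encoders (hypothetical; the real CLAP encoders are learned) ---
def encode_asm(functions: list[str], dim: int = 256) -> torch.Tensor:
    return torch.randn(len(functions), dim)   # placeholder embeddings

def encode_text(sentences: list[str], dim: int = 256) -> torch.Tensor:
    return torch.randn(len(sentences), dim)   # placeholder embeddings

if __name__ == "__main__":
    # Training-step sketch: a batch of paired assembly snippets and explanations.
    asm_batch = ["push rbp ; mov rbp, rsp ; ...", "xor eax, eax ; ret"]
    txt_batch = ["sets up a stack frame", "returns zero"]
    loss = contrastive_alignment_loss(encode_asm(asm_batch), encode_text(txt_batch))
    print(f"contrastive loss: {loss.item():.4f}")

    # Zero-shot classification sketch (e.g. protocol categorization): embed
    # candidate labels as natural-language prompts, pick the nearest one.
    prompts = [f"this function implements the {p} protocol"
               for p in ("HTTP", "FTP", "SMTP")]
    func_emb = F.normalize(encode_asm(["recv ; parse header ; ..."]), dim=-1)
    label_emb = F.normalize(encode_text(prompts), dim=-1)
    best = (func_emb @ label_emb.t()).argmax().item()
    print("predicted:", prompts[best])
```

The zero-shot pattern at the end is what allows the aligned model to handle tasks such as crypto-related function identification without task-specific training: labels are phrased as natural-language prompts and matched against function embeddings by cosine similarity.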
The contributions include the introduction of CLAP, the development of the dataset engine, extensive experiments demonstrating CLAP's effectiveness, and the release of the model and code for research.