CLAP (Contrastive Language-Assembly Pre-training) is a method for learning transferable binary code representations using natural language supervision. The core idea is to align binary code with semantic explanations written in natural language, which yields richer embeddings for binary code. To make this feasible at scale, CLAP relies on an efficient dataset engine that automatically produces a large, diverse dataset of binary code paired with natural language explanations: package managers are used to compile source code into assembly, and GPT-3.5 generates the accompanying explanations.

The CLAP model is pre-trained in two stages. First, an assembly encoder is pre-trained to understand assembly code; then it is contrastively pre-trained under natural language supervision so that its embeddings align with those of a text encoder. This alignment substantially improves performance on downstream tasks such as binary code similarity detection (BCSD), crypto-related function identification, and protocol categorization.

Evaluations show that CLAP outperforms existing state-of-the-art solutions even without task-specific training or fine-tuning, and its zero-shot and few-shot results highlight strong transfer learning performance. More broadly, CLAP's ability to support zero-shot learning makes it useful in complex scenarios where traditional data-driven methods fall short.
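To make the contrastive alignment stage concrete, the sketch below shows a CLIP-style symmetric InfoNCE objective over a batch of (assembly, explanation) embedding pairs. This is a minimal illustration under assumed inputs, not CLAP's actual implementation; the embedding tensors and the temperature value are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(asm_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over paired (assembly, explanation) embeddings.

    asm_emb, text_emb: [batch, dim] outputs of the assembly and text encoders.
    Matching pairs share a row index; all other rows in the batch act as negatives.
    """
    asm_emb = F.normalize(asm_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities, scaled by temperature.
    logits = asm_emb @ text_emb.t() / temperature            # [batch, batch]
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: assembly -> text and text -> assembly.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2
```

Because matching pairs are pulled together and mismatched pairs pushed apart within each batch, the assembly encoder's embedding space ends up comparable with the text encoder's, which is what enables the natural-language-driven downstream use described above.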
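The zero-shot usage mentioned above reduces to embedding a binary function once and ranking natural language class prompts by cosine similarity. The snippet below sketches this under the assumption of pre-trained `asm_encoder` and `text_encoder` callables that return embedding tensors; the prompt wording is purely illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(asm_tokens, class_prompts, asm_encoder, text_encoder):
    """Rank natural language class prompts against one assembly function by cosine similarity."""
    asm_emb = F.normalize(asm_encoder(asm_tokens), dim=-1)        # [1, dim]
    txt_emb = F.normalize(text_encoder(class_prompts), dim=-1)    # [num_classes, dim]
    scores = (asm_emb @ txt_emb.t()).squeeze(0)                   # [num_classes]
    probs = scores.softmax(dim=-1)
    return sorted(zip(class_prompts, probs.tolist()), key=lambda pair: -pair[1])

# Illustrative prompts for crypto-related function identification (hypothetical wording):
# prompts = ["This function implements the AES block cipher",
#            "This function parses a network protocol message",
#            "This function performs string formatting"]
# zero_shot_classify(tokens, prompts, asm_encoder, text_encoder)
```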