code2vec: Learning Distributed Representations of Code


January 2019 | Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav
code2vec is a neural model that learns continuous distributed vectors ("code embeddings") for code snippets. It represents a snippet as a single fixed-length vector that can be used to predict semantic properties of the snippet. The approach decomposes code into paths in its abstract syntax tree (AST), learns atomic representations of these paths, and aggregates them with a path-based attention mechanism into a single code vector.

Trained on a dataset of 12M methods, the model predicts method names from code bodies with an improvement of over 75% over previous techniques. It also captures semantic similarities, combinations, and analogies between method names, demonstrated through vector arithmetic such as "receive is to send as download is to upload." Compared with previous work, the architecture shows significant improvements in generalization and space complexity, and the single-vector representation enables tasks such as automatic code review and API discovery. The model is available as an interactive online demo and on GitHub. The paper highlights the importance of code embeddings for a range of programming-language tasks.
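To make the AST-path decomposition concrete, here is a minimal sketch using Python's standard `ast` module. The paper extracts paths from Java ASTs with its own tooling; the function name `leaf_paths` and the use of Python syntax here are illustrative assumptions, not the authors' implementation. A path-context is a triple of two leaf tokens and the chain of AST node types connecting them through their lowest common ancestor.

```python
import ast

def leaf_paths(code):
    """Sketch of code2vec-style path-contexts over a Python AST
    (the paper uses Java ASTs; this is an illustrative analogy).
    Returns triples: (leaf token, AST path between leaves, leaf token)."""
    leaves = []  # (token, root-to-leaf chain of node-type names)

    def walk(node, ancestors):
        chain = ancestors + [type(node).__name__]
        # Treat identifier-bearing nodes (Name, arg) as leaves.
        token = getattr(node, "id", None) or getattr(node, "arg", None)
        if token is not None:
            leaves.append((token, chain))
            return
        for child in ast.iter_child_nodes(node):
            walk(child, chain)

    walk(ast.parse(code), [])

    contexts = []
    for i in range(len(leaves)):
        for j in range(i + 1, len(leaves)):
            (tok_a, chain_a), (tok_b, chain_b) = leaves[i], leaves[j]
            # Length of the shared prefix; chain[k-1] is the lowest
            # common ancestor of the two leaves.
            k = 0
            while k < min(len(chain_a), len(chain_b)) and chain_a[k] == chain_b[k]:
                k += 1
            # Walk up from leaf a to the LCA, then down to leaf b.
            path = tuple(chain_a[k - 1:][::-1] + chain_b[k:])
            contexts.append((tok_a, path, tok_b))
    return contexts

# For `def f(x): return x`, the two `x` leaves are connected through
# the enclosing FunctionDef node.
print(leaf_paths("def f(x):\n    return x"))
```

In the full model, each token and each distinct path is mapped to a learned embedding, and the concatenated triple is the "path-context" vector the network aggregates.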
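The path-based attention aggregation can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: the context matrix and global attention vector are random stand-ins for the embeddings and parameters the real model learns during training, and the dimensions are far smaller than in the paper.

```python
import numpy as np

def code_vector(contexts, attention):
    """Soft attention over path-context vectors, as described for code2vec:
    scores are dot-products with a learned global attention vector,
    weights are their softmax, and the code vector is the weighted sum."""
    scores = contexts @ attention              # one score per path-context
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ contexts                  # fixed-length code vector

# Toy stand-ins: 5 path-contexts of dimension 8 (learned in the real model).
rng = np.random.default_rng(0)
ctx = rng.normal(size=(5, 8))
a = rng.normal(size=8)
v = code_vector(ctx, a)                        # v.shape == (8,)
```

Because the output is a convex combination of the context vectors, snippets with many contexts still map to one fixed-length vector, which is what allows downstream comparisons. The method-name analogies are then plain vector arithmetic on the learned name embeddings, e.g. checking that vec(send) − vec(receive) + vec(download) lands nearest to vec(upload).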