code2vec: Learning Distributed Representations of Code

January 2019 | Uri Alon (Technion, Israel), Meital Zilberstein (Technion, Israel), Omer Levy (Facebook AI Research, USA), Eran Yahav (Technion, Israel)
The paper "code2vec: Learning Distributed Representations of Code" by Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav presents a neural model for representing code snippets as continuous distributed vectors, known as "code embeddings." The main idea is to represent a code snippet as a single fixed-length vector that can be used to predict semantic properties of the snippet. The approach involves decomposing the code into paths in its abstract syntax tree (AST) and then learning the atomic representation of each path while simultaneously learning how to aggregate these paths. The effectiveness of the approach is demonstrated by using it to predict method names from the vector representation of their bodies. The model is trained on a dataset of 12 million methods and shows that it can predict method names from files that were unobserved during training. The model also learns useful method name vectors that capture semantic similarities, combinations, and analogies. Compared to previous techniques, the proposed approach improves by more than 75%, making it the first to successfully predict method names based on a large, cross-project corpus. The trained model, visualizations, and vector similarities are available as an interactive online demo, and the code, data, and trained models are available on GitHub. The paper discusses the challenges of representing code snippets and learning relevant parts of the representation for prediction. It introduces a path-based attention model that learns to aggregate multiple syntactic paths into a single vector, allowing for the prediction of semantic properties of code snippets. The model is evaluated on the task of predicting method names, showing significant improvements over previous methods. The paper also highlights the advantages of distributed representations over symbolic representations in terms of generalization ability and space complexity.The paper "code2vec: Learning Distributed Representations of Code" by Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav presents a neural model for representing code snippets as continuous distributed vectors, known as "code embeddings." The main idea is to represent a code snippet as a single fixed-length vector that can be used to predict semantic properties of the snippet. The approach involves decomposing the code into paths in its abstract syntax tree (AST) and then learning the atomic representation of each path while simultaneously learning how to aggregate these paths. The effectiveness of the approach is demonstrated by using it to predict method names from the vector representation of their bodies. The model is trained on a dataset of 12 million methods and shows that it can predict method names from files that were unobserved during training. The model also learns useful method name vectors that capture semantic similarities, combinations, and analogies. Compared to previous techniques, the proposed approach improves by more than 75%, making it the first to successfully predict method names based on a large, cross-project corpus. The trained model, visualizations, and vector similarities are available as an interactive online demo, and the code, data, and trained models are available on GitHub. The paper discusses the challenges of representing code snippets and learning relevant parts of the representation for prediction. 
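The aggregation itself follows the paper's attention equations: each path-context is embedded, passed through a fully connected tanh layer, and the resulting vectors are combined by a learned attention weighting into a single code vector, which is then matched against method-name (tag) embeddings. A minimal numpy sketch of these equations, with random placeholder weights standing in for trained parameters and toy dimensions of our choosing:

```python
# Numpy sketch of code2vec's path-attention aggregation. Shapes and all
# weights below are illustrative placeholders; the real model learns them.
import numpy as np

rng = np.random.default_rng(0)
d, num_contexts, num_tags = 128, 5, 1000  # toy sizes, not the paper's

# One raw context = [terminal_emb; path_emb; terminal_emb], size 3d.
contexts = rng.normal(size=(num_contexts, 3 * d))

W = rng.normal(size=(d, 3 * d))        # fully connected layer
a = rng.normal(size=(d,))              # global attention vector
tags = rng.normal(size=(num_tags, d))  # method-name (tag) embeddings

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

combined = np.tanh(contexts @ W.T)  # combined context vectors: c~_i = tanh(W c_i)
alpha = softmax(combined @ a)       # attention weights: alpha_i = softmax(c~_i . a)
code_vector = alpha @ combined      # code vector: v = sum_i alpha_i * c~_i
probs = softmax(tags @ code_vector) # prediction: q(y) = softmax(v . tag_y)
print(code_vector.shape, probs.argmax())
```

The attention weights make the aggregation order-invariant over the bag of path-contexts while still letting the model learn which paths matter most for a given snippet.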