Linguistic Regularities in Continuous Space Word Representations

9-14 June 2013 | Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig
This paper studies linguistic regularities in continuous space word representations. The authors examine the vector-space word representations learned by the input-layer weights of continuous space language models and find them surprisingly effective at capturing syntactic and semantic regularities: each relationship is characterized by a relation-specific vector offset, which permits vector-oriented reasoning based on the offsets between words. For example, the male/female relationship is learned automatically, and with the induced vector representations, "King - Man + Woman" yields a vector very close to "Queen". The paper demonstrates that the word vectors capture syntactic regularities through a test set of syntactic analogy questions, of which they answer almost 40% correctly, and that they capture semantic regularities by applying the vector offset method to SemEval-2012 Task 2 questions, outperforming previous systems.
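The offset method is straightforward to implement. The sketch below is illustrative, not the authors' code: it assumes a dictionary `vectors` mapping words to L2-normalized NumPy arrays, computes y = x_b - x_a + x_c for a question "a is to b as c is to __", and returns the vocabulary word whose vector has the greatest cosine similarity to y (excluding the three question words, a common practical refinement).

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the vector offset method.

    vectors: dict mapping each word to an L2-normalized numpy array.
    Returns the word (other than a, b, c) whose vector has the highest
    cosine similarity to y = x_b - x_a + x_c.
    """
    y = vectors[b] - vectors[a] + vectors[c]
    y /= np.linalg.norm(y)  # unit-normalize so a dot product equals cosine similarity

    best_word, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in (a, b, c):        # skip the question words themselves
            continue
        sim = float(np.dot(vec, y))  # both vectors are unit-length
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# With vectors from a trained model, one would hope for:
# solve_analogy("man", "king", "woman", vectors)  ->  "queen"
```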
The study uses a recurrent neural network language model (RNNLM) to generate the word vectors; the word representations are found in the columns of the input weight matrix U. The RNN is trained with backpropagation to maximize the log-likelihood of the data under the model. The model itself has no explicit knowledge of syntax or semantics, yet training this purely lexical model to maximize likelihood induces word representations with striking syntactic and semantic properties. To evaluate the learned representations, the authors created a test set of syntactic analogy questions and used SemEval-2012 Task 2 to measure semantic regularities. The vector offset method, which assumes relationships are present as vector offsets, solves both kinds of analogy question. Experimentally, the RNN-based representations capture significantly more syntactic regularity than LSA vectors and perform well in an absolute sense, answering more than one in three questions correctly. They also outperform previous systems on the semantic task, even though they were never specifically trained or tuned for it, indicating that the representations are robust and effective at capturing both syntactic and semantic regularities in language.
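To make the architecture concrete, the standard RNNLM formulation can be written as follows, with w(t) the 1-of-N (one-hot) encoding of the word at time t, s(t) the hidden state, f a sigmoid, and g a softmax over the vocabulary:

```latex
s(t) = f\big(\mathbf{U}\,w(t) + \mathbf{W}\,s(t-1)\big), \qquad
y(t) = g\big(\mathbf{V}\,s(t)\big),
\quad\text{where}\quad
f(z) = \frac{1}{1 + e^{-z}}, \qquad
g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}.
```

Because w(t) is one-hot, the product Uw(t) simply selects a single column of U, which is why each word's representation lives in the corresponding column of the input weight matrix.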