Grounded Compositional Semantics for Finding and Describing Images with Sentences


2014 | Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, Andrew Y. Ng
This paper introduces the DT-RNN (Dependency Tree Recursive Neural Network), a model that maps sentences and images into a common embedding space so that either modality can be retrieved with the other as the query. Sentences are embedded by composing word vectors along their dependency trees, which lets the model retrieve images described by those sentences and, conversely, find sentences that describe a given image. Unlike earlier recursive neural network models built on constituency trees, DT-RNNs center the representation on the action and its agents, making them more robust to changes in word order and surface syntactic structure.

The model is trained on a dataset of 1,000 images, each paired with 5 human-written descriptions. Sentences are composed into 50-dimensional vectors, while images are represented by 4,096-dimensional feature vectors; both are then projected into a shared multimodal space to form joint representations. Training uses a max-margin objective that encourages correct image-sentence pairs to have high inner products in this space and incorrect pairs to have low inner products (a sketch of the composition step and of this objective follows below).

The DT-RNN is evaluated on three tasks: measuring sentence similarity, searching for images with query sentences, and describing images by finding suitable sentences. It is compared against a constituency-tree RNN, a standard recurrent neural network, a bag-of-words baseline, and kernelized canonical correlation analysis (kCCA), and it outperforms these models on both retrieval directions; it also produces more similar representations for sentences that describe the same image. Ablations show that training with a max-margin loss over inner products works better than a squared-error loss with Euclidean distance in the multimodal space.

Qualitatively, the model captures the meaning of a sentence in terms of its similarity to a "visual representation" of the textual description. Because the main verb and its subject and object are merged last in the dependency tree, the final sentence vector is dominated by the action and its agents and is less sensitive to adjectival modifiers and word-order changes. The composition function can also be conditioned on the dependency relations between words, such as nominal subject, possession modifier, passive auxiliary, and prepositions, which lets the model weight nouns together with their spatial prepositions and adjectives.
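As a rough illustration of the composition step: each dependency-tree node combines its own word vector with the hidden vectors of its children, with every child weighted by the number of words it spans, so that heads with large subtrees (typically the verb and its arguments) dominate the final representation. The formula below is reconstructed from this description and is a sketch, not the paper's exact equation:

    h_i = f( (1/ℓ(i)) · ( W_v x_i + Σ_{j ∈ C(i)} ℓ(j) · W_{R(i,j)} h_j ) )

Here x_i is the word vector at node i, C(i) are its children in the dependency tree, ℓ(j) is the number of words spanned by child j, ℓ(i) is the total word count under node i, f is an elementwise nonlinearity, and W_{R(i,j)} is a matrix selected by the child's relative position or, in the semantic variant, by the dependency relation between i and j. The sentence representation is the hidden vector at the root.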
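The joint training described above can be sketched as a ranking loss over inner products: for every correct image-sentence pair, all other sentences (for that image) and all other images (for that sentence) should score lower by at least a margin. The NumPy snippet below is a minimal illustration of such a loss; the function name, the margin value, and the exact choice of negative examples are assumptions for this example, not the paper's precise formulation (which, for instance, would exclude other descriptions of the same image from the negatives).

    import numpy as np

    def multimodal_margin_loss(img_vecs, sent_vecs, pairs, margin=1.0):
        """Hinge loss over inner products in the shared multimodal space.

        img_vecs:  (num_images, d) array of projected image vectors.
        sent_vecs: (num_sentences, d) array of sentence vectors projected
                   into the same d-dimensional space.
        pairs:     list of (image_index, sentence_index) correct pairs.
        """
        scores = img_vecs @ sent_vecs.T  # all image-sentence inner products
        loss = 0.0
        for i, j in pairs:
            correct = scores[i, j]
            # The correct sentence should outscore every other sentence for image i ...
            for c in range(sent_vecs.shape[0]):
                if c != j:
                    loss += max(0.0, margin - correct + scores[i, c])
            # ... and the correct image should outscore every other image for sentence j.
            for c in range(img_vecs.shape[0]):
                if c != i:
                    loss += max(0.0, margin - correct + scores[c, j])
        return loss

At test time, retrieval simply ranks candidates by the same inner product: the sentence (or image) with the highest score against the query is returned.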