This paper presents a multimodal distributional semantic model that integrates text-based and image-based information to enhance the representation of word meaning. Distributional semantic models (DSMs) derive word meanings from co-occurrence patterns in text, but they lack perceptual grounding, which is crucial for human semantic knowledge. To address this, the authors propose a flexible architecture that combines text-based and image-based features to create perceptually enhanced distributional vectors. They show that the integrated model outperforms purely text-based approaches and that the two information sources provide complementary semantic information.
The paper first reviews the distributional hypothesis and its application in DSMs, which approximate word meaning with vectors based on co-occurrence patterns. However, DSMs are limited by their reliance on textual contexts, which are less rich than perceptual sources. The authors argue that integrating visual information can provide a more accurate representation of word meaning.
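To make the text-based starting point concrete, the sketch below builds count-based co-occurrence vectors from a toy corpus and compares them with cosine similarity. This is a minimal illustration of a generic DSM, not the authors' actual pipeline; the corpus, the window size, and the example words are assumptions chosen only for the demonstration.

```python
import math
from collections import Counter

# Toy corpus; the sentences, window size, and words are illustrative assumptions.
corpus = [
    "the dog chased the cat across the garden",
    "the cat slept on the warm windowsill",
    "the car sped down the motorway at night",
]

WINDOW = 2  # symmetric context window (an arbitrary but common choice)

def build_cooccurrence(sentences, window):
    """Map each word to a Counter of the words that appear within the window."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors.setdefault(target, Counter())[tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = build_cooccurrence(corpus, WINDOW)
print(cosine(vectors["dog"], vectors["cat"]))  # words in similar contexts score higher
print(cosine(vectors["dog"], vectors["car"]))  # unrelated words share fewer contexts
```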
To achieve this, the authors use computer vision techniques to extract visual words from images, which are then combined with text-based features. They propose a general parametrized architecture for multimodal fusion that automatically determines the optimal mixture of text and image-based features for a given task. The model is evaluated on two semantic tasks: predicting the degree of semantic relatedness between word pairs and categorizing nominal concepts into classes.
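The fusion step can be pictured as a weighted combination of the two channels. The following sketch is one plausible instantiation under stated assumptions, not the authors' exact architecture: each channel is L2-normalized and then concatenated with a mixing weight alpha; the vectors, dimensionalities, and alpha values are hypothetical.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length so text and image channels are comparable."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def fuse(text_vec, image_vec, alpha):
    """Weighted concatenation of normalized text and image vectors.

    alpha in [0, 1] controls the mixture: 1.0 keeps only the textual channel,
    0.0 only the visual one; intermediate values blend the two.
    """
    return np.concatenate([alpha * l2_normalize(text_vec),
                           (1.0 - alpha) * l2_normalize(image_vec)])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical inputs: a 300-dimensional text vector and a 100-dimensional
# visual-word vector per concept (dimensions and values are made up).
rng = np.random.default_rng(0)
text_dog, image_dog = rng.random(300), rng.random(100)
text_cat, image_cat = rng.random(300), rng.random(100)

for alpha in (1.0, 0.5, 0.0):
    sim = cosine(fuse(text_dog, image_dog, alpha),
                 fuse(text_cat, image_cat, alpha))
    print(f"alpha={alpha:.1f}  dog/cat similarity={sim:.3f}")
```

In this picture, tuning alpha per task plays the role of the automatic selection of the optimal text/image mixture described above.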
The results show that multimodal DSMs consistently outperform purely textual models, confirming that grounding meaning in perception improves computational models of meaning.
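As a rough illustration of how a relatedness evaluation of this kind is typically scored (the word pairs, human ratings, and model vectors below are invented; real benchmarks supply thousands of human-rated pairs), the model's cosine similarities are correlated with human judgments using Spearman's rho:

```python
import numpy as np
from scipy.stats import spearmanr

# Invented benchmark: word pairs with human relatedness ratings on a 0-10 scale.
human_ratings = {
    ("dog", "cat"): 8.5,
    ("car", "automobile"): 9.2,
    ("coffee", "cup"): 6.6,
    ("dog", "car"): 2.1,
}

# Invented model vectors (in practice these would be the fused multimodal vectors).
rng = np.random.default_rng(1)
vectors = {word: rng.random(400) for pair in human_ratings for word in pair}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vectors[a], vectors[b]) for a, b in human_ratings]
gold_scores = list(human_ratings.values())

# Spearman's rho compares the model's ranking of the pairs with the human ranking;
# a higher correlation means the model better reproduces relatedness judgments.
rho, p_value = spearmanr(model_scores, gold_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```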
The paper also discusses related work, including earlier attempts to construct multimodal distributional representations and strategies for combining text and image information. It concludes with a discussion of the challenges and future directions for multimodal distributional semantics.