This paper presents a multimodal distributional semantic model that integrates text-based and image-based information to enhance the representation of word meaning. Distributional semantic models (DSMs) derive word meanings from co-occurrence patterns in text, but they lack perceptual grounding, which is crucial for human semantic knowledge. To address this, the authors propose a flexible architecture that combines text-based and image-based features to create perceptually enhanced distributional vectors. They show that the integrated model outperforms purely text-based approaches and that the two information sources provide complementary semantic information.
The paper first reviews the distributional hypothesis and its application in DSMs, which approximate word meaning with vectors based on co-occurrence patterns. However, DSMs are limited by their reliance on textual contexts, which are less rich than perceptual sources. The authors argue that integrating visual information can provide a more accurate representation of word meaning.
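To make the text-based starting point concrete, the sketch below builds count-based co-occurrence vectors from a toy corpus and compares them with cosine similarity. This is a minimal illustration of a generic DSM, not the authors' actual pipeline; the corpus, the window size, and the example words are assumptions chosen only for the demonstration.

```python
import math
from collections import Counter

# Toy corpus; the sentences, window size, and words are illustrative assumptions.
corpus = [
    "the dog chased the cat across the garden",
    "the cat slept on the warm windowsill",
    "the car sped down the motorway at night",
]

WINDOW = 2  # symmetric context window (an arbitrary but common choice)

def build_cooccurrence(sentences, window):
    """Map each word to a Counter of the words that appear within the window."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors.setdefault(target, Counter())[tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = build_cooccurrence(corpus, WINDOW)
print(cosine(vectors["dog"], vectors["cat"]))  # words in similar contexts score higher
print(cosine(vectors["dog"], vectors["car"]))  # unrelated words share fewer contexts
```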
To achieve this, the authors use computer vision techniques to extract visual words from images, which are then combined with text-based features. They propose a general parametrized architecture for multimodal fusion that automatically determines the optimal mixture of text and image-based features for a given task. The model is evaluated on two semantic tasks: predicting the degree of semantic relatedness between word pairs and categorizing nominal concepts into classes.
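The fusion step can be pictured as a weighted combination of the two channels. The following sketch is one plausible instantiation under stated assumptions, not the authors' exact architecture: each channel is L2-normalized and then concatenated with a mixing weight alpha; the vectors, dimensionalities, and alpha values are hypothetical.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length so text and image channels are comparable."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def fuse(text_vec, image_vec, alpha):
    """Weighted concatenation of normalized text and image vectors.

    alpha in [0, 1] controls the mixture: 1.0 keeps only the textual channel,
    0.0 only the visual one; intermediate values blend the two.
    """
    return np.concatenate([alpha * l2_normalize(text_vec),
                           (1.0 - alpha) * l2_normalize(image_vec)])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical inputs: a 300-dimensional text vector and a 100-dimensional
# visual-word vector per concept (dimensions and values are made up).
rng = np.random.default_rng(0)
text_dog, image_dog = rng.random(300), rng.random(100)
text_cat, image_cat = rng.random(300), rng.random(100)

for alpha in (1.0, 0.5, 0.0):
    sim = cosine(fuse(text_dog, image_dog, alpha),
                 fuse(text_cat, image_cat, alpha))
    print(f"alpha={alpha:.1f}  dog/cat similarity={sim:.3f}")
```

In this picture, tuning alpha per task plays the role of the automatic selection of the optimal text/image mixture described above.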
The results show that multimodal DSMs consistently outperform purely textual models, confirming that grounding meaning in perception improves computational models of meaning.
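As a rough illustration of how a relatedness evaluation of this kind is typically scored (the word pairs, human ratings, and model vectors below are invented; real benchmarks supply thousands of human-rated pairs), the model's cosine similarities are correlated with human judgments using Spearman's rho:

```python
import numpy as np
from scipy.stats import spearmanr

# Invented benchmark: word pairs with human relatedness ratings on a 0-10 scale.
human_ratings = {
    ("dog", "cat"): 8.5,
    ("car", "automobile"): 9.2,
    ("coffee", "cup"): 6.6,
    ("dog", "car"): 2.1,
}

# Invented model vectors (in practice these would be the fused multimodal vectors).
rng = np.random.default_rng(1)
vectors = {word: rng.random(400) for pair in human_ratings for word in pair}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vectors[a], vectors[b]) for a, b in human_ratings]
gold_scores = list(human_ratings.values())

# Spearman's rho compares the model's ranking of the pairs with the human ranking;
# a higher correlation means the model better reproduces relatedness judgments.
rho, p_value = spearmanr(model_scores, gold_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```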
The paper also discusses related work, including earlier attempts to construct multimodal distributional representations and strategies for combining text and image information. It concludes with a discussion of the challenges and future directions for multimodal distributional semantics.