Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

26 Mar 2024 | Jiamian Wang, Guohao Sun, Pichao Wang, Dongfang Liu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
This paper introduces T-MASS, a novel stochastic text modeling method for text-video retrieval. Unlike existing methods that treat text as a single point in the joint embedding space, T-MASS models each text as a stochastic embedding, a "mass" with a flexible and resilient semantic range, to account for potential misalignment between text and video embeddings. A similarity-aware radius module adapts the scale of the text mass to each text-video pair, while a support text regularization helps control the text mass during training. The model is trained with a combination of symmetric cross-entropy loss and this support text regularization, and the inference pipeline is tailored to fully exploit the text mass for accurate retrieval.

Empirically, T-MASS significantly improves over baseline methods, achieving state-of-the-art results on five benchmark datasets (MSRVTT, LSMDC, DiDeMo, VATEX, and Charades) and outperforming existing methods by 3% to 6.3% at R@1, with consistent gains across model sizes and data scales. The stochastic text embedding adds the expressiveness and flexibility needed to capture rich video semantic clues: it bridges relevant text-video pairs while distancing irrelevant ones, and enables precise text semantics mapping and better text-video alignment. The proposed method is efficient and effective, offering a promising solution for text-video retrieval.
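The summary above describes the text mass only at a high level. As a minimal PyTorch sketch, one plausible instantiation samples a point around the text embedding with a pair-dependent radius; the linear-plus-exp radius module and the Gaussian perturbation below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class StochasticTextEmbedding(nn.Module):
    """Sketch of a T-MASS-style stochastic text embedding.

    A point text embedding t is replaced by a sample from a "text mass"
    centered at t: t_s = t + R * eps, with eps ~ N(0, I). The radius R
    is predicted from the (text, video) pair, so the mass can grow or
    shrink per pair.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Similarity-aware radius module (assumed form): map the
        # concatenated pair [t; v] to a positive per-dimension scale.
        self.radius = nn.Linear(2 * dim, dim)

    def forward(self, t: torch.Tensor, v: torch.Tensor):
        # t, v: (batch, dim) text and pooled video embeddings.
        r = torch.exp(self.radius(torch.cat([t, v], dim=-1)))
        eps = torch.randn_like(t)   # random direction inside the mass
        return t + r * eps, r       # one stochastic sample t_s, plus r
```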
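Training combines a standard symmetric cross-entropy over in-batch text-video pairs with the support text regularization. The sketch below assumes a CLIP-style contrastive loss and a hypothetical construction of the support point on the boundary of the mass facing the paired video; the temperature value and the exact support-point definition are assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_ce(t_s: torch.Tensor, v: torch.Tensor,
                 temperature: float = 0.05) -> torch.Tensor:
    # CLIP-style symmetric cross-entropy: matched text-video pairs lie
    # on the diagonal of the (batch x batch) cosine-similarity matrix.
    t_n = F.normalize(t_s, dim=-1)
    v_n = F.normalize(v, dim=-1)
    logits = t_n @ v_n.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def support_text_reg(t: torch.Tensor, r: torch.Tensor,
                     v: torch.Tensor) -> torch.Tensor:
    # Hypothetical support point: step from t toward v by the radius r,
    # i.e. a point on the boundary of the text mass facing the video,
    # and pull it to v with a cosine-distance penalty.
    direction = F.normalize(v - t, dim=-1)
    t_sup = t + r * direction
    return (1.0 - F.cosine_similarity(t_sup, v, dim=-1)).mean()
```

The total objective would then be something like `symmetric_ce(t_s, v) + lam * support_text_reg(t, r, v)` for a weighting `lam` chosen on validation data (hypothetical).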
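At inference, one way to exploit the text mass is to draw several samples per query and keep each candidate video's best similarity across samples; the sample count and the max-pooling rule here are assumptions about how the tailored inference pipeline might look.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_videos(st_module, t, videos, num_samples: int = 20):
    # t: (1, dim) query text embedding; videos: (N, dim) video embeddings.
    t_rep = t.expand(videos.size(0), -1)    # pair the query with every video
    v_n = F.normalize(videos, dim=-1)
    best = torch.full((videos.size(0),), float("-inf"), device=videos.device)
    for _ in range(num_samples):
        t_s, _ = st_module(t_rep, videos)                 # (N, dim) samples
        sims = (F.normalize(t_s, dim=-1) * v_n).sum(-1)   # cosine per video
        best = torch.maximum(best, sims)                  # keep best sample
    return best.argsort(descending=True)                  # ranked video indices
```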