The paper "Text is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval" addresses a core challenge of text-video retrieval: the text in existing datasets is often short and concise, making it difficult to fully describe the rich, redundant semantics of a video. To address this, the authors propose a new stochastic text modeling method called T-MASS (Text is Modeled As a Stochastic embedding). T-MASS projects text into a "text mass" rather than a single embedding point, giving the text a more flexible and resilient semantic range. This enables better alignment between video and text semantics in the joint embedding space.
Key contributions of T-MASS include:
1. **Stochastic Text Modeling**: Text is modeled as a stochastic embedding, allowing for a flexible and resilient semantic range.
2. **Similarity-Aware Radius Modeling**: A similarity-aware radius module is introduced to adapt the scale of the text mass based on the given text-video pairs.
3. **Support Text Regularization**: A support text vector is used to control the position and scale of the text mass during training.
4. **Inference Pipeline**: The inference pipeline is tailored to fully exploit the text mass for accurate retrieval.
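The stochastic modeling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the text mass is sampled as the text embedding plus Gaussian noise scaled by a pair-dependent radius, and the `radius_fn` placeholder stands in for the paper's learned similarity-aware radius module.

```python
import numpy as np

def sample_text_embedding(t, v, radius_fn, rng):
    """Sample one stochastic text embedding from the 'text mass'.

    t: text embedding, shape (d,)
    v: video embedding, shape (d,)
    radius_fn: maps the (text, video) pair to a nonnegative per-dimension
        radius, so the scale of the text mass adapts to each pair
        (a stand-in for the paper's similarity-aware radius module).
    """
    R = radius_fn(t, v)                   # per-dimension radius, shape (d,)
    eps = rng.standard_normal(t.shape)    # Gaussian perturbation
    return t + R * eps                    # a point inside the text mass

# Toy usage with a hypothetical (untrained) radius function.
rng = np.random.default_rng(0)
d = 8
t = rng.standard_normal(d)
v = rng.standard_normal(d)
radius_fn = lambda t, v: np.abs(np.tanh(t * v))  # placeholder, not the paper's module
t_s = sample_text_embedding(t, v, radius_fn, rng)
print(t_s.shape)  # (8,)
```

During training, multiple such samples let relevant pairs be pulled together over a semantic region rather than a single point; at inference, the text mass is exploited to score text-video similarity.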
The authors evaluate T-MASS on five benchmark datasets (MSRVTT, LSMDC, DiDeMo, VATEX, and Charades) and show that it outperforms baseline methods by a significant margin (3% to 6.3% improvement at R@1). The experimental results demonstrate that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones, but also enables precise text embedding for relevant pairs. The code and models are available online.