The paper "Text is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval" addresses a core challenge of text-video retrieval: the text in existing datasets is often short and concise, making it difficult to fully describe the rich, redundant semantics of a video. To address this, the authors propose a new stochastic text modeling method called T-MASS (Text is Modeled As a Stochastic embedding). T-MASS projects text into a "text mass" rather than a single embedding point, giving the text a more flexible and resilient semantic range. This enables better alignment between video and text semantics in the joint embedding space.
Key contributions of T-MASS include:
1. **Stochastic Text Modeling**: Text is modeled as a stochastic embedding, allowing for a flexible and resilient semantic range.
2. **Similarity-Aware Radius Modeling**: A similarity-aware radius module is introduced to adapt the scale of the text mass based on the given text-video pairs.
3. **Support Text Regularization**: A support text vector is used to control the position and scale of the text mass during training.
4. **Inference Pipeline**: The inference pipeline is tailored to fully exploit the text mass for accurate retrieval.
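The stochastic modeling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the text mass is sampled as the text embedding plus Gaussian noise scaled by a pair-dependent radius, and the `radius_fn` placeholder stands in for the paper's learned similarity-aware radius module.

```python
import numpy as np

def sample_text_embedding(t, v, radius_fn, rng):
    """Sample one stochastic text embedding from the 'text mass'.

    t: text embedding, shape (d,)
    v: video embedding, shape (d,)
    radius_fn: maps the (text, video) pair to a nonnegative per-dimension
        radius, so the scale of the text mass adapts to each pair
        (a stand-in for the paper's similarity-aware radius module).
    """
    R = radius_fn(t, v)                   # per-dimension radius, shape (d,)
    eps = rng.standard_normal(t.shape)    # Gaussian perturbation
    return t + R * eps                    # a point inside the text mass

# Toy usage with a hypothetical (untrained) radius function.
rng = np.random.default_rng(0)
d = 8
t = rng.standard_normal(d)
v = rng.standard_normal(d)
radius_fn = lambda t, v: np.abs(np.tanh(t * v))  # placeholder, not the paper's module
t_s = sample_text_embedding(t, v, radius_fn, rng)
print(t_s.shape)  # (8,)
```

During training, multiple such samples let relevant pairs be pulled together over a semantic region rather than a single point; at inference, the text mass is exploited to score text-video similarity.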
The authors evaluate T-MASS on five benchmark datasets (MSRVTT, LSMDC, DiDeMo, VATEX, and Charades) and show that it outperforms baseline methods by a significant margin (3% to 6.3% improvement at R@1). The experimental results demonstrate that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones, but also enables precise text embedding for relevant pairs. The code and models are available online.