VideoBERT: A Joint Model for Video and Language Representation Learning


11 Sep 2019 | Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid
VideoBERT is a joint model designed to learn high-level visual and linguistic representations from video data. Inspired by the success of BERT in natural language processing, VideoBERT combines automatic speech recognition (ASR) to convert speech into text, vector quantization of low-level visual features into discrete "visual words", and the BERT architecture to learn bidirectional joint distributions over sequences of visual and linguistic tokens. The model is trained on a large-scale corpus of cooking videos from YouTube, leveraging both the video frames and the accompanying speech.

Key contributions include:

1. **Joint Visual-Linguistic Representation Learning**: VideoBERT learns high-level semantic features by combining visual and linguistic data, which is crucial for understanding long-term events and actions in videos.
2. **Text-to-Video Generation**: The model can generate video tokens from text, for example illustrating a recipe step by step.
3. **Future Forecasting**: VideoBERT can predict future video tokens at different time scales, providing insight into the temporal dynamics of video content.
4. **Performance on Downstream Tasks**: VideoBERT outperforms state-of-the-art models on video captioning, demonstrating its effectiveness at capturing high-level semantic features.

The paper also discusses the importance of large training datasets and cross-modal information for achieving strong performance. Experimental results on the YouCook II dataset validate the model's capabilities, showing that VideoBERT can perform zero-shot classification and improve video captioning. The authors conclude by highlighting the promise of joint visual-linguistic representation learning and plan to explore more applications and datasets.
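The summary above describes two preprocessing steps that make a BERT-style model applicable to video: quantizing clip-level visual features into a discrete vocabulary of "visual words", and concatenating ASR text tokens with the resulting visual tokens into a single sequence for masked-token training. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation: names such as `build_visual_vocabulary`, `quantize_clips`, and `NUM_VISUAL_WORDS` are illustrative, plain k-means stands in for the paper's hierarchical k-means over pretrained clip features, and random vectors stand in for real features so the example runs.

```python
# Minimal sketch (not the authors' code) of the preprocessing described above:
# (1) vector-quantize clip-level visual features into discrete "visual words",
# (2) join ASR text tokens and visual tokens into one BERT-style input sequence.
# Assumptions: clip features would come from a pretrained video network; a tiny
# vocabulary and random features are used here purely so the example runs.

import numpy as np
from sklearn.cluster import KMeans

NUM_VISUAL_WORDS = 64   # the paper builds a much larger vocabulary via hierarchical k-means
FEATURE_DIM = 512       # assumed dimensionality of the pretrained clip features


def build_visual_vocabulary(clip_features: np.ndarray) -> KMeans:
    """Cluster clip features; each cluster id then acts as one visual token."""
    return KMeans(n_clusters=NUM_VISUAL_WORDS, n_init=4, random_state=0).fit(clip_features)


def quantize_clips(clip_features: np.ndarray, vocab: KMeans) -> list:
    """Map each clip feature to a discrete token string such as 'vis_17'."""
    return [f"vis_{c}" for c in vocab.predict(clip_features)]


def build_joint_sequence(asr_tokens: list, visual_tokens: list) -> list:
    """Concatenate the linguistic and visual 'sentences' into one sequence,
    with a boundary token (written here as [>]) separating the two modalities."""
    return ["[CLS]", *asr_tokens, "[>]", *visual_tokens, "[SEP]"]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for features extracted from short clips of a cooking video.
    features = rng.normal(size=(500, FEATURE_DIM)).astype(np.float32)

    vocab = build_visual_vocabulary(features)
    video_tokens = quantize_clips(features[:4], vocab)
    sequence = build_joint_sequence(["place", "the", "steak", "in", "the", "pan"], video_tokens)
    print(sequence)
```

A full VideoBERT setup would feed such joint sequences into a masked-token objective, randomly masking both text and visual tokens; that bidirectional training over both modalities is what enables the text-to-video generation and future forecasting uses listed above.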