11 Sep 2019 | Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid
VideoBERT is a joint model for learning video and language representations without explicit supervision. Inspired by BERT, it learns bidirectional joint distributions over sequences of visual and linguistic tokens: the visual tokens come from vector quantization of video features, and the linguistic tokens come from automatic speech recognition (ASR) transcripts. Modeling both streams with a single BERT-style transformer lets the model capture the relationship between the visual and linguistic domains, i.e. between what is shown and what is said.

The model is pretrained on a large-scale corpus of instructional cooking videos and applied to several downstream tasks, including zero-shot action classification and video captioning. On the YouCook II benchmark it outperforms the prior state of the art in video captioning, and the learned features capture high-level semantics rather than low-level appearance. The cross-modal setup, which uses speech and visual signals together, yields strong self-supervised video representations, and performance improves consistently as the pretraining set grows. The paper closes by discussing the benefits of large training sets and directions for future work on joint video-language representation learning.
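The visual tokens mentioned above are produced by quantizing clip-level video features. Below is a minimal sketch of that step, assuming feature vectors have already been extracted from a pretrained video network (the paper extracts S3D features and clusters them with hierarchical k-means; the plain k-means and the `n_tokens` default used here are simplifications for illustration).

```python
# Sketch of visual tokenization: cluster pre-extracted clip features so that
# each cluster id can be used as a "visual word" in a BERT-style vocabulary.
# Plain k-means and n_tokens=1024 are simplifying assumptions; the paper uses
# hierarchical k-means over S3D features.
import numpy as np
from sklearn.cluster import KMeans

def build_visual_vocab(features: np.ndarray, n_tokens: int = 1024, seed: int = 0) -> KMeans:
    """Fit a codebook of visual tokens on a matrix of clip features (N x D)."""
    return KMeans(n_clusters=n_tokens, random_state=seed).fit(features)

def video_to_tokens(clip_features: np.ndarray, vocab: KMeans) -> list[int]:
    """Map each clip's feature vector to the id of its nearest centroid."""
    return vocab.predict(clip_features).tolist()
```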
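Pretraining then follows BERT's masked-token objective over the combined sequence of ASR text tokens and visual tokens. The sketch below shows how such a joint input might be assembled and masked; the special tokens ([CLS], [>], [SEP], [MASK]) follow the paper's figures and BERT convention, while the helper name, the always-substitute-[MASK] masking rule, and the `vis_` prefix are illustrative assumptions rather than the released implementation.

```python
# Sketch of building a joint text+video training example with simplified
# BERT-style masking (every selected position is replaced by [MASK]).
import random

CLS, SEP, BREAK, MASK = "[CLS]", "[SEP]", "[>]", "[MASK]"

def make_joint_example(asr_tokens, visual_tokens, mask_prob=0.15, rng=random):
    """Concatenate linguistic and visual tokens, then mask ~15% of them."""
    tokens = ([CLS] + list(asr_tokens) + [BREAK]
              + [f"vis_{t}" for t in visual_tokens] + [SEP])
    inputs, labels = [], []
    for tok in tokens:
        if tok not in (CLS, SEP, BREAK) and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)    # model must reconstruct the masked token
        else:
            inputs.append(tok)
            labels.append(None)   # position not scored by the loss
    return inputs, labels

# Example: a cooking-style ASR caption paired with quantized clip ids.
inputs, labels = make_joint_example(
    ["cut", "the", "cabbage", "into", "pieces"], [412, 9, 9, 1057]
)
```

Because text and video share one sequence, the same masked-prediction loss teaches the model both unimodal structure and cross-modal alignment, which is what enables the zero-shot classification and captioning results reported in the paper.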