OmniViD: A Generative Framework for Universal Video Understanding

26 Mar 2024 | Junke Wang, Dongdong Chen, Chong Luo, Bo He, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang
OmniViD is a generative framework for universal video understanding that unifies diverse video tasks, including action recognition, video captioning, and object tracking, within a single architecture. The framework introduces a shared vocabulary of word tokens, time tokens, and box tokens, so that the outputs of these different tasks can all be represented as token sequences. This unified output space enables a shared encoder-decoder architecture: the encoder processes video and text inputs, and the decoder generates token sequences conditioned on the resulting multimodal representation. An MQ-former efficiently encodes video features into a compact representation using content, sentence, and box queries. The framework is trained to maximize the log-likelihood of the predicted tokens and achieves state-of-the-art results on multiple video benchmarks. Experiments show that OmniViD outperforms existing methods on action recognition, video captioning, and object tracking, demonstrating its effectiveness across a wide range of video understanding tasks. Its ability to handle such diverse tasks within one architecture highlights its potential for more universal video understanding.
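To make the shared-vocabulary idea concrete, the following is a minimal sketch (not the authors' code) of how task outputs might be serialized into one token sequence over word, time, and box tokens; the vocabulary size, bin counts, offsets, and helper names are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch: serializing different video-task targets into a single
# shared vocabulary of word tokens, time tokens, and box tokens.
# All sizes, offsets, and token ids below are assumptions for illustration.

WORD_VOCAB_SIZE = 32000   # assumed size of the base word vocabulary
NUM_TIME_BINS = 300       # assumed number of discretized time tokens
NUM_BOX_BINS = 1000       # assumed number of discretized coordinate tokens

TIME_OFFSET = WORD_VOCAB_SIZE
BOX_OFFSET = WORD_VOCAB_SIZE + NUM_TIME_BINS


def time_token(t_sec: float, duration_sec: float) -> int:
    """Map a timestamp to one of NUM_TIME_BINS discrete time tokens."""
    bin_idx = min(int(t_sec / duration_sec * NUM_TIME_BINS), NUM_TIME_BINS - 1)
    return TIME_OFFSET + bin_idx


def box_tokens(box_xyxy, width: int, height: int) -> list:
    """Map a bounding box (x1, y1, x2, y2) to four discrete box tokens."""
    x1, y1, x2, y2 = box_xyxy
    normalized = [x1 / width, y1 / height, x2 / width, y2 / height]
    return [BOX_OFFSET + min(int(v * NUM_BOX_BINS), NUM_BOX_BINS - 1)
            for v in normalized]


# A temporally grounded caption target becomes one flat token sequence
# (two time tokens followed by word tokens) that a single decoder can generate.
caption_word_ids = [101, 452, 7, 88]  # placeholder word-token ids
segment = [time_token(2.4, 30.0), time_token(9.1, 30.0)]
captioning_target = segment + caption_word_ids

# A tracking target for one frame is simply four box tokens.
tracking_target = box_tokens((40, 60, 200, 220), width=640, height=360)

print(captioning_target)
print(tracking_target)
```

With all targets expressed this way, training reduces to maximizing the log-likelihood of the target token sequence with a standard autoregressive cross-entropy loss, exactly as in text generation.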