26 Mar 2024 | Junke Wang, Dongdong Chen, Chong Luo, Bo He, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang
OmniViD is a generative framework designed to unify the output space of diverse video understanding tasks, such as action recognition, captioning, and tracking. By using language as labels and introducing dedicated time and box tokens, OmniViD formulates these tasks as video-grounded token generation. This formulation allows a fully shared encoder-decoder architecture, enabling a single model to address different types of video tasks effectively. The framework leverages a lightweight MQ-former to produce efficient video representations and a token decoder to generate the output token sequences. Extensive experiments on seven video benchmarks demonstrate that OmniViD achieves state-of-the-art or competitive results, showcasing its effectiveness and versatility for universal video understanding.
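To make the "time and box tokens" idea concrete, the sketch below shows one common way such unification can work: continuous timestamps and bounding-box coordinates are quantized into discrete bins, each bin mapped to a special vocabulary token, so a single decoder vocabulary covers words, times, and boxes alike. The bin count, token names, and normalization scheme here are illustrative assumptions, not OmniViD's actual configuration.

```python
NUM_BINS = 100  # assumed quantization granularity (hypothetical)

def quantize(value, num_bins=NUM_BINS):
    """Map a normalized value in [0, 1] to a discrete bin index."""
    return min(int(value * num_bins), num_bins - 1)

def time_tokens(start, end, duration):
    """Encode a temporal segment as two <time_i> tokens,
    with timestamps normalized by the clip duration."""
    return [f"<time_{quantize(start / duration)}>",
            f"<time_{quantize(end / duration)}>"]

def box_tokens(x1, y1, x2, y2, width, height):
    """Encode a bounding box as four <box_i> tokens,
    with coordinates normalized by the frame size."""
    coords = [x1 / width, y1 / height, x2 / width, y2 / height]
    return [f"<box_{quantize(c)}>" for c in coords]

# A 2s-5s segment in a 10s clip, and a box in a 128x160 frame:
print(time_tokens(2.0, 5.0, 10.0))          # → ['<time_20>', '<time_50>']
print(box_tokens(32, 40, 96, 120, 128, 160))  # → ['<box_25>', '<box_25>', '<box_75>', '<box_75>']
```

Under a scheme like this, a tracking target or a temporal grounding result becomes just another token sequence, which is what lets one encoder-decoder serve recognition, captioning, and localization tasks without task-specific heads.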