2024 | Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff
The article introduces the Memory-Consolidated Vision Transformer (MC-ViT), an approach to long-context video understanding. Traditional transformer-based video encoders are limited by the quadratic complexity of self-attention, which restricts them to short videos. MC-ViT instead repurposes pre-trained video transformers, without architectural changes and with minimal training overhead, by letting them attend to a non-parametrically consolidated memory of past activations. Because redundancy reduction keeps this memory compressed to a bounded size, complexity stays bounded and the method scales efficiently to much longer videos. MC-ViT achieves state-of-the-art results on benchmarks such as fine-grained action recognition and video question answering with significantly fewer parameters than competing methods, and it is competitive with large-scale proprietary models, demonstrating the effectiveness of its memory consolidation strategy.
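To make the mechanism concrete, here is a minimal sketch of streaming attention over a consolidated memory. This is an illustration of the general idea, not the paper's implementation: the segment size, memory size, single-head attention, and the simple k-means routine (used here as one plausible non-parametric consolidation scheme) are all assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def kmeans_consolidate(acts: torch.Tensor, num_slots: int, iters: int = 10) -> torch.Tensor:
    """Summarize activations (n, d) into at most `num_slots` centroids via
    plain k-means -- one plausible non-parametric consolidation scheme
    (details here are illustrative, not the paper's exact recipe)."""
    if acts.shape[0] <= num_slots:
        return acts
    centroids = acts[torch.randperm(acts.shape[0])[:num_slots]].clone()
    for _ in range(iters):
        assign = torch.cdist(acts, centroids).argmin(dim=1)  # nearest centroid
        for k in range(num_slots):
            members = acts[assign == k]
            if members.shape[0] > 0:
                centroids[k] = members.mean(dim=0)
    return centroids

def attend_with_memory(x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
    """Single-head attention for the current segment's tokens, with
    keys/values extended by the consolidated memory so the segment can
    'see' a compressed form of the past (keys double as values here
    for simplicity)."""
    kv = torch.cat([memory, x], dim=0)
    attn = F.softmax(x @ kv.T / x.shape[-1] ** 0.5, dim=-1)
    return attn @ kv

# Toy streaming loop: process a long video segment by segment while
# keeping the memory (and hence per-segment attention cost) bounded.
d, mem_slots = 64, 128
memory = torch.empty(0, d)
video_segments = [torch.randn(32, d) for _ in range(8)]  # stand-in ViT activations
for seg in video_segments:
    out = attend_with_memory(seg, memory)
    memory = kmeans_consolidate(torch.cat([memory, seg]), mem_slots)
```

Since the memory never exceeds a fixed number of slots, each segment's attention cost is roughly quadratic only in the segment length plus the (constant) memory size, rather than quadratic in the total video length.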