2024 | Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff
Memory consolidation enables long-context video understanding by repurposing pre-trained video transformers with a non-parametric memory mechanism. Standard video transformers are limited by the quadratic cost of self-attention over all frames; the proposed Memory-Consolidated Vision Transformer (MC-ViT) extends their temporal context by consolidating past activations into a compact memory bank that subsequent segments cross-attend to. This lets MC-ViT process long videos efficiently and outperform methods with significantly more parameters, reaching state-of-the-art results on benchmarks such as Diving48, EgoSchema, and Perception Test and demonstrating strong long-context modeling. Past activations are compressed with simple non-parametric consolidation techniques, such as k-means clustering and coreset selection, so the memory grows far more slowly than the raw token sequence and the method scales efficiently. Despite using a standard architecture, MC-ViT is competitive with large-scale proprietary models such as GPT-4V and Bard, and it remains efficient in both training and inference thanks to reduced memory and computational complexity. The approach could carry over to other sequential domains, such as natural language and audio processing, and the work highlights the importance of compressed video representations and efficient memory consolidation for long-context modeling.
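To make the consolidation mechanism concrete, below is a minimal NumPy sketch of the streaming idea: each segment's token activations attend to a context formed by the running memory bank plus the segment itself, and the segment's activations are then compressed with k-means so the memory grows by only a fixed number of centroids per segment. This is an illustrative reconstruction, not the authors' implementation; the function names, the toy attention with identity projections, and the segment and memory sizes are assumptions made for the example.

```python
import numpy as np

def kmeans_consolidate(activations, num_memories, num_iters=10, seed=0):
    """Compress activations of shape (N, D) into (K, D) centroids with plain k-means.
    The centroids play the role of consolidated memories in this sketch."""
    rng = np.random.default_rng(seed)
    n, _ = activations.shape
    k = min(num_memories, n)
    centroids = activations[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(num_iters):
        # Assign each activation to its nearest centroid.
        dists = np.linalg.norm(activations[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned activations.
        for j in range(k):
            members = activations[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

def attend(queries, context):
    """Toy single-head attention with identity projections (illustration only)."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

def stream_video(segment_activations, num_memories=32):
    """Process segments in order: each segment attends to [memory; its own tokens],
    then its activations are consolidated and appended to the memory bank."""
    dim = segment_activations[0].shape[-1]
    memory = np.empty((0, dim))
    outputs = []
    for tokens in segment_activations:          # tokens: (tokens_per_segment, D)
        context = np.concatenate([memory, tokens], axis=0)
        outputs.append(attend(tokens, context))
        memory = np.concatenate([memory, kmeans_consolidate(tokens, num_memories)], axis=0)
    return outputs, memory

# Example: 4 segments of 256 token activations each, consolidated to 32 memories per segment.
segments = [np.random.randn(256, 64) for _ in range(4)]
outputs, memory = stream_video(segments, num_memories=32)
print(memory.shape)  # (128, 64): the context per segment grows by 32 centroids, not 256 tokens
```

The point of the sketch is the asymptotics: attention cost per segment scales with the consolidated memory size rather than the full history of tokens, which is how a fixed-context pre-trained video transformer can be reused for much longer videos.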