This paper addresses the challenge of training models to understand the world by jointly modeling language and video. The authors leverage the Blockwise RingAttention technique to scale training to long sequences, gradually growing the context size from 4K to 1M tokens. They curate a large dataset of diverse videos and books and introduce solutions to key challenges in vision-language training, such as masked sequence packing and loss weighting. The paper makes the following contributions:
1. **Largest Context Size Neural Network**: Trains a transformer with one of the largest context sizes to date on long video and language sequences, setting new benchmarks in retrieval tasks and long-video understanding.
2. **Solutions for Training Challenges**: Proposes methods to balance language and vision losses, masked sequence packing for mixing sequences of different lengths (see the first sketch after this list), and model-generated QA datasets for long-sequence chat.
3. **Optimized Implementation**: Provides a highly optimized implementation with RingAttention, Blockwise Transformers, masked sequence packing, and other key features for training on million-token multimodal sequences (see the second sketch after this list).
4. **Open-Sourced Models**: Fully open-sources a family of 7B-parameter models capable of processing long text documents and videos of over 1M tokens.
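To make the masked sequence packing in contribution 2 concrete, here is a minimal NumPy sketch (not the paper's JAX implementation): variable-length token sequences are packed into one fixed-length buffer with per-token segment IDs, and an attention mask keeps each packed sequence causally isolated. The function names and the pad-ID convention are assumptions for illustration.

```python
import numpy as np

def pack_sequences(seqs, pack_len, pad_id=0):
    """Pack variable-length token sequences into one buffer of length
    pack_len, recording segment IDs so attention can be masked later.
    Segment 0 is reserved for padding (a convention assumed here)."""
    tokens = np.full(pack_len, pad_id, dtype=np.int64)
    segment_ids = np.zeros(pack_len, dtype=np.int64)
    pos = 0
    for seg, seq in enumerate(seqs, start=1):
        n = len(seq)
        assert pos + n <= pack_len, "sequences exceed pack length"
        tokens[pos:pos + n] = seq
        segment_ids[pos:pos + n] = seg
        pos += n
    return tokens, segment_ids

def packed_attention_mask(segment_ids):
    """True where query i may attend to key j: same segment, causal
    order, and the query is not a padding position."""
    same_seg = segment_ids[:, None] == segment_ids[None, :]
    causal = np.tril(np.ones((len(segment_ids),) * 2, dtype=bool))
    not_pad = segment_ids[:, None] != 0
    return same_seg & causal & not_pad
```

For example, packing `[[5, 6, 7], [8, 9]]` into a buffer of length 8 yields segment IDs `[1, 1, 1, 2, 2, 0, 0, 0]`, and the resulting mask prevents tokens of the second sequence from attending to the first, which is the point of masking during packed training.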
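The Blockwise Transformer core behind contribution 3 can also be sketched. Below is a minimal single-device, non-causal NumPy illustration of blockwise attention with an online softmax, the mechanism that avoids materializing the full attention matrix; in RingAttention proper, the key/value blocks are additionally sharded across devices and rotated around a ring, which this sketch omits.

```python
import numpy as np

def blockwise_attention(q, k, v, block_size):
    """Attention computed one key/value block at a time with an online
    softmax, so the full (S x S) score matrix is never materialized.
    Non-causal and single-device for brevity."""
    S, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)           # unnormalized output accumulator
    m = np.full(S, -np.inf)          # running max logit per query
    l = np.zeros(S)                  # running softmax denominator
    for start in range(0, S, block_size):
        kb = k[start:start + block_size]   # current key block
        vb = v[start:start + block_size]   # current value block
        s = (q @ kb.T) * scale             # scores against this block only
        m_new = np.maximum(m, s.max(axis=-1))
        corr = np.exp(m - m_new)           # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        l = l * corr + p.sum(axis=-1)
        out = out * corr[:, None] + p @ vb
        m = m_new
    return out / l[:, None]
```

The result matches standard softmax attention up to floating-point error, but the score matrix held at any moment is only S x block_size rather than S x S, which is what makes million-token contexts tractable when combined with the ring-style sharding of K/V blocks.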
The paper demonstrates the effectiveness of the proposed approach across a range of evaluations, including single-needle retrieval, multi-needle retrieval, short-context language tasks, and chat. The results show that the model performs well on complex, long-form tasks spanning video and language, paving the way for broader AI capabilities.
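As an illustration of the single-needle retrieval setup, here is a minimal evaluation harness sketch, not the paper's actual benchmark code: a random passkey is embedded at a controlled depth in filler text, and accuracy is the fraction of trials in which the model repeats it. `query_model` is a hypothetical stand-in for however the released models are invoked, and the filler sentence and characters-per-token ratio are assumptions.

```python
import random
import string

def make_needle_prompt(context_tokens, depth, chars_per_token=4):
    """Build a needle-in-a-haystack prompt: filler text with a random
    passkey inserted at a relative depth in [0, 1], plus the question."""
    passkey = "".join(random.choices(string.digits, k=6))
    needle = f" The secret passkey is {passkey}. "
    filler = "The grass is green. The sky is blue. "
    n_chars = context_tokens * chars_per_token
    haystack = (filler * (n_chars // len(filler) + 1))[:n_chars]
    pos = int(len(haystack) * depth)
    prompt = (haystack[:pos] + needle + haystack[pos:]
              + "\nWhat is the secret passkey?")
    return prompt, passkey

def needle_accuracy(query_model, context_tokens, depths, trials=10):
    """Fraction of trials where the model's answer contains the passkey.
    query_model is a hypothetical prompt -> completion callable."""
    hits = 0
    for depth in depths:
        for _ in range(trials):
            prompt, passkey = make_needle_prompt(context_tokens, depth)
            hits += passkey in query_model(prompt)
    return hits / (len(depths) * trials)
```

Sweeping both the context length and the needle depth, as in the paper's retrieval figures, then reduces to calling `needle_accuracy` over a grid of `(context_tokens, depth)` pairs.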