World Model on Million-Length Video and Language with Blockwise RingAttention


23 Jul 2024 | Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
This paper presents LWM (Large World Model), a large-scale world model capable of processing millions of tokens of video and language data, using Blockwise RingAttention to scale training to very long sequences. LWM is trained on a diverse dataset of videos and books, with the context size grown progressively from 4K to 1M tokens, and achieves state-of-the-art performance in long-video understanding and long-context fact retrieval.

Key contributions include the largest-context transformer trained on long video and language sequences; solutions to the challenges of joint vision-language training, including masked sequence packing for mixing examples of different lengths and loss weighting to balance the language and vision objectives; and a highly optimized implementation built on RingAttention, Blockwise Transformers, and masked sequence packing (sketched below). The models are fully open-sourced as a family of 7B-parameter models capable of processing text documents and videos of over 1M tokens.

LWM performs strongly on both language and vision tasks, including long-video question answering, image and video generation, and multi-needle fact retrieval over long documents. Training combines text-image, text-video, and book data, with sequence lengths increased progressively across stages. The paper also stresses the importance of diverse training data and identifies context utilization as a direction for further research.
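The core scaling mechanism is blockwise attention: queries attend to one key-value block at a time with a streaming softmax, so the full attention matrix is never materialized; RingAttention then shards the sequence across devices and rotates the key-value blocks around a device ring, overlapping communication with compute. Below is a minimal single-device sketch of that inner loop in JAX; the function and variable names are illustrative, not the paper's API, and causal masking is omitted for brevity.

```python
import jax
import jax.numpy as jnp

def blockwise_attention(q, k, v, block_size):
    """Exact softmax attention computed one key/value block at a time.

    A running (max, sum, accumulator) triple gives a numerically stable
    streaming softmax, so memory scales with block_size, not seq_len**2.
    In RingAttention this loop is distributed: each device keeps a query
    block while the key/value blocks rotate around a device ring.
    """
    seq_len, d = q.shape
    scale = 1.0 / jnp.sqrt(d)
    acc = jnp.zeros((seq_len, d))             # unnormalized weighted values
    row_max = jnp.full((seq_len,), -jnp.inf)  # running max of logits
    row_sum = jnp.zeros((seq_len,))           # running sum of exp(logits - max)
    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        logits = (q @ k_blk.T) * scale                      # (seq_len, block)
        new_max = jnp.maximum(row_max, logits.max(axis=-1))
        correction = jnp.exp(row_max - new_max)             # rescale old stats
        p = jnp.exp(logits - new_max[:, None])
        acc = acc * correction[:, None] + p @ v_blk
        row_sum = row_sum * correction + p.sum(axis=-1)
        row_max = new_max
    return acc / row_sum[:, None]

# Sanity check against the naive full-matrix computation.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(k1, (512, 64))
k = jax.random.normal(k2, (512, 64))
v = jax.random.normal(k3, (512, 64))
out = blockwise_attention(q, k, v, block_size=128)
ref = jax.nn.softmax((q @ k.T) / jnp.sqrt(64.0), axis=-1) @ v
assert jnp.allclose(out, ref, atol=1e-4)
```

Because each block only updates running statistics, attention memory grows with the block size rather than the sequence length, which is what makes million-token contexts feasible once the loop is spread over a device ring.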
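Masked sequence packing, which the paper highlights as necessary for efficient vision-language training, concatenates several shorter examples into one long training sequence. Doing this correctly needs an attention mask that is block-diagonal across examples, so tokens never attend across example boundaries, plus a per-token loss mask with optional weights to rebalance the language and vision losses. A minimal sketch, assuming per-token segment IDs with 0 reserved for padding; the names here are illustrative:

```python
import jax.numpy as jnp

def packing_masks(segment_ids, loss_weights=None):
    """Masks for a packed sequence: tokens may only attend (causally)
    within their own segment, and padding contributes no loss."""
    seg = jnp.asarray(segment_ids)
    n = seg.shape[0]
    same_example = seg[:, None] == seg[None, :]
    causal = jnp.tril(jnp.ones((n, n), dtype=bool))
    attention_mask = same_example & causal       # True = may attend
    loss_mask = (seg != 0).astype(jnp.float32)   # segment 0 = padding
    if loss_weights is not None:                 # e.g. text vs. vision weights
        loss_mask = loss_mask * jnp.asarray(loss_weights)
    return attention_mask, loss_mask

# Two packed examples (segments 1 and 2) followed by one padding token.
segment_ids = [1, 1, 1, 2, 2, 0]
# Hypothetical per-token weights, e.g. down-weighting vision tokens.
weights = [1.0, 1.0, 1.0, 0.5, 0.5, 0.0]
attn_mask, loss_mask = packing_masks(segment_ids, weights)
```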
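Progressive training grows the context stage by stage rather than training at 1M tokens from the start, so most gradient steps are taken at cheap, short lengths. A common companion in long-context recipes of this kind is raising the RoPE θ base as the window grows, so the slowest rotary frequency still distinguishes distant positions. The stage lengths and θ values below are hypothetical, purely to illustrate the shape of such a schedule:

```python
import jax.numpy as jnp

def rope_inv_frequencies(head_dim, theta):
    """Inverse frequencies for rotary position embeddings; a larger theta
    slows the lowest frequency so far-apart positions stay distinguishable."""
    return 1.0 / (theta ** (jnp.arange(0, head_dim, 2) / head_dim))

# Hypothetical (context length, RoPE theta) schedule; illustrative values,
# not the paper's exact settings.
stages = [(4_096, 1e4), (32_768, 1e6), (262_144, 1e7), (1_048_576, 5e7)]
for max_len, theta in stages:
    inv_freq = rope_inv_frequencies(64, theta)
    # Angle of the slowest rotary dimension at the last position; keeping
    # it bounded as max_len grows is the point of raising theta.
    slowest = (max_len - 1) * inv_freq[-1]
    print(f"ctx={max_len:>9,}  theta={theta:.0e}  slowest_angle={slowest:.2f}")
```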
Overall, LWM is a significant step toward large-scale multimodal models that understand both human knowledge and the physical world.