4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

14 Jun 2024 | Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir
The paper introduces 4M-21, a single any-to-any vision model trained on tens of diverse modalities, including images, text, semantic and geometric data, and feature maps from state-of-the-art models such as DINOv2 and ImageBind. The model is trained with a multimodal masked pre-training scheme: each modality is mapped to discrete tokens by modality-specific tokenizers, which lets one model predict any modality from any subset of the others. It delivers strong results across a wide range of tasks, outperforming existing models in both task coverage and per-task performance, and it enables fine-grained, controllable multimodal generation as well as the study of distilling multiple capabilities into a single unified model.

4M-21 is trained at the three-billion-parameter scale on a large dataset and is open-sourced. Out of the box, it supports multimodal retrieval and steerable generation, and it also transfers well, improving results on a variety of downstream tasks. The paper closes with limitations and future directions, including the potential for transfer and emergent capabilities, better tokenization, and co-training on partially aligned datasets.
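To make the training scheme concrete, the sketch below illustrates masked any-to-any modeling over discretely tokenized modalities. It is a toy approximation, not the authors' implementation: the vocabulary size, modality count, model dimensions, and the one-modality-in / one-modality-out sampling are all simplifying assumptions made here for illustration (4M-21 itself masks and predicts arbitrary token subsets across all modalities).

```python
# Toy sketch of masked any-to-any pre-training over discrete tokens.
# Everything here (sizes, sampling, architecture) is an illustrative
# assumption and NOT the 4M-21 implementation.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024     # assumed size of each modality's token vocabulary
NUM_MODALITIES = 3    # e.g. RGB tokens, depth tokens, caption tokens (hypothetical)
SEQ_LEN = 16          # tokens per modality in this toy example
DIM = 256


class AnyToAnyToy(nn.Module):
    """Tiny encoder-decoder over tokenized modalities (illustration only)."""

    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, DIM)
        self.mod_emb = nn.Embedding(NUM_MODALITIES, DIM)           # which modality a token belongs to
        self.pos_emb = nn.Parameter(torch.zeros(1, SEQ_LEN, DIM))  # position within a modality
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(DIM, VOCAB_SIZE)

    def forward(self, in_tokens, in_mod, tgt_mod):
        # Encode the visible tokens of the input modality.
        memory = self.encoder(self.tok_emb(in_tokens) + self.mod_emb(in_mod) + self.pos_emb)
        # Decode the target modality: queries carry only modality + position,
        # so the model must predict the masked-out target tokens.
        queries = self.mod_emb(tgt_mod) + self.pos_emb
        return self.head(self.decoder(queries, memory))  # logits over the token vocabulary


model = AnyToAnyToy()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Fake pre-tokenized data: one token grid per modality (stand-in for tokenizer outputs).
batch = 4
tokens = torch.randint(0, VOCAB_SIZE, (batch, NUM_MODALITIES, SEQ_LEN))

# Randomly pick which modality is visible and which must be reconstructed.
perm = torch.randperm(NUM_MODALITIES)
in_idx, tgt_idx = perm[0].item(), perm[1].item()
in_mod = torch.full((batch, SEQ_LEN), in_idx, dtype=torch.long)
tgt_mod = torch.full((batch, SEQ_LEN), tgt_idx, dtype=torch.long)

logits = model(tokens[:, in_idx], in_mod, tgt_mod)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), tokens[:, tgt_idx].reshape(-1))
loss.backward()
opt.step()
print(f"toy cross-modal prediction loss: {loss.item():.3f}")
```

Randomly sampling which tokens serve as inputs and which as targets at training time, rather than fixing an input/output pair, is what gives such a model its any-to-any behavior at inference: any subset of modalities can condition the generation of any other.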