4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

14 Jun 2024 | Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir
The paper presents 4M-21, an any-to-any vision model that handles tens of diverse modalities and tasks without a loss in performance compared to specialized single- or few-task models. The model is trained on a wide range of modalities, including images, text, and semantic and geometric features, using modality-specific tokenizers that map each modality into discrete tokens. This approach lets the model generate any modality from any subset of the training modalities, giving it strong out-of-the-box capabilities and enabling fine-grained, controllable multimodal generation. The largest model has 3 billion parameters, is trained on a large multimodal dataset, and is evaluated on a variety of tasks, where it compares favorably with existing models. The paper also discusses the benefits of discrete tokenization, co-training on multiple datasets, and the model's ability to perform multimodal retrieval and generation. The code and pre-trained models are open-sourced, and the project website provides additional visualizations and details.
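To make the tokenize-then-predict idea concrete, here is a minimal sketch of the any-to-any training interface: every modality is first turned into a short sequence of discrete tokens by its own tokenizer, and a single encoder-decoder Transformer predicts the tokens of a held-out target modality from the tokens of an arbitrary input subset. This is an illustrative assumption of the general recipe, not the 4M-21 implementation; all names (ToyTokenizer, AnyToAnyModel, the modality list, vocabulary size, and sequence length) are hypothetical.

```python
# Hedged sketch of the any-to-any, discrete-token training setup.
# Nothing here is the authors' code; shapes and names are illustrative only.
import random
import torch
import torch.nn as nn

VOCAB_SIZE = 1024       # assumed shared discrete-token vocabulary size
MAX_LEN = 32            # assumed per-modality token budget
MODALITIES = ["rgb", "depth", "caption", "semseg"]  # small subset for illustration


class ToyTokenizer:
    """Stand-in for a modality-specific tokenizer (e.g. a VQ-VAE or a text tokenizer)."""
    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # Hash continuous inputs into discrete ids; real tokenizers are learned.
        return (x.flatten()[:MAX_LEN] * 1e3).long().abs() % VOCAB_SIZE


class AnyToAnyModel(nn.Module):
    """One encoder-decoder operating on the discrete tokens of all modalities."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, d_model)
        self.modality_emb = nn.Embedding(len(MODALITIES), d_model)
        self.backbone = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def embed(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # Token embedding plus a modality embedding so the backbone knows the source.
        mod_id = torch.tensor(MODALITIES.index(modality))
        return self.token_emb(tokens) + self.modality_emb(mod_id)

    def forward(self, inputs: dict, target_modality: str, tgt_tokens: torch.Tensor):
        # Concatenate the token embeddings of all observed input modalities.
        src = torch.cat([self.embed(t, m) for m, t in inputs.items()], dim=0)
        tgt = self.embed(tgt_tokens, target_modality)
        out = self.backbone(src.unsqueeze(0), tgt.unsqueeze(0))
        return self.head(out)  # logits over the discrete token vocabulary


if __name__ == "__main__":
    tokenize = ToyTokenizer()
    model = AnyToAnyModel()

    # Fake aligned sample: each modality becomes a short discrete-token sequence.
    sample = {m: tokenize(torch.randn(8, 8)) for m in MODALITIES}

    # Pick a held-out target modality and treat the rest as inputs (any-to-any).
    target = random.choice(MODALITIES)
    inputs = {m: t for m, t in sample.items() if m != target}

    logits = model(inputs, target, sample[target])
    loss = nn.functional.cross_entropy(logits.squeeze(0), sample[target])
    print(f"predict {target} from {list(inputs)}: loss={loss.item():.3f}")
```

Because every modality lives in the same discrete-token space, sampling a different input/target split at each step is all it takes to train one network for many directions of prediction; at inference time the same model can be conditioned on any subset of modalities it saw during training.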