The Evolution of Multimodal Model Architectures

28 May 2024 | Shaktri N. Wadekar, Abhishek Chaurasia, Aman Chadha, Eugenio Culurciello
This paper identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape, distinguished by how they integrate multimodal inputs into the deep neural network. The first two types (Type-A and Type-B) fuse multimodal inputs deeply, within the internal layers of the model, whereas the latter two (Type-C and Type-D) perform early fusion at the input stage. Type-A employs standard cross-attention for fusion, while Type-B uses custom-designed layers within the model's internal layers. Type-C relies on modality-specific encoders, and Type-D leverages tokenizers to process the modalities at the model's input stage.

The identified architecture types aid in monitoring any-to-any multimodal model development. Notably, Type-C and Type-D are currently favored in the construction of any-to-any multimodal models, and Type-C, which does not tokenize its inputs, is emerging as a viable alternative to the tokenizer-based Type-D. This work weighs the advantages and disadvantages of each architecture type along several axes: data and compute requirements, architecture complexity, scalability, ease of adding modalities, training objectives, and any-to-any multimodal generation capability. The paper also systematically categorizes models by architecture type, facilitating monitoring of developments in the multimodal domain, and identifies the principal architecture types involved in constructing any-to-any modality multimodal models, a topic not covered in other survey works.

In summary, the paper discusses four types of multimodal model architectures: Type-A (Standard Cross-Attention based Deep Fusion), Type-B (Custom Layer based Deep Fusion), Type-C (Non-Tokenized Early Fusion), and Type-D (Tokenized Early Fusion). Each type is described in detail, including its training data and methods, the compute resources required for training, and its advantages and disadvantages. The paper concludes with a summary of the study's contributions, chief among them the identification of the four architecture types and their associated trade-offs.
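The core distinction between the deep-fusion and early-fusion families can be sketched in a few lines of NumPy. This is an illustrative toy, not code from the paper: the function names, random placeholder weights, and dimensions are assumptions. Deep fusion (Type-A style) injects one modality into another via cross-attention inside an internal layer, while early fusion (Type-C/D style) concatenates the modality sequences before the main model ever sees them.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(text_tokens, image_tokens, d):
    """Type-A style deep fusion (sketch): text queries attend to image
    keys/values inside a layer. Weights are random placeholders."""
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
    Q = text_tokens @ Wq          # queries from the text stream
    K = image_tokens @ Wk         # keys from the image stream
    V = image_tokens @ Wv         # values from the image stream
    attn = softmax(Q @ K.T / np.sqrt(d))
    return attn @ V               # fused text representation, (n_text, d)

def early_fusion(text_tokens, image_tokens):
    """Type-C/D style early fusion (sketch): modality sequences, already
    mapped to a shared width, are concatenated at the model's input."""
    return np.concatenate([text_tokens, image_tokens], axis=0)

d = 8
text = np.random.default_rng(1).standard_normal((4, d))   # 4 text tokens
image = np.random.default_rng(2).standard_normal((6, d))  # 6 image patches

fused_deep = cross_attention_fusion(text, image, d)
fused_early = early_fusion(text, image)
print(fused_deep.shape, fused_early.shape)  # (4, 8) (10, 8)
```

Note the shapes: deep fusion preserves the text sequence length (image information is folded into it), whereas early fusion yields one longer joint sequence for the backbone to process, which is what makes adding modalities to Type-C/D models comparatively simple.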