28 May 2024 | Shakti N. Wadekar, Abhishek Chaurasia, Aman Chadha, Eugenio Culurciello
This work identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape. The research systematically categorizes models by architecture type, facilitating the monitoring of developments in the multimodal domain.

Unlike recent survey papers that provide general information on multimodal architectures, this study conducts a comprehensive exploration of architectural details and identifies four specific architectural types: Types A, B, C, and D. These types are distinguished by their methodologies for integrating multimodal inputs into deep neural network models. Types A and B deeply fuse multimodal inputs within the internal layers of the model, while Types C and D facilitate early fusion at the input stage. Type A employs standard cross-attention, while Type B uses custom-designed layers for modality fusion within the internal layers. Type C utilizes modality-specific encoders, and Type D leverages tokenizers to process the modalities at the model's input stage. The identified architecture types aid in monitoring the development of any-to-any multimodal models. Notably, Types C and D are currently favored in constructing any-to-any multimodal models. Type C, a non-tokenizing multimodal model architecture, is emerging as a viable alternative to Type D, which uses input-tokenizing techniques. The study highlights the advantages and disadvantages of each architecture type based on data and compute requirements, architecture complexity, scalability, ease of integrating modalities, and any-to-any multimodal generation capability.
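The distinction between deep fusion (Types A/B) and early fusion (Types C/D) can be sketched in a few lines of toy code. This is a hypothetical, dependency-free illustration of *where* fusion happens, not an implementation of any model from the survey; the encoders, the "attention" function, and the layer update are all stand-ins.

```python
# Toy sketch contrasting the fusion points of the four architecture types.
# "Embeddings" are plain lists of floats; real models use tensors.

def encode_text(text):
    # stand-in text encoder: one float per character (hypothetical)
    return [float(ord(c) % 7) for c in text]

def encode_image(pixels):
    # stand-in image encoder: identity over pixel values (hypothetical)
    return [float(p) for p in pixels]

def cross_attention(query, context):
    # toy "attention": average each query element with the context mean
    ctx_mean = sum(context) / len(context)
    return [(q + ctx_mean) / 2 for q in query]

def deep_fusion_forward(text, pixels, depth=3):
    # Types A/B: deep fusion -- the image stream is injected into the
    # text stream inside the internal layers (Type A via standard
    # cross-attention; Type B would use custom fusion layers instead).
    h = encode_text(text)
    img = encode_image(pixels)
    for _ in range(depth):
        h = cross_attention(h, img)   # fusion happens at every layer
    return h

def early_fusion_forward(text, pixels, depth=3):
    # Type C: early fusion -- modality-specific encoders produce
    # embeddings that are concatenated once, at the model's input.
    h = encode_text(text) + encode_image(pixels)
    for _ in range(depth):
        h = [x * 0.9 for x in h]      # internal layers see one fused sequence
    return h

def tokenize_image(pixels, vocab_size=16):
    # Type D: a discrete tokenizer maps the modality to token ids at the
    # input, letting one sequence model both consume and generate it.
    return [p % vocab_size for p in pixels]
```

The sketch makes the trade-off discussed above concrete: deep fusion touches every internal layer (more architectural complexity per added modality), while early fusion and tokenization confine modality handling to the input stage, which is one reason Types C and D dominate any-to-any models.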