ModaVerse: Efficiently Transforming Modalities with LLMs


4 Apr 2024 | Xinyu Wang, Bohan Zhuang, Qi Wu
The paper introduces ModaVerse, a Multi-modal Large Language Model (MLLM) designed to understand and transform content across modalities, including images, videos, and audio. Traditional MLLM frameworks often rely on aligning the latent spaces of textual and non-textual features, which can be complex and resource-intensive. Inspired by LLM-as-agent methodologies, the authors propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language, simplifying training and improving efficiency.

The proposed Adaptor+Agent paradigm combines the benefits of adaptor training and LLM-as-agent approaches. On the input side, linear adaptors map non-textual features into the LLM's textual space, allowing the model to comprehend multi-modal inputs while keeping training efficient. On the output side, the LLM is treated as an agent that invokes external generative models to produce non-text outputs, without the need for additional projection layers.

The I/O alignment strategy trains the LLM to generate meta-responses: detailed instructions that specify how to activate the external generative models. This avoids feature-level output alignment and collapses training into a single stage. Experiments on various benchmarks demonstrate that ModaVerse achieves performance comparable to state-of-the-art methods while requiring less training data and fewer computational resources.
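To make the input side concrete, the following is a minimal sketch (not the paper's actual implementation) of how a single linear adaptor could project features from a frozen modality encoder into the LLM's token-embedding space. The class name, dimensions, and token budget are illustrative assumptions.

```python
# Minimal sketch of an input-side linear adaptor.
# Class name, dimensions, and token budget are hypothetical;
# ModaVerse's actual adaptor details may differ.
import torch
import torch.nn as nn


class LinearAdaptor(nn.Module):
    """Projects frozen encoder features (image/video/audio embeddings)
    into the LLM's token-embedding space so they can be prepended to
    the text tokens as 'soft' multi-modal tokens."""

    def __init__(self, enc_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        # A single linear map per modality keeps the trainable parameter
        # count small; the encoder and the LLM themselves stay frozen.
        self.proj = nn.Linear(enc_dim, llm_dim)
        self.num_tokens = num_tokens

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq, enc_dim) from a frozen modality encoder
        tokens = self.proj(feats)               # (batch, seq, llm_dim)
        return tokens[:, : self.num_tokens]     # keep a fixed token budget


# Usage: project placeholder encoder features and splice them before the
# (placeholder) text embeddings that would be fed to the frozen LLM.
adaptor = LinearAdaptor(enc_dim=1024, llm_dim=4096)
image_feats = torch.randn(2, 64, 1024)           # stand-in encoder output
soft_tokens = adaptor(image_feats)               # (2, 32, 4096)
text_embeds = torch.randn(2, 16, 4096)           # stand-in text embeddings
llm_inputs = torch.cat([soft_tokens, text_embeds], dim=1)
print(llm_inputs.shape)                          # torch.Size([2, 48, 4096])
```

Only the adaptor's projection weights are trained here, which is what keeps the input alignment lightweight compared with full multi-modal pretraining.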
The paper also reviews related work, including multi-modal pretraining, adaptor training, and LLM-as-agent approaches, and details the ModaVerse pipeline: input projection, meta-response generation, and final response generation. The results show that ModaVerse performs well on text-to-image, image-to-text, text-to-audio, audio-to-text, text-to-video, and video-to-text tasks, with some limitations in image editing and a tendency to produce unintended outputs when the prompt lacks explicit language cues.
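As a rough illustration of the output side of the pipeline, here is a sketch of how a meta-response might be parsed and routed to external generative models. The JSON schema, field names, and the generator registry are assumptions made for illustration, not the paper's actual meta-response format.

```python
# Illustrative sketch of output-side "LLM-as-agent" dispatch.
# The meta-response schema and the generator registry are assumptions;
# ModaVerse's actual format and external models may differ.
import json
from typing import Callable, Dict

# Hypothetical registry mapping a target modality to an external generator
# (in practice these would be text-to-image/audio/video models).
GENERATORS: Dict[str, Callable[[str], str]] = {
    "image": lambda prompt: f"<image generated from: {prompt}>",
    "audio": lambda prompt: f"<audio generated from: {prompt}>",
    "video": lambda prompt: f"<video generated from: {prompt}>",
}


def dispatch(meta_response: str) -> str:
    """Parse the LLM's meta-response and invoke the requested generator.

    If the meta-response asks for plain text, return it directly; otherwise
    call the named external model with the instruction the LLM produced.
    """
    meta = json.loads(meta_response)
    modality = meta.get("output_modality", "text")
    if modality == "text":
        return meta["content"]
    return GENERATORS[modality](meta["instruction"])


# Example meta-response the fine-tuned LLM might emit for a text-to-image query.
meta = json.dumps({
    "output_modality": "image",
    "instruction": "a watercolor painting of a lighthouse at dusk",
})
print(dispatch(meta))
```

Because the final non-text output is produced by the invoked external model, no output projection layers need to be trained, which is the key to the single-stage training the paper describes.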