ModaVerse is a multi-modal large language model (MLLM) that can understand and transform content across various modalities, including images, videos, and audio. The paper introduces a novel approach, Adaptor+Agent, which combines the efficiency of LLM-as-agent methods with the flexibility of adaptor training. This approach aligns the LLM's output with the input of generative models at the level of natural language, avoiding the complexities of latent feature alignment and simplifying training. The model uses linear projection layers to map non-textual features into the LLM's textual space and employs an LLM-as-agent design to produce non-textual outputs. The proposed method achieves performance comparable to state-of-the-art models while requiring less data and fewer training resources. Quantitative and qualitative results show that ModaVerse performs well on various benchmarks and can generate outputs in multiple modalities. Its limitations include difficulty with image editing tasks and a reliance on explicit language instructions to produce appropriate outputs. The paper concludes that the Adaptor+Agent paradigm offers a more efficient and flexible approach for training MLLMs.
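To make the two-part design concrete, the sketch below illustrates (i) a linear projection adaptor that maps pooled encoder features into the LLM's embedding space and (ii) a minimal parser that turns an agent-style textual meta-response into a call to a downstream generative model. This is a minimal sketch under stated assumptions: the feature dimensions, the `ModalityAdaptor` and `dispatch_meta_response` names, and the `<image>...</image>` tag convention are illustrative and not taken from the paper's actual implementation.

```python
import re
import torch
import torch.nn as nn


class ModalityAdaptor(nn.Module):
    """Linear projection from a non-textual encoder's feature space into the
    LLM's token-embedding space (dimensions are illustrative assumptions)."""

    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 4096, num_tokens: int = 4):
        super().__init__()
        # One learnable linear layer per modality; the encoder and LLM would stay frozen.
        self.proj = nn.Linear(encoder_dim, llm_dim * num_tokens)
        self.num_tokens = num_tokens
        self.llm_dim = llm_dim

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, encoder_dim) pooled features from an image/video/audio encoder.
        out = self.proj(feats)
        # Reshape to (batch, num_tokens, llm_dim) so the projected features can be
        # prepended to the LLM's input embeddings like soft prompt tokens.
        return out.view(feats.size(0), self.num_tokens, self.llm_dim)


def dispatch_meta_response(meta_response: str, text_to_image):
    """Agent-style dispatch: if the LLM's textual output contains a generation
    instruction, forward the natural-language prompt to a generative model.
    The <image>...</image> tag format here is a hypothetical convention."""
    match = re.search(r"<image>(.*?)</image>", meta_response, re.DOTALL)
    if match:
        prompt = match.group(1).strip()
        return text_to_image(prompt)  # e.g. a diffusion model's text-to-image interface
    return None  # plain text reply, nothing to generate


if __name__ == "__main__":
    adaptor = ModalityAdaptor()
    image_feats = torch.randn(2, 1024)        # stand-in for pooled encoder output
    print(adaptor(image_feats).shape)         # torch.Size([2, 4, 4096])

    fake_t2i = lambda prompt: f"[generated image for: {prompt}]"
    reply = "Sure! <image>a watercolor painting of a lighthouse at dusk</image>"
    print(dispatch_meta_response(reply, fake_t2i))
```

Because the agent side operates purely on natural-language prompts, only the small projection layers need training; the encoders, the LLM, and the downstream generative models can remain frozen, which is the efficiency argument the summary makes.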