LLMs Meet Multimodal Generation and Editing: A Survey
9 Jun 2024 | Yingqing He*, Zhaoyang Liu*, Jingye Chen*, Zeyue Tian*, Hongyu Liu*, Xiaowei Chi*, Runtao Liu, Ruibin Yuan*, Yazhou Xing*, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen
This survey explores the integration of large language models (LLMs) with multimodal generation and editing across various domains, including image, video, 3D, and audio. It summarizes recent advancements in these areas, categorizing studies into LLM-based and CLIP/T5-based methods. The survey discusses the roles of LLMs in multimodal generation, technical components, datasets, and emerging applications. It also addresses generative AI safety, multimodal agents, and future directions.

The work provides a systematic overview of multimodal generation and processing, aiming to advance Artificial Intelligence for Generative Content (AIGC) and world models. Key contributions include a comprehensive review of LLMs in multimodal generation, a comparative analysis of the pre-LLM and post-LLM eras, and discussions of AI safety, emerging applications, and future prospects. The survey covers image, video, 3D, and audio generation and editing, emphasizing the role of LLMs in enhancing generation quality and enabling interactive multimodal tasks. It also highlights the importance of multimodal alignment models, such as CLIP and ImageBind, and the integration of LLMs with generative models for tasks like text-to-image, text-to-video, and text-to-3D generation. The survey discusses technical components, datasets, and challenges in multimodal generation, providing insights into the development of multimodal AI systems.
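The multimodal alignment models the survey highlights (CLIP, ImageBind) score how well an image and a caption match by embedding both into a shared space and comparing directions. A minimal sketch of that retrieval step, using random toy vectors in place of real encoder outputs (the embeddings, dimensions, and temperature value here are illustrative assumptions, not taken from any specific model):

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Pairwise cosine similarity between rows of two embedding matrices."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return img @ txt.T

def match_texts_to_images(image_emb, text_emb, temperature=0.07):
    """Softmax over temperature-scaled similarities: for each image,
    a probability distribution over the candidate captions."""
    logits = cosine_similarity_matrix(image_emb, text_emb) / temperature
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

# Toy embeddings: in a real system these would come from an image encoder
# and a text encoder trained contrastively (as in CLIP).
rng = np.random.default_rng(0)
images = rng.normal(size=(2, 8))
texts = np.vstack([
    images[0] + 0.1 * rng.normal(size=8),  # caption close to image 0
    images[1] + 0.1 * rng.normal(size=8),  # caption close to image 1
])
probs = match_texts_to_images(images, texts)
print(probs.argmax(axis=1))  # each image matches its paired caption
```

Contrastive pretraining pushes matched image-text pairs toward high cosine similarity and mismatched pairs toward low similarity, which is what makes this simple dot-product retrieval work at scale.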