LLMs Meet Multimodal Generation and Editing: A Survey
9 Jun 2024 | Yingqing He*, Zhaoyang Liu*, Jingye Chen*, Zeyue Tian*, Hongyu Liu*, Xiaowei Chi*, Runtao Liu, Ruibin Yuan*, Yazhou Xing*, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen
This survey explores the integration of large language models (LLMs) with multimodal generation and editing across various domains, including image, video, 3D, and audio. It summarizes recent advancements in these areas, categorizing studies into LLM-based and CLIP/T5-based methods. The survey discusses the roles of LLMs in multimodal generation, technical components, datasets, and emerging applications. It also addresses generative AI safety, multimodal agents, and future directions.

The work provides a systematic overview of multimodal generation and processing, aiming to advance Artificial Intelligence for Generative Content (AIGC) and world models. Key contributions include a comprehensive review of LLMs in multimodal generation, a comparative analysis of the pre-LLM and post-LLM eras, and discussions of AI safety, emerging applications, and future prospects. The survey covers image, video, 3D, and audio generation and editing, emphasizing the role of LLMs in enhancing generation quality and enabling interactive multimodal tasks. It also highlights the importance of multimodal alignment models, such as CLIP and ImageBind, and the integration of LLMs with generative models for tasks like text-to-image, text-to-video, and text-to-3D generation. The survey discusses technical components, datasets, and challenges in multimodal generation, providing insights into the development of multimodal AI systems.
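The multimodal alignment models the survey highlights (CLIP, ImageBind) score how well an image and a caption match by embedding both into a shared space and comparing directions. A minimal sketch of that retrieval step, using random toy vectors in place of real encoder outputs (the embeddings, dimensions, and temperature value here are illustrative assumptions, not taken from any specific model):

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Pairwise cosine similarity between rows of two embedding matrices."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return img @ txt.T

def match_texts_to_images(image_emb, text_emb, temperature=0.07):
    """Softmax over temperature-scaled similarities: for each image,
    a probability distribution over the candidate captions."""
    logits = cosine_similarity_matrix(image_emb, text_emb) / temperature
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

# Toy embeddings: in a real system these would come from an image encoder
# and a text encoder trained contrastively (as in CLIP).
rng = np.random.default_rng(0)
images = rng.normal(size=(2, 8))
texts = np.vstack([
    images[0] + 0.1 * rng.normal(size=8),  # caption close to image 0
    images[1] + 0.1 * rng.normal(size=8),  # caption close to image 1
])
probs = match_texts_to_images(images, texts)
print(probs.argmax(axis=1))  # each image matches its paired caption
```

Contrastive pretraining pushes matched image-text pairs toward high cosine similarity and mismatched pairs toward low similarity, which is what makes this simple dot-product retrieval work at scale.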