Understanding TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

The paper introduces TheaterGen, a training-free framework that integrates large language models (LLMs) with text-to-image (T2I) models to enable consistent multi-turn image generation. To maintain semantic and contextual consistency across turns, TheaterGen uses an LLM to manage a standardized prompt book that holds the prompt and layout for each character in the target image. The prompt book is used to generate character images and extract guidance information, which is then incorporated into the T2I diffusion model to generate the final image. The paper also introduces a new benchmark, CMIGBench, which contains 8000 multi-turn instructions and evaluates both semantic and contextual consistency. Experimental results show that TheaterGen outperforms state-of-the-art methods on both kinds of consistency, with significant improvements in average character-character similarity and text-image similarity. The framework targets complex multi-turn tasks such as story generation and multi-turn editing, and the results indicate that having an LLM manage the generation process helps keep characters consistent across turns. The paper also discusses the limitations of existing methods and proposes future directions for multi-turn image generation.
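To make the prompt book idea concrete, here is a minimal sketch of how per-character prompts and layouts might be stored and updated from one turn to the next. It is not the authors' implementation: the names `PromptBook`, `CharacterEntry`, and `apply_turn`, the normalized `(x, y, width, height)` box format, and the shape of the update dictionaries are all assumptions made for illustration. In the real framework an LLM would produce the per-turn updates; here they are hard-coded.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical layout format: (x, y, width, height), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class CharacterEntry:
    char_id: str   # stable identifier reused across turns
    prompt: str    # text description of this character
    layout: Box    # where the character should appear


@dataclass
class PromptBook:
    background: str = ""
    characters: Dict[str, CharacterEntry] = field(default_factory=dict)

    def upsert(self, char_id: str, prompt: str, layout: Box) -> None:
        """Add a new character or overwrite an existing one."""
        self.characters[char_id] = CharacterEntry(char_id, prompt, layout)

    def remove(self, char_id: str) -> None:
        """Drop a character; ignore ids that are not in the book."""
        self.characters.pop(char_id, None)


def apply_turn(book: PromptBook, update: dict) -> PromptBook:
    """Apply one turn's edits (standing in for LLM output) to the book."""
    if "background" in update:
        book.background = update["background"]
    for char_id, prompt, layout in update.get("upsert", []):
        book.upsert(char_id, prompt, layout)
    for char_id in update.get("remove", []):
        book.remove(char_id)
    return book


# Three hard-coded turns; an LLM would normally derive these from instructions.
turns = [
    {"background": "a sunny park",
     "upsert": [("cat", "a white cat", (0.1, 0.5, 0.3, 0.3))]},
    {"upsert": [("dog", "a brown dog", (0.6, 0.5, 0.3, 0.3))]},
    {"background": "a snowy park", "remove": ["dog"]},
]

book = PromptBook()
for turn in turns:
    apply_turn(book, turn)
```

Because the cat keeps the same identifier and prompt through all three turns, a downstream generator could reuse the same character description (or a cached character image) each time, which is the kind of cross-turn consistency the summary describes.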

TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

29 Apr 2024 | Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, Yuxin He, Xi Lu, Yue Li, Yifei Li, Yuhao Cheng, Yiqiang Yan, and Xiaodan Liang