TheaterGen 🎭: Character Management with LLM for Consistent Multi-turn Image Generation

29 Apr 2024 | Junhao Cheng1, Baiqiao Yin1, Kaixin Cai1, Minbin Huang2, Hanhui Li1, Yuxin He1, Xi Lu1, Yue Li1, Yifei Li1, Yuhao Cheng3, Yiqiang Yan3, and Xiaodan Liang1,*
The paper introduces TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to maintain semantic and contextual consistency in multi-turn image generation. TheaterGen uses LLMs as "Screenwriters" to manage a standardized prompt book, which includes character prompts and layout designs. The framework then generates character images and extracts guidance information, which is used in the reverse denoising process of T2I diffusion models to produce the final images. The authors also introduce CMIGBench, a new benchmark with 8000 multi-turn instructions, to evaluate semantic and contextual consistency in multi-turn image generation.
Extensive experiments show that TheaterGen outperforms state-of-the-art methods, with significant improvements in character-character similarity and text-image similarity. The paper covers the contributions, related work, methodological details, and experimental results, and highlights the effectiveness of TheaterGen in generating consistent, high-quality images in multi-turn scenarios.
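To make the "prompt book" idea more concrete, here is a minimal Python sketch of how character prompts and layouts might be tracked across turns. It is an illustration only, not the authors' implementation: the class names `CharacterEntry` and `PromptBook`, the field names, and the `(x, y, width, height)` box format are all assumptions introduced here, and the LLM call and the diffusion step are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterEntry:
    """One character tracked across turns (hypothetical schema, not from the paper)."""
    char_id: str
    prompt: str                      # text description of the character
    box: tuple[int, int, int, int]   # assumed layout format: (x, y, width, height)


@dataclass
class PromptBook:
    """Hypothetical stand-in for the standardized prompt book."""
    background: str = ""
    characters: dict[str, CharacterEntry] = field(default_factory=dict)

    def upsert(self, entry: CharacterEntry) -> None:
        # Add a new character, or overwrite an existing one with the same id.
        self.characters[entry.char_id] = entry

    def remove(self, char_id: str) -> None:
        # Drop a character; silently ignore ids that are not present.
        self.characters.pop(char_id, None)

    def scene_prompt(self) -> str:
        # Join the background and character prompts in insertion order.
        parts = [c.prompt for c in self.characters.values()]
        if self.background:
            parts.insert(0, self.background)
        return ", ".join(parts)


# Turn 1: one character enters the scene.
book = PromptBook(background="a sunny park")
book.upsert(CharacterEntry("dog", "a small white dog", (50, 300, 200, 180)))

# Turn 2: a second character is added; the dog's prompt is left unchanged.
book.upsert(CharacterEntry("cat", "an orange tabby cat", (300, 320, 160, 150)))

# Turn 3: the dog moves, reusing its stored prompt so its description stays fixed.
dog = book.characters["dog"]
book.upsert(CharacterEntry("dog", dog.prompt, (120, 300, 200, 180)))

print(book.scene_prompt())
```

In this sketch, the point of keeping one entry per character is that later turns edit the layout or membership of the scene while reusing the stored description, which is one plausible way to read the consistency goal described above.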