LCM-Lookahead for Encoder-based Text-to-Image Personalization


4 Apr 2024 | RINON GAL*, Tel Aviv University, NVIDIA, Israel; OR LICHTER*, Tel Aviv University, Israel; ELAD RICHARDSON*, Tel Aviv University, Israel; OR PATASHNIK, Tel Aviv University, Israel; AMIT H. BERMANO, Tel Aviv University, Israel; GAL CHECHIK, NVIDIA, Israel; DANIEL COHEN-OR, Tel Aviv University, Israel
This paper introduces LCM-Lookahead, a method for applying image-space losses to encoder-based text-to-image personalization. The approach leverages a latent consistency model (LCM) as a "shortcut" mechanism: a pretrained LCM-LoRA produces a high-quality preview of the denoised image from an intermediate diffusion step, and image-space losses, such as an identity loss, are computed on this preview and backpropagated through the encoder. This improves identity preservation and prompt alignment without sacrificing layout diversity.

The method further introduces an extended self-attention mechanism that lets the denoising network draw visual features directly from the conditioning image, improving identity fidelity. To train the encoder, the authors build a synthetic dataset generated with SDXL-Turbo, which provides consistent depictions of the same identity across a wide range of prompts and styles and helps the model generalize to new identities.

Experiments compare the method against several baselines, including IP-Adapter, InstantID, and PhotoMaker, and show improved identity preservation and prompt alignment in both qualitative and quantitative evaluations. The paper also discusses limitations of the approach, including the potential for biases and the need for careful handling of sensitive data.
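The lookahead loss can be sketched in a few lines. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: `lcm_preview` stands in for the one-step clean-latent prediction that a pretrained LCM-LoRA would provide (here expressed via the standard DDPM noise parameterization), and `identity_loss` stands in for a face-embedding loss computed on the decoded preview; both function names and the embedding inputs are hypothetical.

```python
import numpy as np

def lcm_preview(z_t, eps_pred, alpha_bar_t):
    """One-step estimate of the clean latent x0 from a noisy latent z_t.

    Assumes the standard forward-process parameterization:
        z_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    so that, given a noise prediction eps_pred, x0 can be recovered
    directly instead of running the remaining denoising steps.
    """
    a = np.sqrt(alpha_bar_t)
    s = np.sqrt(1.0 - alpha_bar_t)
    return (z_t - s * eps_pred) / a

def identity_loss(emb_preview, emb_ref):
    """Image-space identity loss: 1 - cosine similarity between the face
    embedding of the decoded preview and that of the reference image."""
    cos = float(
        emb_preview @ emb_ref
        / (np.linalg.norm(emb_preview) * np.linalg.norm(emb_ref) + 1e-8)
    )
    return 1.0 - cos
```

In training, the preview latent would be decoded by the VAE and passed through a face-recognition network before the loss is taken; gradients then flow through the single LCM step back into the encoder, rather than through many diffusion steps.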
The authors conclude that their method provides a significant improvement in text-to-image personalization, and that further research is needed to address the remaining challenges.
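The extended self-attention mentioned above can be illustrated in isolation: keys and values computed from the conditioning image are concatenated with the generated image's own keys and values, so every query token can also attend to reference features. The following is a minimal single-head numpy sketch, with hypothetical names and without the projection layers a real attention block would include.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extended_self_attention(q, k, v, k_cond, v_cond):
    """Self-attention whose key/value set is extended with features from
    the conditioning image, letting generated tokens draw visual detail
    from the reference. Shapes: q, k, v are (n, d); k_cond, v_cond are
    (n_cond, d)."""
    d = q.shape[-1]
    k_ext = np.concatenate([k, k_cond], axis=0)   # (n + n_cond, d)
    v_ext = np.concatenate([v, v_cond], axis=0)   # (n + n_cond, d)
    attn = softmax(q @ k_ext.T / np.sqrt(d))      # (n, n + n_cond)
    return attn @ v_ext                           # (n, d)
```

Setting `k_cond` and `v_cond` to empty arrays recovers ordinary self-attention, which is why this extension can be grafted onto a pretrained diffusion backbone.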