4 Mar 2024 | Zhengyao Lv, Yuxiang Wei, Wangmeng Zuo, Kwan-Yee K. Wong
This supplementary material provides additional details about the Layout-Free Prior Preservation (LFP) loss, which preserves the semantic priors of the pre-trained model during fine-tuning. The LFP loss operates on text-image pairs that have no semantic masks, so the model can retain the pre-trained prior knowledge without relying on layout annotations.
The LFP loss follows the standard denoising objective, but with the adaptive fusion parameter set to 0. Under this setting, the layout control map is bypassed during synthesis and the model is supervised solely by text-image pairs, which preserves the semantic priors. The loss is calculated as follows:
$$ \mathcal{L}_{LFP} := \mathbb{E}_{\mathcal{E}(x'),\, y',\, \epsilon',\, t'} \big\| \epsilon' - \epsilon_{\theta,\alpha=0}\big(z_t', t', \tau_\theta(y')\big) \big\|_2^2. $$
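For concreteness, below is a minimal PyTorch sketch of one LFP training step, assuming a diffusers-style latent diffusion setup (`vae`, `unet`, `text_encoder`, `scheduler`). The `alpha` keyword on the denoiser, which exposes the adaptive fusion parameter, is a hypothetical interface for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def lfp_loss(unet, vae, text_encoder, scheduler, x, y):
    """Layout-Free Prior Preservation loss: a sketch, not the reference code.

    x: a batch of images from layout-free text-image pairs (no semantic masks)
    y: the corresponding text embeddings' input (e.g., tokenized captions)
    """
    # z0 = E(x'): encode the mask-free image into the latent space.
    z0 = vae.encode(x).latent_dist.sample() * vae.config.scaling_factor

    # Sample a timestep t' and Gaussian noise eps', then form the noised
    # latent z_t' with the scheduler's forward process.
    t = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (z0.shape[0],), device=z0.device,
    )
    noise = torch.randn_like(z0)
    zt = scheduler.add_noise(z0, noise, t)

    # tau_theta(y'): text conditioning from the caption.
    cond = text_encoder(y)

    # Key point: the adaptive fusion parameter alpha is forced to 0, so no
    # layout control map enters the forward pass and the fine-tuned model
    # behaves like the pre-trained text-to-image backbone.
    # (`alpha` is a hypothetical argument standing in for eps_{theta, alpha=0}.)
    pred = unet(zt, t, encoder_hidden_states=cond, alpha=0.0)

    # Standard denoising objective: ||eps' - eps_theta(z_t', t', tau(y'))||_2^2.
    return F.mse_loss(pred, noise)
```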
This loss helps the model retain the semantic priors of the pre-trained model even when layout information is unavailable. The results show that the model generates diverse images with improved visual quality and semantic consistency.