23 Aug 2024 | Shimon Vainer, Mark Boss, Mathias Parger, Konstantin Kutsy, Dante De Nigris, Ciara Rowles, Nicolas Perony, Simon Donne
This paper proposes a collaborative control approach for generating physically-based rendering (PBR) images conditioned on geometry and text prompts. The method directly models the joint distribution of RGB and PBR images, avoiding both the inaccuracies of purely RGB-based generation and the ambiguity of extracting PBR maps from RGB. A pre-trained RGB model is kept frozen while a new PBR model is trained in parallel, using a novel cross-network communication paradigm that lets the PBR model leverage the RGB model's expressivity and internal state. This enables high-quality, diverse PBR content, even for objects with unlikely appearances. Freezing the RGB model prevents catastrophic forgetting and keeps it compatible with techniques such as IPAdapter.
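To make the parallel-network design concrete, here is a minimal PyTorch sketch of a frozen RGB branch and a trainable PBR branch linked by learned linear bridges between their hidden states. The names (`CrossNetworkLink`, `JointDenoiser`) and the exact form of the bridge are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class CrossNetworkLink(nn.Module):
    """Hypothetical linear bridge exchanging hidden states between the
    frozen RGB branch and the trainable PBR branch at one layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.rgb_to_pbr = nn.Linear(dim, dim)
        self.pbr_to_rgb = nn.Linear(dim, dim)
        # Zero-init (a ControlNet-style choice, assumed here) so the frozen
        # RGB path is unperturbed at the start of training.
        nn.init.zeros_(self.pbr_to_rgb.weight)
        nn.init.zeros_(self.pbr_to_rgb.bias)

    def forward(self, h_rgb: torch.Tensor, h_pbr: torch.Tensor):
        # Each branch receives a residual contribution from the other.
        return h_rgb + self.pbr_to_rgb(h_pbr), h_pbr + self.rgb_to_pbr(h_rgb)

class JointDenoiser(nn.Module):
    """Two parallel stacks of blocks; only the PBR stack and the links train."""
    def __init__(self, rgb_blocks: nn.ModuleList, pbr_blocks: nn.ModuleList, dim: int):
        super().__init__()
        self.rgb_blocks, self.pbr_blocks = rgb_blocks, pbr_blocks
        self.links = nn.ModuleList(CrossNetworkLink(dim) for _ in rgb_blocks)
        for p in self.rgb_blocks.parameters():
            p.requires_grad_(False)  # frozen: no catastrophic forgetting

    def forward(self, h_rgb: torch.Tensor, h_pbr: torch.Tensor):
        for rgb_block, pbr_block, link in zip(self.rgb_blocks, self.pbr_blocks, self.links):
            h_rgb, h_pbr = link(rgb_block(h_rgb), pbr_block(h_pbr))
        return h_rgb, h_pbr

# Toy usage: four MLP-style blocks per branch, 128-dim features.
dim = 128
make = lambda: nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.SiLU()) for _ in range(4))
model = JointDenoiser(make(), make(), dim)
h_rgb, h_pbr = model(torch.randn(2, 16, dim), torch.randn(2, 16, dim))
```

Zero-initialising the PBR-to-RGB projection leaves the frozen RGB model's behaviour intact at initialisation, while the PBR branch can read the RGB internal state from the very first training step.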
The approach is data-efficient, generating high-quality images even from a limited training set, and it remains compatible with IPAdapter; ablation studies demonstrate improvements over existing paradigms. The method uses a dedicated PBR VAE to encode PBR images into a lower-dimensional latent space, since PBR images have more channels than RGB and therefore cannot be compressed into the pre-trained RGB VAE's latent space. The PBR model is trained to denoise PBR images given additional RGB context, rather than learning inverse rendering in degraded image space.
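As a minimal sketch of the dedicated PBR VAE idea, assuming an 8-channel layout (albedo, normal/bump, roughness, metallic): the PBR maps are stacked into one multi-channel tensor, which cannot pass through a 3-channel RGB autoencoder, and are compressed by a small VAE of their own. The channel layout and architecture below are illustrative assumptions, not the paper's:

```python
import torch
import torch.nn as nn

# Assumed channel layout: albedo (3) + normal/bump (3) + roughness (1)
# + metallic (1) = 8 channels -- too many for a 3-channel RGB VAE,
# hence a dedicated PBR VAE.
class PBRVAE(nn.Module):
    def __init__(self, in_ch: int = 8, latent_ch: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 2 * latent_ch, 3, stride=2, padding=1),  # mean, logvar
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, in_ch, 4, stride=2, padding=1),
        )

    def encode(self, pbr: torch.Tensor) -> torch.Tensor:
        # Reparameterisation trick: sample z from the predicted Gaussian.
        mean, logvar = self.encoder(pbr).chunk(2, dim=1)
        return mean + torch.randn_like(mean) * (0.5 * logvar).exp()

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

# Toy example at 64x64: stack the maps channel-wise, then encode.
maps = {
    "albedo": torch.rand(1, 3, 64, 64),
    "normal": torch.rand(1, 3, 64, 64),
    "roughness": torch.rand(1, 1, 64, 64),
    "metallic": torch.rand(1, 1, 64, 64),
}
pbr = torch.cat(list(maps.values()), dim=1)  # (1, 8, 64, 64)
z = PBRVAE().encode(pbr)                     # lower-dimensional latent
```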
The method is evaluated with distribution-match metrics: Inception Score (IS), Fréchet Inception Distance (FID), and CLIP Maximum-Mean Discrepancy (CMMD). Out-of-distribution (OOD) performance is assessed with CLIP alignment scores and OneAlign aesthetics and quality metrics. The results show that the proposed approach outperforms existing methods in distribution match, quality, and OOD performance, and that it remains compatible with other control techniques, supporting its practical use in graphics pipelines. It performs well even with limited training data and works with existing adaptations of the base RGB model. Interpolation experiments further show that the method is stable with respect to initial noise and text prompts. The main limitations are a lack of detail in the roughness, metallic, and bump maps and occasional failure to follow OOD prompts; in addition, the method requires two parallel diffusion models, which may be costly in some applications.
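For concreteness, a sketch of the kind of CLIP alignment score used in the OOD evaluation: the cosine similarity between the prompt embedding and the image embedding, averaged over the test set. The checkpoint and the Hugging Face `transformers` API here are tooling assumptions, not the paper's evaluation code:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant would illustrate the same metric.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of a prompt and an image."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))

# Hypothetical usage on one rendered result:
# score = clip_alignment(Image.open("render.png"), "a translucent jade dragon")
```

In practice, such per-image scores would be averaged over renders of the generated PBR maps across the OOD prompt set.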