R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion
This paper proposes R3CD, a novel method for generating images from scene graphs by leveraging large-scale diffusion models and contrastive control mechanisms. The method addresses two main challenges in scene graph to image generation: (1) existing approaches cannot depict concise and accurate interactions specified by abstract relations, and (2) they often fail to generate complete entities. R3CD introduces a scene graph transformer (SGFormer) to capture both local and global information from scene graphs, and a relation-aware compositional contrastive control framework to guide the diffusion model toward images that align with the scene graph's abstract relations.
The SGFormer is initialized with a T5 model and encodes both the nodes and edges of the scene graph. A joint contrastive loss defined over attention maps and denoising steps controls the diffusion model, ensuring that the generated images reflect the abstract relations in the scene graph. The method also employs a triplet-level compositional generation strategy, so that generated images not only reflect the visual interaction between the two objects in a triplet but also avoid missing entities.
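For a rough sense of the encoder side, the sketch below embeds scene-graph triplets with a pretrained T5 encoder from Hugging Face Transformers. The triplet serialization, the mean pooling, and the function name encode_scene_graph are illustrative assumptions; the actual SGFormer additionally fuses local (triplet-level) and global (graph-level) information, which is not reproduced here.

```python
# Minimal sketch (not the authors' code): embedding scene-graph triplets with a
# pretrained T5 encoder. SGFormer additionally mixes local and global graph
# information, which is omitted here.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")

def encode_scene_graph(triplets):
    """Return one embedding per (subject, relation, object) triplet."""
    # Serialize each triplet as a short phrase, e.g. "person riding horse".
    phrases = [" ".join(t) for t in triplets]
    tokens = tokenizer(phrases, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = encoder(**tokens).last_hidden_state        # (N, seq_len, d_model)
    # Mean-pool token states with the attention mask to get per-triplet vectors.
    mask = tokens.attention_mask.unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)              # (N, d_model)

# Example: two triplets from one scene graph.
emb = encode_scene_graph([("person", "riding", "horse"),
                          ("dog", "next to", "horse")])
print(emb.shape)  # torch.Size([2, 768]) for t5-base
```

In practice such per-triplet embeddings, rather than a single flat caption embedding, would be fed as conditioning to the diffusion model's cross-attention layers, which is what enables the triplet-level compositional control described above.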
Extensive experiments on the Visual Genome and COCO-Stuff datasets demonstrate that R3CD outperforms existing methods in both quantitative metrics and qualitative comparisons. The method achieves superior image quality, diversity, and alignment with the scene graph specification. The results show that R3CD generates more realistic and diverse images that respect the scene graph's abstract relations, especially for interactions that are difficult to express by stitching entities together.
The method's effectiveness is further validated through ablation studies, which show that each component of R3CD contributes to the overall performance. The SGFormer enriches the semantic encoding of nodes and edges with local and global information; the attention-map contrastive loss enforces spatial consistency of entities that share a relation; and the diffusion-step contrastive loss enforces interaction consistency by aligning their noise distributions with the corresponding relation embeddings.
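To make the joint contrastive objective concrete, the sketch below shows one plausible instantiation rather than the paper's exact formulation: a SupCon-style term over pooled cross-attention features of entity regions, where samples sharing a relation act as positives, plus a term that pushes pooled noise predictions toward their relation embedding. The pooling, temperatures, and loss weights are assumptions introduced for illustration.

```python
# Hedged sketch of a relation-aware joint contrastive loss; the exact terms,
# pooling, and hyperparameters in R3CD may differ.
import torch
import torch.nn.functional as F

def relation_contrastive(features, relation_ids, temperature=0.1):
    """SupCon-style loss: samples that share a relation are mutual positives."""
    f = F.normalize(features, dim=-1)                          # (B, D)
    sim = f @ f.t() / temperature                              # (B, B)
    same = relation_ids.unsqueeze(0) == relation_ids.unsqueeze(1)
    eye = torch.eye(f.size(0), dtype=torch.bool, device=f.device)
    pos_mask = same & ~eye                                     # positives, excluding self
    logits = sim.masked_fill(eye, float("-inf"))               # never contrast with self
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    pos_count = pos_mask.sum(1).clamp(min=1)
    return -(pos_log_prob.sum(1) / pos_count).mean()

def noise_relation_alignment(noise_features, relation_emb, relation_ids,
                             temperature=0.1):
    """Align pooled noise predictions with their relation embedding,
    contrasting against the embeddings of all other relations."""
    n = F.normalize(noise_features, dim=-1)                    # (B, D)
    r = F.normalize(relation_emb, dim=-1)                      # (R, D)
    logits = n @ r.t() / temperature                           # (B, R)
    return F.cross_entropy(logits, relation_ids)

def joint_contrastive_loss(attn_features, noise_features, relation_emb,
                           relation_ids, lambda_attn=1.0, lambda_noise=1.0):
    """Attention-map term (spatial consistency) + diffusion-step term
    (interaction consistency), combined with assumed weights."""
    l_attn = relation_contrastive(attn_features, relation_ids)
    l_noise = noise_relation_alignment(noise_features, relation_emb, relation_ids)
    return lambda_attn * l_attn + lambda_noise * l_noise
```

In such a setup both terms would be added to the standard denoising loss during training, and batches would need to be sampled so that several examples share a relation; otherwise the attention-map term has no positives and degenerates to zero.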
In conclusion, R3CD provides a novel framework for scene graph to image generation that leverages large-scale diffusion models and contrastive control mechanisms to capture the interactions between entity regions and abstract relations in the scene graph. The method has been evaluated on two datasets and has demonstrated superior performance in both quantitative and qualitative terms.