14 Jul 2024 | Jing Wu, Jia-Wang Bian, Xinghui Li, Guangrun Wang, Ian Reid, Philip Torr, and Victor Adrian Prisacariu
GaussCtrl is a text-driven method for editing 3D Gaussian Splatting (3DGS) scenes. It edits a 3DGS scene by modifying its descriptive prompt: the rendered images of the 3DGS are edited according to the new prompt, and the 3D model is then re-trained on them.

The key contribution is a depth-conditioned, multi-view consistent editing framework that substantially reduces the blurry or implausible 3D results caused by inconsistent editing in previous work. The method first renders a collection of images from the 3DGS and edits them with a pre-trained 2D diffusion model (ControlNet) conditioned on the input prompt; the edited images are then used to optimize the 3D model. All images are edited together, rather than iteratively editing one image while updating the 3D model as in previous work, which leads to faster editing and higher visual quality. Two components achieve this: (a) depth-conditioned editing, which enforces geometric consistency across multi-view images by leveraging their naturally consistent depth maps, and (b) attention-based latent code alignment, which unifies the appearance of edited images by conditioning their editing on several reference views through self- and cross-view attention between the images' latent representations (both are sketched in code below).

Experiments demonstrate that the method achieves faster editing and better visual results than previous state-of-the-art methods. It is evaluated on a variety of scenes and text prompts, ranging from forward-facing scenes to challenging 360-degree object-centered scenes, and an ablation study validates the effectiveness of each component. The experiments show that the method significantly improves the visual quality of edits while greatly reducing processing time.

The contributions are: (1) GaussCtrl, which enables efficient editing of 3DGS scenes with text instructions; (2) depth guidance and an attention-based latent code alignment module that together encourage multi-view consistent editing; and (3) more realistic editing and higher visual quality than previous work across a variety of 3D editing scenes.

The method is compared with two state-of-the-art methods, Instruct-GS2GS and ViCA-NeRF, and generates more consistent, higher-quality images than both. In forward-facing scenes it produces more realistic results with better quality, stronger consistency, and fewer artifacts.

Alignment of the 3D edit with the text instruction is measured with CLIP Text-Image Directional Similarity (CLIP_dir), on which the method outperforms the other approaches in four out of six scenes. However, CLIP_dir does not always reflect the true visual quality of an edit: in some cases the method produces visibly better results yet receives lower scores than previous methods. Among the compared methods, it is also the fastest. Finally, ablation studies demonstrate the effectiveness of each proposed component, showing that the full method significantly improves general style alignment compared to the ablated variants.
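The per-view depth-conditioned edit can be illustrated with the Hugging Face diffusers library. The sketch below is a minimal approximation, not GaussCtrl's actual pipeline (which inverts the latents of all rendered views and edits them jointly): it assumes a Stable Diffusion 1.5 base model with the public depth ControlNet checkpoint, and the file names render_000.png / depth_000.png for one rendered view and its depth map are hypothetical.

```python
# Minimal sketch of depth-conditioned 2D editing of one rendered 3DGS view,
# using Hugging Face diffusers. Checkpoint names and file paths are assumptions.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

rgb = Image.open("render_000.png").convert("RGB")    # RGB render of one view
depth = Image.open("depth_000.png").convert("RGB")   # its depth map (shared geometry)

edited = pipe(
    prompt="a bronze statue of a man",  # the new descriptive prompt
    image=rgb,                          # image being edited
    control_image=depth,                # depth conditioning keeps geometry fixed
    strength=0.75,                      # how far the edit may depart from the render
    num_inference_steps=30,
).images[0]
edited.save("edited_000.png")
```

Because every view is conditioned on a depth map rendered from the same 3D geometry, the edited views agree on structure even before the appearance-alignment step.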
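The attention-based latent code alignment can likewise be sketched in a few lines. The function below is a hypothetical illustration of the core idea only: during denoising, a view's self-attention keys and values are extended with latent tokens gathered from the reference views, so every edited view attends to the same reference appearance. The function name and tensor shapes are assumptions, and the learned Q/K/V projections of a real attention layer are omitted.

```python
import torch
import torch.nn.functional as F

def cross_view_attention(x: torch.Tensor, refs: torch.Tensor) -> torch.Tensor:
    """Sketch of cross-view attention for latent code alignment.

    x:    (N, C) latent tokens of the view currently being edited
    refs: (M, C) latent tokens gathered from the reference views
    Keys/values are the current view's tokens concatenated with the reference
    tokens, so the edit is conditioned on the reference views' appearance.
    (Learned Q/K/V projections of a real attention layer are omitted.)
    """
    kv = torch.cat([x, refs], dim=0)               # (N + M, C)
    scale = x.shape[-1] ** -0.5
    attn = F.softmax(x @ kv.t() * scale, dim=-1)   # (N, N + M) attention weights
    return attn @ kv                               # (N, C) appearance-aligned tokens
```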
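For reference, CLIP_dir compares the direction of change in CLIP image space with the direction of change in CLIP text space: the cosine similarity between (edited image embedding minus source image embedding) and (target caption embedding minus source caption embedding). A minimal sketch using the transformers CLIP model follows; the checkpoint is the standard public one, and the inputs are assumed to be PIL images and caption strings.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_dir(img_src, img_edit, txt_src, txt_edit) -> float:
    """CLIP Text-Image Directional Similarity for one (source, edit) pair."""
    imgs = processor(images=[img_src, img_edit], return_tensors="pt")
    txts = processor(text=[txt_src, txt_edit], return_tensors="pt", padding=True)
    img_emb = F.normalize(model.get_image_features(**imgs), dim=-1)
    txt_emb = F.normalize(model.get_text_features(**txts), dim=-1)
    d_img = F.normalize(img_emb[1] - img_emb[0], dim=-1)  # image-space edit direction
    d_txt = F.normalize(txt_emb[1] - txt_emb[0], dim=-1)  # text-space edit direction
    return (d_img * d_txt).sum().item()                   # cosine similarity
```

A per-scene score is typically obtained by averaging this value over rendered views of the edited scene, which also explains why the metric can disagree with perceived quality: it rewards movement in the caption's direction, not artifact-free images.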