[slides and audio] TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts

**TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts**

**Authors:** Jingyu Zhuang, Di Kang, Yan-Pei Cao, Guanbin Li, Liang Lin, Ying Shan

**Abstract:** Text-driven 3D scene editing has gained significant attention due to its convenience and user-friendliness. However, existing methods often lack precise control over the specified appearance and location of the editing result due to the limitations of text descriptions. To address this, we propose TIP-Editor, a 3D scene editing framework that accepts both text and image prompts, along with a 3D bounding box that specifies the editing region. The image prompt complements the textual description, enabling more accurate control over the appearance. TIP-Editor employs a stepwise 2D personalization strategy, including a localization loss to ensure correct object placement and a separate content personalization step based on LoRA, to achieve precise location and appearance control. Additionally, it uses 3D Gaussian splatting (GS) as the 3D representation, which is efficient and suitable for local editing while keeping the background unchanged. Extensive experiments demonstrate that TIP-Editor consistently outperforms baselines in editing quality and alignment to the prompts, both qualitatively and quantitatively.

**Contributions:**
- We present TIP-Editor, a versatile 3D scene editing framework that allows users to perform various editing operations guided by both text and image prompts.
- We introduce a novel stepwise 2D personalization strategy, featuring a localization loss and a separate content personalization step dedicated to the reference image, to enable accurate location and appearance control.
- We adopt 3D Gaussian splatting as the 3D representation due to its rendering efficiency and explicit point data structure, which facilitates precise local editing.

**Methods:**
1. **Stepwise 2D Personalization:** This strategy includes a localization loss to enforce interaction between the existing scene and the novel content specified by the 3D bounding box, and a separate content personalization step using LoRA layers to capture unique characteristics of the reference image.
2. **Coarse Editing via SDS Loss:** The selected Gaussians inside the bounding box are optimized using score distillation sampling (SDS) loss, with different criteria for object insertion, replacement, re-texturing, and stylization.
3. **Pixel-Level Image Refinement:** A pixel-level reconstruction loss is applied to enhance the quality of the editing results by creating a pseudo-GT image and supervising the rendered image.

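
The idea of optimizing only the Gaussians inside the bounding box while keeping the background unchanged can be illustrated with a small sketch. This is not the authors' implementation: the `Gaussian` class, the axis-aligned box, and the functions `select_editable` and `masked_update` are hypothetical names chosen for illustration, and the per-Gaussian gradients stand in for whatever gradient an SDS-style loss would produce. A real pipeline would also update position, scale, rotation, and opacity, and the box need not be axis-aligned.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Gaussian:
    """Hypothetical minimal Gaussian: only a center and an RGB color."""
    center: Vec3
    color: Vec3


def inside_box(point: Vec3, box_min: Vec3, box_max: Vec3) -> bool:
    """True if the point lies inside an axis-aligned box (an assumption)."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, box_min, box_max))


def select_editable(gaussians: Sequence[Gaussian],
                    box_min: Vec3, box_max: Vec3) -> List[int]:
    """Indices of Gaussians whose centers fall inside the editing box."""
    return [i for i, g in enumerate(gaussians)
            if inside_box(g.center, box_min, box_max)]


def masked_update(gaussians: Sequence[Gaussian],
                  color_grads: Sequence[Vec3],
                  editable: Sequence[int],
                  lr: float) -> List[Gaussian]:
    """One gradient-descent step on color, applied to editable Gaussians only.

    Gaussians outside the box are returned unchanged, which is what keeps
    the background fixed during editing.
    """
    editable_set = set(editable)
    updated = []
    for i, (g, grad) in enumerate(zip(gaussians, color_grads)):
        if i in editable_set:
            new_color = tuple(c - lr * d for c, d in zip(g.color, grad))
            updated.append(Gaussian(g.center, new_color))
        else:
            updated.append(g)
    return updated
```

For example, with one Gaussian at the origin and one at `(2, 0, 0)`, a box spanning `(-1, -1, -1)` to `(1, 1, 1)` selects only the first, so a call to `masked_update` changes its color and leaves the second untouched. The explicit, per-point structure of the representation is what makes this kind of masking straightforward.
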

**Experiments:**
- **Setup:** Implementation details, dataset selection, baselines, and evaluation criteria are provided.
- **Visual Results:** Qualitative comparisons show that TIP-Editor effectively executes various editing tasks, including re-texturing, object insertion, replacement, and stylization, achieving high-quality results and strict adherence to the provided prompts.