July 2024 | JINGYU ZHUANG, DI KANG, YAN-PEI CAO, GUANBIN LI, LIANG LIN, YING SHAN
TIP-Editor is a 3D scene editing framework that accepts both text and image prompts, along with a 3D bounding box that specifies the editing region. It enables precise, high-quality localized editing and supports several types of edits on a 3D scene: object insertion, whole-object replacement, part-level object editing, combinations of these editing types, and stylization. The editing process is guided by both text and a reference image; the image complements the textual description and gives more accurate control over the result. In the example prompts, images embedded in the text stand for their associated rare tokens, which are fixed without optimization.
TIP-Editor employs a stepwise 2D personalization strategy to better learn the representation of the existing scene and the reference image. A localization loss is proposed to encourage correct object placement as specified by the bounding box. Additionally, TIP-Editor utilizes explicit and flexible 3D Gaussian splatting (GS) as the 3D representation to facilitate local editing while keeping the background unchanged. Extensive experiments have demonstrated that TIP-Editor conducts accurate editing following the text and image prompts in the specified bounding box region, consistently outperforming the baselines in editing quality and alignment to the prompts.
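The way an explicit representation keeps the background unchanged can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an axis-aligned bounding box and treats each Gaussian as just its center point, and the names `Box3D`, `editable_mask`, and `apply_update` are hypothetical. Only Gaussians whose centers fall inside the box receive updates; all others are returned as they were.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D box (an assumption; a real editing box may be oriented)."""
    lo: Vec3
    hi: Vec3

    def contains(self, p: Vec3) -> bool:
        # A point is inside when every coordinate lies within [lo, hi].
        return all(l <= x <= h for l, x, h in zip(self.lo, p, self.hi))


def editable_mask(centers: Sequence[Vec3], box: Box3D) -> List[bool]:
    """Return True for each Gaussian whose center lies inside the editing region."""
    return [box.contains(c) for c in centers]


def apply_update(
    centers: Sequence[Vec3],
    deltas: Sequence[Vec3],
    mask: Sequence[bool],
) -> List[Vec3]:
    """Apply position updates only to editable Gaussians; others stay untouched."""
    return [
        tuple(x + d for x, d in zip(c, dl)) if m else c
        for c, dl, m in zip(centers, deltas, mask)
    ]


if __name__ == "__main__":
    centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    box = Box3D(lo=(-1.0, -1.0, -1.0), hi=(1.0, 1.0, 1.0))
    mask = editable_mask(centers, box)
    updated = apply_update(centers, [(0.5, 0.5, 0.5)] * 2, mask)
    print(mask, updated)
```

In an actual Gaussian splatting pipeline the same mask would typically gate gradient updates to all per-Gaussian parameters (position, scale, rotation, opacity, color), not only the centers shown here.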
TIP-Editor is evaluated across various real-world scenes, including objects, human faces, and outdoor scenes. The editing results successfully capture the unique characteristics specified in the reference images, significantly enhancing the controllability of the editing process. In both qualitative and quantitative comparisons, TIP-Editor consistently demonstrates superior performance in editing quality, visual fidelity, and user satisfaction compared to existing methods. The contributions include a versatile 3D scene editing framework that allows users to perform various editing operations guided by text prompts and a reference image, a novel stepwise 2D personalization strategy, and the adoption of 3D Gaussian splatting for efficient and precise local editing.