CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

8 Mar 2024 | Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xiang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su, and Jun Zhu
The paper introduces the Convolutional Reconstruction Model (CRM), a feed-forward single-image-to-3D generative model that produces a high-fidelity textured mesh from one input image in about 10 seconds. CRM's key innovation is the integration of geometric priors into the network design: it exploits the spatial correspondence between six orthographic views of an object and the three planes of a triplane representation. Multi-view diffusion models first generate six orthographic view images and corresponding canonical coordinate maps (CCMs) from the input image; a convolutional U-Net then maps these images to a high-resolution triplane. The triplane is decoded with Flexicubes, a flexible geometric representation that enables end-to-end optimization directly on textured meshes.

Because the reconstruction network is fully convolutional, the approach trains efficiently, requiring a smaller batch size and less training time than transformer-based methods such as the Large Reconstruction Model (LRM). Experiments show that CRM produces textured meshes with better geometry and texture than existing baselines, while being significantly faster and more computationally efficient.
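To make the view-to-triplane correspondence concrete, below is a minimal PyTorch sketch of the idea: because the six orthographic views are pixel-aligned with the triplane, a plain convolutional network can map them to plane features, which are then sampled at 3D points. This is an illustrative toy, not the authors' code: the layer sizes, the channel-stacking layout of the six views, and the summed plane aggregation in `query_triplane` are simplifying assumptions (the paper uses a U-Net and a specific spatial tiling of the views).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthoViewsToTriplane(nn.Module):
    """Toy convolutional reconstructor: six orthographic views -> triplane.

    Illustrative stand-in for CRM's convolutional U-Net; all sizes and the
    channel-stacking layout are assumptions, not the paper's architecture.
    """
    def __init__(self, view_ch=6, hidden=64, tri_ch=32):
        super().__init__()
        self.tri_ch = tri_ch
        self.net = nn.Sequential(
            nn.Conv2d(6 * view_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, 3 * tri_ch, 3, padding=1),
        )

    def forward(self, views):
        # views: (B, 6, view_ch, H, W) -- RGB (3) + CCM (3) per orthographic view.
        B, V, C, H, W = views.shape
        # Stack views channel-wise so each output pixel stays spatially
        # aligned with the corresponding pixel of all six input views.
        x = views.reshape(B, V * C, H, W)
        planes = self.net(x)                             # (B, 3*tri_ch, H, W)
        return planes.reshape(B, 3, self.tri_ch, H, W)   # XY, XZ, YZ planes


def query_triplane(planes, pts):
    """Sample triplane features at 3D points (a standard triplane lookup)."""
    # planes: (B, 3, C, H, W); pts: (B, N, 3) with coordinates in [-1, 1].
    B, _, C, H, W = planes.shape
    coords = torch.stack([pts[..., [0, 1]],    # project onto XY plane
                          pts[..., [0, 2]],    # project onto XZ plane
                          pts[..., [1, 2]]],   # project onto YZ plane
                         dim=1)                # (B, 3, N, 2)
    feats = F.grid_sample(planes.flatten(0, 1),               # (B*3, C, H, W)
                          coords.flatten(0, 1).unsqueeze(1),  # (B*3, 1, N, 2)
                          align_corners=False)                # (B*3, C, 1, N)
    return feats.reshape(B, 3, C, -1).sum(dim=1).transpose(1, 2)  # (B, N, C)


# Smoke test with random tensors in place of diffusion-generated views.
views = torch.randn(1, 6, 6, 128, 128)
planes = OrthoViewsToTriplane()(views)    # (1, 3, 32, 128, 128)
pts = torch.rand(1, 1024, 3) * 2 - 1
feats = query_triplane(planes, pts)       # (1, 1024, 32); in the paper these
                                          # features feed SDF/color heads and
                                          # Flexicubes mesh extraction
```

The point of the sketch is the inductive bias: no attention or positional encoding is needed to relate input pixels to triplane cells, because the convolution preserves the spatial alignment between the orthographic views and the planes.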