12 Apr 2024 | Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani
This paper investigates the 3D awareness of visual foundation models: large-scale pretrained models that have shown strong generalization across visual tasks such as classification, segmentation, and generation. The authors propose that 3D awareness can be evaluated through two key capabilities: single-view surface reconstruction and multiview consistency. They assess frozen features with task-specific probes and zero-shot inference procedures, testing whether the models encode 3D surface structure and represent surfaces consistently across views.
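To make the probing setup concrete, here is a minimal sketch of training a task-specific probe on frozen features for depth prediction. It is illustrative rather than the paper's exact recipe: the probe architecture and L1 loss are assumptions, and `load_frozen_backbone` is a hypothetical placeholder for whichever foundation model is being evaluated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a depth probe on frozen features (not the paper's exact probe).
# Assumes a frozen backbone mapping an image batch [B, 3, H, W] to a patch
# feature grid [B, C, h, w]; `load_frozen_backbone` is a placeholder.

class DepthProbe(nn.Module):
    """Small trainable head that regresses per-pixel depth from frozen features."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor, out_hw) -> torch.Tensor:
        depth = self.head(feats)  # [B, 1, h, w]
        # Upsample to the ground-truth resolution for a dense per-pixel loss.
        return F.interpolate(depth, size=out_hw, mode="bilinear", align_corners=False)


def probe_step(backbone, probe, images, gt_depth, optimizer):
    """One training step: only the probe's parameters receive gradients."""
    with torch.no_grad():          # the foundation model stays frozen
        feats = backbone(images)   # [B, C, h, w]
    pred = probe(feats, gt_depth.shape[-2:])
    loss = F.l1_loss(pred, gt_depth)  # simple L1; the paper may use a different objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the backbone never receives gradients, the probe's accuracy reflects how much 3D structure is already present in the frozen representation rather than what the model could learn with fine-tuning.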
The study reveals several limitations of current models. While some, such as DINOv2 and Stable Diffusion, encode depth and surface normals well, others, like CLIP, struggle with these tasks despite their impressive semantic generalization. The models generally handle small viewpoint changes but degrade sharply under large viewpoint variations, indicating a lack of multiview consistency. The analysis also suggests that performance on semantic correspondence is not a good indicator of 3D consistency.
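The multiview consistency evaluation can likewise be sketched as zero-shot feature matching between two views of the same scene: match frozen patch features by mutual nearest neighbours and check how many matches fall near the ground-truth correspondences. The functions and the pixel threshold below are illustrative assumptions, not the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

# Zero-shot correspondence sketch: match L2-normalized frozen features between
# two views with mutual nearest neighbours, then score them against ground truth.

def mutual_nn_matches(feat_a: torch.Tensor, feat_b: torch.Tensor):
    """feat_a: [Na, C], feat_b: [Nb, C] patch features from two views."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    sim = a @ b.t()                  # cosine similarity [Na, Nb]
    ab = sim.argmax(dim=1)           # best match in view B for each patch in A
    ba = sim.argmax(dim=0)           # best match in view A for each patch in B
    idx_a = torch.arange(a.shape[0], device=a.device)
    mutual = ba[ab] == idx_a         # keep only mutual nearest neighbours
    return idx_a[mutual], ab[mutual]


def correspondence_recall(pred_xy_b: torch.Tensor, gt_xy_b: torch.Tensor,
                          pix_thresh: float = 10.0) -> float:
    """Fraction of predicted match locations within `pix_thresh` pixels of ground truth."""
    err = (pred_xy_b - gt_xy_b).norm(dim=-1)
    return (err < pix_thresh).float().mean().item()
```

Under this kind of metric, features that are merely semantically similar (e.g., matching any "chair leg" to any other) can score well on semantic correspondence while still failing geometric, view-consistent matching, which is consistent with the paper's observation that the two are not interchangeable.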
The paper discusses the implications of these findings and highlights the need for more comprehensive benchmarks and further research to better understand the 3D awareness of visual foundation models. The authors hope that their work will stimulate more interest in this area and contribute to models that better represent and understand the 3D world.