Probing the 3D Awareness of Visual Foundation Models

12 Apr 2024 | Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani
This paper investigates the 3D awareness of visual foundation models: large-scale pre-trained models that generalize across a wide range of tasks. The study asks whether these models represent the 3D structure of scenes and whether their representations are consistent across views. The authors propose two capabilities for evaluating 3D awareness: single-view surface reconstruction and multiview consistency. They evaluate frozen features using task-specific probes and zero-shot inference, assessing how well each model encodes depth, surface normals, and 3D correspondence.
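To make the probing setup concrete, below is a minimal PyTorch sketch of a dense depth probe trained on top of frozen patch features. The backbone interface (`frozen_encoder`), its output format, and the binned-depth targets are assumptions for illustration; the paper's actual probes are more expressive, and this linear head only illustrates the idea that the encoder stays frozen while a lightweight head is fit to its features.

```python
import torch
import torch.nn as nn


class DepthProbe(nn.Module):
    """Minimal dense probe: predicts per-patch depth bins from frozen features.

    This is a simplified stand-in for the paper's probes; a single linear layer
    keeps the probe weak so that performance mostly reflects the features.
    """

    def __init__(self, feat_dim: int, n_bins: int = 256):
        super().__init__()
        # Binned depth classification is one common probing choice (an assumption
        # here, not necessarily the paper's exact loss formulation).
        self.head = nn.Linear(feat_dim, n_bins)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N_patches, feat_dim) from a frozen encoder
        return self.head(patch_feats)  # (B, N_patches, n_bins) depth-bin logits


def probe_step(frozen_encoder, probe, images, depth_bins, optimizer):
    """One training step: only the probe's parameters receive gradients."""
    with torch.no_grad():                   # backbone stays frozen
        feats = frozen_encoder(images)      # assumed to return (B, N, D) patch features
    logits = probe(feats)
    loss = nn.functional.cross_entropy(
        logits.flatten(0, 1),               # (B*N, n_bins)
        depth_bins.flatten(),               # (B*N,) integer depth-bin targets
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```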
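Multiview correspondence, in contrast, is evaluated zero-shot: features extracted from two views are matched directly, with no training. The sketch below uses mutual nearest neighbours over frozen patch features; the matching rule and evaluation details here are illustrative assumptions rather than the paper's exact protocol.

```python
import torch


def dense_feature_correspondence(feats_a: torch.Tensor, feats_b: torch.Tensor):
    """Zero-shot correspondence via mutual nearest neighbours in feature space.

    feats_a: (N, D) patch features from view A of a frozen model.
    feats_b: (M, D) patch features from view B.
    Returns index pairs (i, j) such that patch i in view A and patch j in
    view B select each other as nearest neighbours.
    """
    a = torch.nn.functional.normalize(feats_a, dim=-1)
    b = torch.nn.functional.normalize(feats_b, dim=-1)
    sim = a @ b.t()                        # (N, M) cosine similarities
    best_b_for_a = sim.argmax(dim=1)       # A -> B nearest neighbours
    best_a_for_b = sim.argmax(dim=0)       # B -> A nearest neighbours
    idx_a = torch.arange(a.shape[0])
    mutual = best_a_for_b[best_b_for_a] == idx_a   # keep cycle-consistent matches
    return idx_a[mutual], best_b_for_a[mutual]
```

Predicted matches of this kind can then be scored against ground-truth geometry, for example by reprojecting with known depth and camera pose and measuring the fraction of matches within a pixel threshold.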
The results show that some models, such as DINOv2 and Stable Diffusion, capture fine details of depth and surface normals, while others, such as CLIP and MAE, fail to encode depth and appear to rely on prior knowledge. Models also struggle with multiview consistency, especially under large viewpoint changes: several estimate accurate correspondence for small viewpoint changes but perform poorly for larger ones, indicating limited 3D consistency. The analysis further suggests that semantic correspondence is not a reliable indicator of 3D consistency; models can match semantic parts across different instances of the same class, yet often fail to account for global object pose.

The findings suggest that current models are not fully 3D consistent despite their strong performance on other tasks. Evaluating 3D awareness offers insight into how these models represent the 3D world and can contribute to more comprehensive benchmarks for visual representation learning, and the authors call for further investigation to better understand the capabilities and limitations of visual foundation models.