Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models

19 Feb 2024 | Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li, Yang Liu
The paper introduces SCAFFOLD, a visual prompting method designed to enhance vision-language coordination in Large Multi-Modal Models (LMMs). SCAFFOLD overlays a dot matrix with labeled coordinates on input images, providing visual anchors and textual references that guide LMMs through complex vision-language tasks. The method addresses a limitation of existing prompting techniques, which typically improve textual reasoning or image preprocessing in isolation and offer no general solution for vision-language coordination.

SCAFFOLD is evaluated on challenging vision-language benchmarks spanning spatial reasoning, compositional reasoning, fine-grained visual understanding, and hallucination. Extensive experiments show that SCAFFOLD significantly outperforms GPT-4V with textual Chain-of-Thought (CoT) prompting. The method also shows promise in enhancing active perception and can be combined with other prompting techniques such as CoT. Ablation studies examine the contribution of design choices such as matrix size, color, and coordinate format, and further analyses explore SCAFFOLD's robustness to perturbations and its compatibility with other methods. The results highlight the potential of SCAFFOLD to advance vision-language coordination in LMMs.
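As a concrete illustration of the overlay described above, the sketch below shows one way such a dot-matrix coordinate prompt could be rendered with Pillow. This is a minimal sketch, not the authors' implementation: the grid size, dot radius, label placement, and "(row,col)" coordinate format are illustrative assumptions (the paper's ablations vary matrix size, color, and coordinate format).

```python
# Minimal sketch (not the authors' released code): overlay a labeled dot matrix
# on an image, in the spirit of SCAFFOLD's visual anchors. Grid size, dot
# radius, colors, and the "(row,col)" label format are illustrative assumptions.
from PIL import Image, ImageDraw

def overlay_dot_matrix(image_path: str, out_path: str, rows: int = 6, cols: int = 6) -> None:
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    radius = max(2, min(w, h) // 200)  # assumed heuristic for dot size

    for r in range(rows):
        for c in range(cols):
            # Place dots at the centers of an evenly spaced grid.
            x = int((c + 0.5) * w / cols)
            y = int((r + 0.5) * h / rows)
            draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill="black")
            # Label each dot with its (row, col) coordinate so the model can
            # refer to the same anchor in its textual reasoning.
            draw.text((x + radius + 2, y - radius), f"({r + 1},{c + 1})", fill="black")

    img.save(out_path)

# Example usage (paths are placeholders):
# overlay_dot_matrix("input.jpg", "input_scaffold.jpg", rows=6, cols=6)
```

The overlaid image, rather than the original, would then be passed to the LMM together with a textual prompt that explains the coordinate scheme.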