21 Mar 2024 | Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, Jiwen Lu
**Abstract:**
In the realm of vision-language understanding, enhancing the ability of models to interpret and reason over visual content is crucial for various applications. However, large vision-language models (LVLMs) struggle to extract the features of an image that are relevant to a specific question, and often resort to lower-resolution inputs to reduce computational cost. This paper introduces Chain-of-Spot (CoS), an interactive reasoning approach that improves feature extraction by focusing on the key regions of interest (ROIs) in an image that correspond to the posed question or instruction. Integrated with instruction-following LLaVA-1.5 models, the method consistently improves performance across a wide range of multimodal datasets and benchmarks without increasing image resolution. Empirical results demonstrate significant improvements in LVLMs' ability to understand and reason about visual content, paving the way for more sophisticated visual instruction-following applications.
**Keywords:** Large Vision-Language Models $\cdot$ Chain-of-Spot
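
The abstract only names the mechanism, but the interactive scheme can be read as a two-pass inference loop: first ask the model where to look, then answer conditioned on a crop of that region. Below is a minimal sketch of that idea, assuming a generic `model.generate(images, prompt)` interface; the prompt templates, the bounding-box reply format, and the `parse_bbox` helper are hypothetical illustrations, not the paper's exact protocol.

```python
# Sketch of a Chain-of-Spot-style two-step inference loop.
# Assumptions (not specified in the abstract): `model.generate` stands in
# for any LVLM generation call, and the prompts and the normalized
# "[x0, y0, x1, y1]" ROI reply format are illustrative placeholders.

import re
from PIL import Image

ROI_PROMPT = (
    "<image> To answer the question: '{question}', "
    "where is the region of interest in the image?"
)
ANSWER_PROMPT = (
    "<image> The region of interest in the image is <roi_image>. "
    "Answer the question: '{question}'"
)

def parse_bbox(text: str, width: int, height: int) -> tuple[int, int, int, int]:
    """Parse a normalized '[x0, y0, x1, y1]' box from the model's reply."""
    nums = [float(n) for n in re.findall(r"[0-9]*\.?[0-9]+", text)[:4]]
    if len(nums) < 4:
        return 0, 0, width, height  # fall back to the full image
    x0, y0, x1, y1 = nums
    return int(x0 * width), int(y0 * height), int(x1 * width), int(y1 * height)

def chain_of_spot(model, image: Image.Image, question: str) -> str:
    # Step 1: ask the model to localize the question-relevant region.
    roi_reply = model.generate(
        images=[image], prompt=ROI_PROMPT.format(question=question)
    )
    box = parse_bbox(roi_reply, *image.size)

    # Step 2: crop the ROI and answer conditioned on both views, so the
    # model sees fine detail without any global resolution increase.
    roi_crop = image.crop(box)
    return model.generate(
        images=[image, roi_crop],
        prompt=ANSWER_PROMPT.format(question=question),
    )
```

The key design point this sketch tries to capture is that the ROI is question-dependent: the same image yields different crops for different instructions, so the second pass concentrates the visual tokens on whatever the question actually asks about.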