11 Apr 2024 | Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin
This paper introduces LocVLM, a framework that enhances spatial reasoning in visual LLMs (V-LLMs). The key idea is to build spatial awareness through instruction fine-tuning objectives that explicitly process and generate image-space coordinates. The resulting models reason better about space, improving visual question answering (VQA), reducing object hallucination, and strengthening region description, and the framework also extends to the video domain.
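To make "processing and generating image-space coordinates" concrete, here is a minimal sketch of how a bounding box could be serialized into normalized text coordinates inside an instruction-tuning pair. The prompt wording, coordinate format, and function names are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch: turning a pixel-space bounding box into text-space
# coordinates for an instruction-tuning sample. LocVLM's actual prompt
# template and coordinate encoding may differ.

def box_to_text(box, image_w, image_h, precision=2):
    """Normalize a pixel-space box [x1, y1, x2, y2] and render it as text
    so a language model can read and generate coordinates as ordinary tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return "[" + ", ".join(f"{v:.{precision}f}" for v in norm) + "]"

def make_location_sample(object_name, box, image_w, image_h):
    """Build one (instruction, answer) pair asking the model to localize
    a named object with explicit image-space coordinates."""
    return {
        "instruction": f"Provide the bounding box of the {object_name} in the image.",
        "answer": box_to_text(box, image_w, image_h),
    }

sample = make_location_sample("dog", box=(64, 120, 320, 384), image_w=640, image_h=480)
print(sample)
# {'instruction': 'Provide the bounding box of the dog in the image.',
#  'answer': '[0.10, 0.25, 0.50, 0.80]'}
```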
The paper explores three instruction fine-tuning objectives that teach V-LLMs to reason about spatial composition through image-space coordinates. It also introduces pseudo-data generation strategies that improve region description in images and let the approach scale to the video domain.
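The pseudo-data idea can be sketched in the same spirit: off-the-shelf detections are converted into coordinate-grounded training pairs without manual annotation. The sketch below assumes a simple label-plus-box detector output and an invented prompt; the paper's actual pipeline is likely richer.

```python
# Hypothetical sketch of pseudo-data generation for region description.
# Assumes access to off-the-shelf detections (label + box per object);
# the paper's real pipeline and prompt wording may differ.

def make_region_description_samples(detections, image_w, image_h):
    """Turn detector outputs into (instruction, answer) pairs that ask the
    model to describe the content inside a given image-space box."""
    samples = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        # Serialize the box as normalized text coordinates the model can read.
        coords = "[{:.2f}, {:.2f}, {:.2f}, {:.2f}]".format(
            x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h)
        samples.append({
            "instruction": f"Describe the region {coords} of the image.",
            "answer": det["label"],  # a captioning model could supply richer text here
        })
    return samples

detections = [
    {"label": "a brown dog lying on the grass", "box": (64, 120, 320, 384)},
    {"label": "a red frisbee in mid-air", "box": (352, 96, 448, 168)},
]
print(make_region_description_samples(detections, image_w=640, image_h=480))
```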
The framework is evaluated on five vision-language tasks spanning 14 datasets and shows clear improvements throughout: it outperforms existing models on spatial reasoning, image VQA, video VQA, object hallucination, and region description, and reaches state-of-the-art results on four video VQA benchmarks.
The paper also highlights the importance of spatial awareness in V-LLMs, showing that existing models often lack this ability and therefore perform poorly on tasks that require spatial reasoning. The proposed framework addresses this limitation through its coordinate-aware instruction fine-tuning objectives and pseudo-data generation, yielding significantly better spatial reasoning and stronger performance across the vision-language tasks above.