11 Apr 2024 | Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin
This paper introduces LocVLM, a framework that enhances spatial reasoning in visual LLMs (V-LLMs). The key idea is to build spatial awareness through instruction fine-tuning objectives that explicitly process and generate image-space coordinates. The resulting models reason better about space, improving visual question answering (VQA), reducing object hallucination, and strengthening region description, and the framework also extends to the video domain.
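To make "processing and generating image-space coordinates" concrete, here is a minimal sketch of how a bounding box could be serialized into normalized text coordinates inside an instruction-tuning pair. The prompt wording, coordinate format, and function names are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch: turning a pixel-space bounding box into text-space
# coordinates for an instruction-tuning sample. LocVLM's actual prompt
# template and coordinate encoding may differ.

def box_to_text(box, image_w, image_h, precision=2):
    """Normalize a pixel-space box [x1, y1, x2, y2] and render it as text
    so a language model can read and generate coordinates as ordinary tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return "[" + ", ".join(f"{v:.{precision}f}" for v in norm) + "]"

def make_location_sample(object_name, box, image_w, image_h):
    """Build one (instruction, answer) pair asking the model to localize
    a named object with explicit image-space coordinates."""
    return {
        "instruction": f"Provide the bounding box of the {object_name} in the image.",
        "answer": box_to_text(box, image_w, image_h),
    }

sample = make_location_sample("dog", box=(64, 120, 320, 384), image_w=640, image_h=480)
print(sample)
# {'instruction': 'Provide the bounding box of the dog in the image.',
#  'answer': '[0.10, 0.25, 0.50, 0.80]'}
```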
The paper explores three instruction fine-tuning objectives that teach V-LLMs to reason about spatial composition through image-space coordinates. It also introduces pseudo-data generation strategies that improve region description in images and let the approach scale to the video domain.
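The pseudo-data idea can be sketched in the same spirit: off-the-shelf detections are converted into coordinate-grounded training pairs without manual annotation. The sketch below assumes a simple label-plus-box detector output and an invented prompt; the paper's actual pipeline is likely richer.

```python
# Hypothetical sketch of pseudo-data generation for region description.
# Assumes access to off-the-shelf detections (label + box per object);
# the paper's real pipeline and prompt wording may differ.

def make_region_description_samples(detections, image_w, image_h):
    """Turn detector outputs into (instruction, answer) pairs that ask the
    model to describe the content inside a given image-space box."""
    samples = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        # Serialize the box as normalized text coordinates the model can read.
        coords = "[{:.2f}, {:.2f}, {:.2f}, {:.2f}]".format(
            x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h)
        samples.append({
            "instruction": f"Describe the region {coords} of the image.",
            "answer": det["label"],  # a captioning model could supply richer text here
        })
    return samples

detections = [
    {"label": "a brown dog lying on the grass", "box": (64, 120, 320, 384)},
    {"label": "a red frisbee in mid-air", "box": (352, 96, 448, 168)},
]
print(make_region_description_samples(detections, image_w=640, image_h=480))
```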
The framework is evaluated on five vision-language tasks spanning 14 datasets and shows clear improvements throughout: it outperforms existing models on spatial reasoning, image VQA, video VQA, object hallucination, and region description, and reaches state-of-the-art results on four video VQA benchmarks.
The paper also highlights the importance of spatial awareness in V-LLMs, showing that existing models often lack this ability and therefore perform poorly on tasks that require spatial reasoning. The proposed framework addresses this limitation through its coordinate-aware instruction fine-tuning objectives and pseudo-data generation, yielding significantly better spatial reasoning and stronger performance across the vision-language tasks above.