VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments

3 Apr 2024 | Bufang Yang, Lixing He, Kaiwei Liu and Zhenyu Yan
This paper introduces VIAssist, a multi-modal large language model (MLLM) tailored for people with visual impairments (VI). Because of limited vision, VI users often capture low-quality or incomplete images, which hinders the effectiveness of general-purpose MLLMs. VIAssist addresses this by assessing whether a captured photo is suitable for the user's question, offering actionable suggestions for retaking it when it is not, and generating reliable answers once a usable image is available.

The system is built on a custom instruction dataset of questions, images, and aligned responses, enabling VIAssist to better understand and respond to VI-specific queries. The paper highlights the challenges MLLMs face with such queries, including incomplete or low-quality images and the lack of clear guidance for retaking photos. VIAssist is obtained by fine-tuning the LLaVA model with LoRA for parameter-efficient training.

Evaluated on standard VQA datasets and the VI-specific VizWiz dataset, VIAssist outperforms existing models such as GPT-4V on BERTScore and ROUGE, demonstrating more accurate and relevant responses for VI users and significant improvements in accuracy and reliability. It can assess image quality, provide detailed retaking suggestions, and generates fewer irrelevant responses.

Future work includes expanding the instruction dataset, enhancing automatic reshooting capabilities, improving real-time performance, and exploring additional modalities for better assistance. The paper also discusses the potential of MLLMs to aid other disability groups and the role of prompt engineering and complementary modalities in improving VI assistance systems.
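As a concrete illustration of the training setup summarized above, the sketch below shows how LoRA adapters can be attached to a LLaVA-style model with the HuggingFace transformers and peft libraries. The checkpoint name, rank, alpha, and target modules are illustrative assumptions; the paper's exact fine-tuning configuration is not reproduced here.

```python
# Minimal sketch: parameter-efficient LoRA fine-tuning of a LLaVA-style model.
# The checkpoint and hyperparameters below are assumptions for illustration only.
import torch
from transformers import LlavaForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed base checkpoint
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained(model_id)

# Attach low-rank adapters to the attention projections so that only a small
# fraction of the weights is updated during instruction fine-tuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

The reported comparison against GPT-4V uses BERTScore and ROUGE as text-similarity metrics. A minimal way to compute them, assuming the common bert-score and rouge-score packages, is sketched below; the example strings are hypothetical.

```python
# Minimal sketch: scoring a generated answer against a reference answer.
from bert_score import score as bert_score
from rouge_score import rouge_scorer

candidate = "The photo is blurry; please move the camera closer to the label."  # hypothetical
reference = "The image is too blurry. Retake it closer to the product label."   # hypothetical

# BERTScore: token-level similarity computed in contextual embedding space.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")

# ROUGE: n-gram and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```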