VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments

3 Apr 2024 | Bufang Yang†, Lixing He†, Kaiwei Liu† and Zhenyu Yan†
The paper "VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments" by Bufang Yang, Lixing He, Kaiwei Liu, and Zhenyu Yan from the Chinese University of Hong Kong explores the challenges faced by visually impaired (VI) individuals in using multi-modal large language models (MLLMs) for visual question answering (VQA). The authors highlight that VI individuals often capture low-quality images due to their limited vision, which can lead to unreliable responses from MLLMs. To address this, they propose VIAssist, an MLLM tailored to enhance the adaptability of MLLMs to VI users' unique needs. VIAssist can identify and provide detailed actions for retaking low-quality images, and it can generate reliable answers to queries based on high-quality images. The system is trained using a dataset of VI-specific questions and images, along with aligned responses. The results show that VIAssist outperforms existing MLLMs in terms of BERTScore and ROUGE scores, providing more accurate and relevant responses. The paper also discusses the limitations of current MLLMs and future directions, including enriching the instruction dataset, enabling automatic reshooting, improving real-time efficiency, and exploring additional modalities and other types of impaired individuals. Overall, VIAssist demonstrates significant potential in enhancing the usability of MLLMs for VI individuals.The paper "VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments" by Bufang Yang, Lixing He, Kaiwei Liu, and Zhenyu Yan from the Chinese University of Hong Kong explores the challenges faced by visually impaired (VI) individuals in using multi-modal large language models (MLLMs) for visual question answering (VQA). The authors highlight that VI individuals often capture low-quality images due to their limited vision, which can lead to unreliable responses from MLLMs. To address this, they propose VIAssist, an MLLM tailored to enhance the adaptability of MLLMs to VI users' unique needs. VIAssist can identify and provide detailed actions for retaking low-quality images, and it can generate reliable answers to queries based on high-quality images. The system is trained using a dataset of VI-specific questions and images, along with aligned responses. The results show that VIAssist outperforms existing MLLMs in terms of BERTScore and ROUGE scores, providing more accurate and relevant responses. The paper also discusses the limitations of current MLLMs and future directions, including enriching the instruction dataset, enabling automatic reshooting, improving real-time efficiency, and exploring additional modalities and other types of impaired individuals. Overall, VIAssist demonstrates significant potential in enhancing the usability of MLLMs for VI individuals.