10 Feb 2024 | Yi Zhao, Yilin Zhang, Rong Xiang, Jing Li, and Hillming Li
The paper "VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models" by Yi Zhao, Yilin Zhang, Rong Xiang, Jing Li, and Hillming Li explores the potential of large models (LMs) in enhancing visually impaired assistance (VIA). The authors define a novel task called Visual Impaired Assistance with Language Models (VIALM), where LMs provide step-by-step guidance to visually impaired users based on images and linguistic requests. The study includes a comprehensive survey of relevant LMs and a benchmark experiment to evaluate their capabilities in VIA tasks.
Key findings from the benchmark experiments include:
1. **Environment Grounding**: LMs struggle to generate environment-grounded guidance, with 25.7% of GPT-4 responses failing to do so.
2. **Fine-Grained Guidance**: LMs also struggle to provide fine-grained guidance, with 32.1% of GPT-4 responses lacking detailed, actionable instructions (see the tallying sketch below).
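For context on how such percentages could be produced, here is a small, illustrative sketch that tallies failure rates from per-response annotations. The annotation schema (boolean `grounded` and `fine_grained` flags per response) is an assumption for illustration; the paper's own annotation protocol may differ.

```python
# Illustrative tally of failure rates from per-response annotations.
# The field names "grounded" and "fine_grained" are assumptions, not the paper's schema.
from typing import Iterable, Mapping

def failure_rates(annotations: Iterable[Mapping[str, bool]]) -> dict:
    """Percentage of responses failing each criterion."""
    items = list(annotations)
    n = len(items)
    return {
        "not_grounded_pct": 100 * sum(not a["grounded"] for a in items) / n,
        "not_fine_grained_pct": 100 * sum(not a["fine_grained"] for a in items) / n,
    }

# Example: 2 of 7 responses are not grounded, 3 of 7 are not fine-grained.
sample = [
    {"grounded": True,  "fine_grained": True},
    {"grounded": True,  "fine_grained": False},
    {"grounded": False, "fine_grained": True},
    {"grounded": True,  "fine_grained": True},
    {"grounded": False, "fine_grained": False},
    {"grounded": True,  "fine_grained": False},
    {"grounded": True,  "fine_grained": True},
]
print(failure_rates(sample))  # {'not_grounded_pct': 28.57..., 'not_fine_grained_pct': 42.85...}
```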
The authors propose directions for addressing these limitations, such as strengthening visual grounding and incorporating tactile modalities into language generation. They also argue that improvements to the visual and language components should be pursued jointly to boost overall effectiveness.
The study contributes to the field by introducing VIALM, conducting a thorough survey of LMs, and providing a benchmark dataset and evaluation metrics to assess their zero-shot VIA capabilities. The results suggest that while LMs have significant potential, they need further development to better support visually impaired individuals in their daily activities.