2024 | Yi Zhao, Yilin Zhang, Rong Xiang, Jing Li, Hillming Li
This paper presents VIALM, a task that evaluates how well large models (LMs) can assist visually impaired (VI) users in completing tasks given an environmental image and a linguistic request. The study investigates the potential and limitations of state-of-the-art (SOTA) LMs in visually impaired assistance (VIA) applications. The task requires generating step-by-step guidance for VI users to complete tasks in a given environment, with a focus on guidance that is environment-grounded and fine-grained. The study includes a survey of recent LM research and benchmark experiments examining selected LMs' capabilities in VIA.
The results indicate that while LMs can potentially benefit VIA, their outputs often lack environment grounding (25.7% of GPT-4's responses) and fine-grained guidance (32.1% of GPT-4's responses). The study also reveals that visually-focused LMs excel in environment understanding, while those with stronger language backbones are superior at generating easy-to-follow guidance. To overcome these limitations, the paper proposes potential solutions: (1) improving visual capabilities to enhance environment grounding, and (2) incorporating tactile modalities into language generation to produce more fine-grained guidance.
The paper introduces a novel task, VIALM, to extensively investigate how LMs transform the VIA landscape. It also thoroughly surveys important LM work applicable to VIA and constructs the first VIALM benchmark. The benchmark includes 200 visual environment images with paired questions and answers, covering two types of environments: supermarket and home. The study examines SOTA VLMs, including GPT-4, CogVLM, Qwen-VL, LLaVA, and MiniGPT-v2, to assess their zero-shot capabilities in VIA.
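To make the benchmark setup concrete, below is a minimal Python sketch of how a VIALM-style example (an environment image paired with a VI user's request and a reference answer) might be represented and run through a vision-language model zero-shot. The record layout, field names, and the query_vlm callable are illustrative assumptions for this sketch, not the authors' released data format or evaluation code.

```python
# Minimal sketch of a VIALM-style zero-shot evaluation loop.
# The JSON layout, field names, and query_vlm() interface are assumptions
# for illustration; they are not the authors' released format or API.
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class VIALMExample:
    image_path: str        # photo of the environment (supermarket or home)
    question: str          # the VI user's request, e.g. "Help me find the oat milk."
    reference_answer: str  # human-written, environment-grounded step-by-step guidance


def load_benchmark(json_file: Path) -> list[VIALMExample]:
    """Load examples from a JSON list of {image, question, answer} records (assumed layout)."""
    records = json.loads(json_file.read_text(encoding="utf-8"))
    return [VIALMExample(r["image"], r["question"], r["answer"]) for r in records]


def build_zero_shot_prompt(example: VIALMExample) -> str:
    """Compose an instruction asking the model for fine-grained, grounded guidance."""
    return (
        "You are assisting a visually impaired user. Based on the attached image, "
        "give step-by-step guidance that refers to concrete objects and positions "
        "in the environment.\n"
        f"User request: {example.question}"
    )


def evaluate(examples: list[VIALMExample],
             query_vlm: Callable[[str, str], str]) -> list[dict]:
    """Run each example through a vision-language model zero-shot and collect outputs.

    query_vlm(image_path, prompt) is a placeholder for whichever model is under test
    (e.g. a GPT-4 vision endpoint or a local CogVLM/Qwen-VL/LLaVA/MiniGPT-v2 wrapper).
    """
    results = []
    for ex in examples:
        guidance = query_vlm(ex.image_path, build_zero_shot_prompt(ex))
        results.append({"question": ex.question,
                        "prediction": guidance,
                        "reference": ex.reference_answer})
    return results
```

In this sketch the model is queried with no task-specific fine-tuning, mirroring the paper's zero-shot setting; the collected predictions would then be judged against the reference answers for environment grounding and fine-grained detail.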
The benchmark experiments identify two main limitations in current SOTA LMs: (1) a failure to generate environment-grounded guidance (25.7% of GPT-4's responses), and (2) a lack of fine-grained guidance (32.1% of GPT-4's responses), with a particular shortfall in integrating tactile sensation. To address these limitations, the paper proposes improving visual capabilities for stronger environment grounding and incorporating tactile modalities into language generation, and it identifies synergistically enhancing multimodal capabilities as a direction for future research. The study also open-sources its resources and summarizes the contributions of this work.