AffordanceLLM: Grounding Affordance from Vision Language Models

17 Apr 2024 | Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, Li Erran Li
AffordanceLLM is a novel approach that leverages the rich world knowledge embedded in large-scale vision language models (VLMs) to ground affordances from images. Affordance grounding is the task of identifying the regions of an object that can be interacted with, and it requires a comprehensive understanding of the object's 3D geometry, functionality, and spatial configuration. Traditional methods struggle to generalize to unseen objects; AffordanceLLM improves on this by incorporating world knowledge and 3D geometry information.

The model uses a VLM backbone (LLaVA) and extends it with a mask decoder and a special token whose embedding is decoded into an affordance map. Pseudo depth maps are also fed in as additional inputs to strengthen geometric reasoning.

The model is trained on the AGD20K benchmark, the only large-scale affordance grounding dataset with accurate action and object labels, and it outperforms state-of-the-art baselines on both the easy and hard splits, demonstrating superior generalization. It also generalizes to novel objects and actions that are never seen during training: evaluated on random Internet images, it produces reasonable affordance maps for objects very different from those in the training set. Ablation studies further validate the design, showing the importance of the text prompt, the image encoder, and the pseudo depth input.

Despite these results, the model has limitations: it can fail on ambiguous questions and does not always refer to the correct object when multiple objects are present.
Overall, AffordanceLLM demonstrates the effectiveness of leveraging world knowledge and 3D geometry for affordance grounding, enabling better generalization to in-the-wild objects.
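To make the described architecture concrete, here is a minimal, self-contained sketch of how the pieces could be wired together in PyTorch. The module names, feature dimensions, and placeholder encoders are illustrative assumptions rather than the authors' released code: the actual model uses LLaVA as the VLM backbone and a pretrained monocular depth estimator for the pseudo depth, which are replaced here by stand-in layers so the example runs on its own.

```python
# Minimal sketch of the AffordanceLLM-style wiring described above.
# All module names, shapes, and placeholder encoders are illustrative
# assumptions, not the authors' implementation: a VLM backbone produces a
# hidden state for a special affordance token, and a lightweight mask decoder
# turns that embedding plus image (and pseudo-depth) features into a dense map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskDecoder(nn.Module):
    """Decodes the special-token embedding + visual features into an affordance map."""

    def __init__(self, token_dim=4096, feat_dim=256, out_size=224):
        super().__init__()
        self.token_proj = nn.Linear(token_dim, feat_dim)   # project the LLM hidden state
        self.fuse = nn.Sequential(                          # fuse token with image features
            nn.Conv2d(feat_dim * 2, feat_dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 1, 1),                      # single-channel affordance logit
        )
        self.out_size = out_size

    def forward(self, aff_token, visual_feats):
        # aff_token: (B, token_dim); visual_feats: (B, feat_dim, H, W)
        b, _, h, w = visual_feats.shape
        tok = self.token_proj(aff_token).view(b, -1, 1, 1).expand(-1, -1, h, w)
        logits = self.fuse(torch.cat([visual_feats, tok], dim=1))
        logits = F.interpolate(logits, size=self.out_size,
                               mode="bilinear", align_corners=False)
        return torch.sigmoid(logits)                        # per-pixel affordance map


class AffordanceLLMSketch(nn.Module):
    """Stand-in for the full model: the real system uses LLaVA plus a depth
    estimator; here both are replaced by placeholders for self-containment."""

    def __init__(self, token_dim=4096, feat_dim=256):
        super().__init__()
        # Placeholder for the VLM that would read the image and action prompt.
        self.vlm_placeholder = nn.Linear(feat_dim, token_dim)
        # Placeholder visual encoder over RGB + pseudo-depth (4 input channels).
        self.visual_encoder = nn.Conv2d(4, feat_dim, kernel_size=16, stride=16)
        self.mask_decoder = MaskDecoder(token_dim, feat_dim)

    def forward(self, image, pseudo_depth):
        # image: (B, 3, 224, 224); pseudo_depth: (B, 1, 224, 224)
        rgbd = torch.cat([image, pseudo_depth], dim=1)
        visual_feats = self.visual_encoder(rgbd)            # (B, feat_dim, 14, 14)
        pooled = visual_feats.mean(dim=(2, 3))              # stand-in for the token's hidden state
        aff_token = self.vlm_placeholder(pooled)
        return self.mask_decoder(aff_token, visual_feats)


if __name__ == "__main__":
    model = AffordanceLLMSketch()
    img = torch.randn(2, 3, 224, 224)
    depth = torch.rand(2, 1, 224, 224)
    print(model(img, depth).shape)  # torch.Size([2, 1, 224, 224])
```

In the real system, the special token's hidden state comes from the language model after it has read the image tokens and the action prompt; only that final step of decoding the token into a dense affordance map is mirrored here.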