AffordanceLLM: Grounding Affordance from Vision Language Models

17 Apr 2024 | Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, Li Erran Li
AffordanceLLM is a novel approach that leverages the rich world knowledge embedded in large-scale vision language models (VLMs) to ground affordances from images. Affordance grounding is the task of identifying the regions of an object that can be interacted with, and it requires a comprehensive understanding of the object's 3D geometry, functionality, and spatial configuration. Traditional methods struggle to generalize to unseen objects; AffordanceLLM improves on this by incorporating world knowledge and 3D geometry information.

The model uses a VLM backbone (LLaVA) and extends it with a mask decoder and a special token whose embedding is decoded into an affordance map. Pseudo depth maps are also fed in as additional inputs to strengthen geometric reasoning.

The model is trained on the AGD20K benchmark, the only large-scale affordance grounding dataset with accurate action and object labels, and it outperforms state-of-the-art baselines on both the easy and hard splits, demonstrating superior generalization. It also generalizes to novel objects and actions that are never seen during training: evaluated on random Internet images, it produces reasonable affordance maps for objects very different from those in the training set. Ablation studies further validate the design, showing the importance of the text prompt, the image encoder, and the pseudo depth input.

Despite these results, the model has limitations: it can fail on ambiguous questions and does not always refer to the correct object when multiple objects are present.
Overall, AffordanceLLM demonstrates the effectiveness of leveraging world knowledge and 3D geometry for affordance grounding, enabling better generalization to in-the-wild objects.
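To make the described architecture concrete, here is a minimal, self-contained sketch of how the pieces could be wired together in PyTorch. The module names, feature dimensions, and placeholder encoders are illustrative assumptions rather than the authors' released code: the actual model uses LLaVA as the VLM backbone and a pretrained monocular depth estimator for the pseudo depth, which are replaced here by stand-in layers so the example runs on its own.

```python
# Minimal sketch of the AffordanceLLM-style wiring described above.
# All module names, shapes, and placeholder encoders are illustrative
# assumptions, not the authors' implementation: a VLM backbone produces a
# hidden state for a special affordance token, and a lightweight mask decoder
# turns that embedding plus image (and pseudo-depth) features into a dense map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskDecoder(nn.Module):
    """Decodes the special-token embedding + visual features into an affordance map."""

    def __init__(self, token_dim=4096, feat_dim=256, out_size=224):
        super().__init__()
        self.token_proj = nn.Linear(token_dim, feat_dim)   # project the LLM hidden state
        self.fuse = nn.Sequential(                          # fuse token with image features
            nn.Conv2d(feat_dim * 2, feat_dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 1, 1),                      # single-channel affordance logit
        )
        self.out_size = out_size

    def forward(self, aff_token, visual_feats):
        # aff_token: (B, token_dim); visual_feats: (B, feat_dim, H, W)
        b, _, h, w = visual_feats.shape
        tok = self.token_proj(aff_token).view(b, -1, 1, 1).expand(-1, -1, h, w)
        logits = self.fuse(torch.cat([visual_feats, tok], dim=1))
        logits = F.interpolate(logits, size=self.out_size,
                               mode="bilinear", align_corners=False)
        return torch.sigmoid(logits)                        # per-pixel affordance map


class AffordanceLLMSketch(nn.Module):
    """Stand-in for the full model: the real system uses LLaVA plus a depth
    estimator; here both are replaced by placeholders for self-containment."""

    def __init__(self, token_dim=4096, feat_dim=256):
        super().__init__()
        # Placeholder for the VLM that would read the image and action prompt.
        self.vlm_placeholder = nn.Linear(feat_dim, token_dim)
        # Placeholder visual encoder over RGB + pseudo-depth (4 input channels).
        self.visual_encoder = nn.Conv2d(4, feat_dim, kernel_size=16, stride=16)
        self.mask_decoder = MaskDecoder(token_dim, feat_dim)

    def forward(self, image, pseudo_depth):
        # image: (B, 3, 224, 224); pseudo_depth: (B, 1, 224, 224)
        rgbd = torch.cat([image, pseudo_depth], dim=1)
        visual_feats = self.visual_encoder(rgbd)            # (B, feat_dim, 14, 14)
        pooled = visual_feats.mean(dim=(2, 3))              # stand-in for the token's hidden state
        aff_token = self.vlm_placeholder(pooled)
        return self.mask_decoder(aff_token, visual_feats)


if __name__ == "__main__":
    model = AffordanceLLMSketch()
    img = torch.randn(2, 3, 224, 224)
    depth = torch.rand(2, 1, 224, 224)
    print(model(img, depth).shape)  # torch.Size([2, 1, 224, 224])
```

In the real system, the special token's hidden state comes from the language model after it has read the image tokens and the action prompt; only that final step of decoding the token into a dense affordance map is mirrored here.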