PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs

13 Feb 2024 | Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek*, Yuki M. Asano*
The paper introduces Positional Insert (PIN), a lightweight module that unlocks the object localisation abilities of caption-based Vision-Language Models (VLMs) without altering their existing parameters. PIN is a learnable spatial prompt inserted into the frozen VLM, enabling it to perform zero-shot object localisation. The authors demonstrate that PIN substantially improves localisation performance across Pascal VOC, COCO, and LVIS, as well as on out-of-distribution images such as paintings and cartoons. The method is evaluated on the OpenFlamingo and BLIP-2 VLMs, showing marked gains in localisation accuracy. The PIN module is trained on a synthetic dataset of rendered objects superimposed on background images, which provides precise ground-truth locations.
The paper also analyses in detail the limitations of caption-based VLMs in object localisation and ablates the components of the PIN module, such as the depth of the feed-forward network and the type of positional embedding.
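To make the described mechanism concrete, here is a minimal pure-Python sketch of a PIN-style module. It assumes a hypothetical patch grid and feature dimension (the paper's actual shapes differ): a fixed sinusoidal positional embedding is refined by a small learnable feed-forward network and added element-wise to the frozen vision encoder's patch features. In the paper's setting only the PIN parameters would be trained; everything else stays frozen.

```python
# Illustrative sketch only -- names, shapes, and the ReLU FFN are assumptions,
# not the authors' exact implementation.
import math
import random

random.seed(0)

GRID = 4   # assumed 4x4 patch grid
DIM = 8    # assumed feature dimension


def sincos_embedding(n_pos, dim):
    """Fixed sinusoidal positional embedding: one vector per patch position."""
    emb = []
    for pos in range(n_pos):
        vec = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        emb.append(vec)
    return emb


class PIN:
    """Learnable spatial prompt: positional embedding -> 2-layer FFN."""

    def __init__(self, n_pos, dim):
        self.pos = sincos_embedding(n_pos, dim)
        # Learnable FFN weights (randomly initialised; these are the only
        # parameters that would be optimised during training).
        self.w1 = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
        self.w2 = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def forward(self):
        out = []
        for vec in self.pos:
            # Hidden layer with ReLU non-linearity.
            h = [max(0.0, sum(v * w for v, w in zip(vec, col)))
                 for col in zip(*self.w1)]
            # Linear output layer.
            out.append([sum(x * w for x, w in zip(h, col))
                        for col in zip(*self.w2)])
        return out


def apply_pin(frozen_features, pin):
    """Add the PIN output to frozen patch features, element-wise."""
    prompt = pin.forward()
    return [[f + p for f, p in zip(feat, pv)]
            for feat, pv in zip(frozen_features, prompt)]


n_patches = GRID * GRID
# Stand-in for the frozen vision encoder's patch features.
features = [[0.0] * DIM for _ in range(n_patches)]
pin = PIN(n_patches, DIM)
augmented = apply_pin(features, pin)
```

Because the prompt is purely additive, the frozen VLM's forward pass is unchanged apart from the injected spatial signal, which is what lets the method leave the pretrained weights untouched.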