March 21, 2024 | Théophane Vallaeys¹, Mustafa Shukor², Matthieu Cord²,³, Jakob Verbeek¹
This paper presents an extensive experimental evaluation of mechanisms for interfacing large language models (LLMs) with perceptual backbones for data-efficient perceptual augmentation. The authors propose a unified framework to systematically compare these approaches across multiple tasks, datasets, and backbones, with a focus on low-data settings. They find that existing mechanisms can match or exceed state-of-the-art results, and identify a new mechanism, DePALM, which yields near-optimal results across tasks while cutting training time by up to 4×. DePALM uses token pooling to compress perceptual features into a few "summary tokens" that are injected into the LLM, achieving competitive performance at a fraction of the training time of other methods.

The experiments span image, video, and audio captioning as well as visual question answering, and show that DePALM consistently outperforms existing data- and parameter-efficient methods. The study further underlines the importance of data and compute efficiency in multimodal tasks: parameter-efficient approaches can outperform large-scale models in low-data settings, text-aligned perceptual encoders improve cross-modal interaction with the LLM, and stronger LLMs do not always yield better multimodal performance.

Overall, the paper provides a comprehensive analysis of data-efficient approaches for perceptual augmentation of LLMs, and positions DePALM as a promising and efficient method.
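To make the interface concrete, below is a minimal PyTorch sketch of a token-pooling mapping in the spirit of DePALM: it compresses a long sequence of perceptual encoder tokens into a few summary tokens and projects them to the LLM's embedding width, after which they are prepended to the text embeddings of a frozen LLM. The names (`TokenPooler`, `d_vis`, `d_llm`, `num_summary_tokens`) and the specific choice of adaptive average pooling are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Hypothetical sketch of a token-pooling interface between a perceptual
# encoder and an LLM; names and pooling choice are illustrative, not the
# paper's exact DePALM implementation.
import torch
import torch.nn as nn

class TokenPooler(nn.Module):
    """Compress a long sequence of perceptual tokens into k summary tokens.

    Here we mean-pool the encoder token sequence down to k tokens and
    linearly project them to the LLM embedding width; DePALM's actual
    pooling scheme may differ.
    """
    def __init__(self, d_vis: int, d_llm: int, num_summary_tokens: int = 8):
        super().__init__()
        self.k = num_summary_tokens
        self.proj = nn.Linear(d_vis, d_llm)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, n_tokens, d_vis), e.g. ViT patch features.
        # adaptive_avg_pool1d pools over the last dim, so move tokens there.
        pooled = nn.functional.adaptive_avg_pool1d(
            vis_tokens.transpose(1, 2), self.k  # (batch, d_vis, k)
        ).transpose(1, 2)                       # (batch, k, d_vis)
        return self.proj(pooled)                # (batch, k, d_llm)

# Usage: prepend the summary tokens to the (frozen) LLM's input embeddings.
pooler = TokenPooler(d_vis=1024, d_llm=4096, num_summary_tokens=8)
vis = torch.randn(2, 257, 1024)        # e.g. CLIP ViT-L/14 patch features
summary = pooler(vis)                  # (2, 8, 4096)
text_emb = torch.randn(2, 32, 4096)    # embedded prompt tokens (placeholder)
llm_inputs = torch.cat([summary, text_emb], dim=1)  # fed to the LLM
```

Because only the small pooling/projection module is trained while the encoder and LLM stay frozen, the trainable parameter count and per-step compute stay low, which is consistent with the data- and compute-efficiency focus of the paper.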