27 May 2024 | Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Yan Gao, Yao Hu, Enhong Chen
This paper introduces NoteLLM-2, a novel multimodal large representation model for recommendation tasks. The authors aim to enhance multimodal representation in item-to-item (I2I) recommendation by leveraging Large Language Models (LLMs) and vision encoders. Existing methods often rely on pre-trained multimodal large language models (MLLMs), which require extensive data and are costly to train. To address this, the authors propose an end-to-end training method that customizes the integration of any existing LLMs and vision encoders to construct efficient multimodal representation models.
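For context on the I2I setting, item-to-item recommendation here amounts to comparing learned item embeddings. The sketch below is not the authors' implementation; the function name `rank_related_items` and the embedding dimensions are illustrative assumptions, and the embeddings would in practice come from the multimodal representation model described in the paper.

```python
# Minimal sketch of I2I retrieval over item embeddings (an assumed workflow,
# not the paper's code): related items are ranked by cosine similarity.
import torch
import torch.nn.functional as F

def rank_related_items(query_emb: torch.Tensor, item_embs: torch.Tensor, k: int = 10):
    """query_emb: (dim,), item_embs: (num_items, dim) -> indices of the top-k items."""
    query = F.normalize(query_emb, dim=-1)
    items = F.normalize(item_embs, dim=-1)
    scores = items @ query                              # cosine similarity per item
    return torch.topk(scores, k=min(k, items.size(0))).indices

# Toy usage with random vectors standing in for real model outputs.
query = torch.randn(768)
catalog = torch.randn(1000, 768)
print(rank_related_items(query, catalog, k=5))
```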
The paper highlights a key challenge: when fine-tuned for multimodal representation tasks, LLMs tend to overlook visual information. To overcome this, the authors propose a novel training framework, NoteLLM-2, built on two mechanisms: multimodal In-Context Learning (mICL) and late fusion. mICL separates multimodal content into visual and textual components, so the model attends to both modalities rather than defaulting to text. Late fusion delays combining visual features with the LLM's output, fusing them into the final representation so that more visual information is retained.
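To make the two mechanisms concrete, here is a minimal PyTorch sketch under stated assumptions: a frozen vision encoder produces patch features, a connector projects them into the LLM embedding space, mICL is represented by reading out separate compressed embeddings for the visual and textual segments, and late fusion is shown as a learned gate that mixes pooled visual features back into the LLM's output embedding. All module and variable names (`VisualProjector`, `GatedLateFusion`, `micl_readout`) are illustrative, not the authors' API, and the exact fusion operator in the paper may differ.

```python
# Hedged sketch of mICL and late fusion (assumed structure, not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_feats):                     # (B, P, vision_dim)
        return self.proj(visual_feats)                   # (B, P, llm_dim)

class GatedLateFusion(nn.Module):
    """Late fusion: mix pooled visual features into the LLM's output embedding
    with a learned gate, so visual information is not diluted by the LLM."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_emb, visual_emb):             # both (B, dim)
        g = torch.sigmoid(self.gate(torch.cat([text_emb, visual_emb], dim=-1)))
        return g * text_emb + (1.0 - g) * visual_emb

def micl_readout(llm_hidden, visual_pos, text_pos):
    """mICL idea: read out separate compressed embeddings for the visual and
    textual segments from the LLM hidden states (B, T, dim)."""
    batch = torch.arange(llm_hidden.size(0))
    return llm_hidden[batch, visual_pos], llm_hidden[batch, text_pos]

# Toy usage with random tensors standing in for real encoder/LLM outputs.
B, P, T, vision_dim, dim = 2, 16, 32, 1024, 768
visual_feats = torch.randn(B, P, vision_dim)             # vision-encoder output
llm_hidden = torch.randn(B, T, dim)                      # LLM hidden states

projector = VisualProjector(vision_dim, dim)
fusion = GatedLateFusion(dim)

visual_tokens = projector(visual_feats)                  # would be fed into the LLM prompt
vis_emb, txt_emb = micl_readout(llm_hidden,
                                visual_pos=torch.tensor([10, 10]),
                                text_pos=torch.tensor([T - 1, T - 1]))
# vis_emb / txt_emb would each receive a contrastive objective during training (mICL);
# the final item embedding fuses pooled visual features with the text-side readout.
item_emb = F.normalize(fusion(txt_emb, visual_tokens.mean(dim=1)), dim=-1)
print(item_emb.shape)                                    # torch.Size([2, 768])
```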
The authors conduct extensive experiments to validate the effectiveness of their method. Results show that NoteLLM-2 significantly improves performance on multimodal representation tasks, particularly on item pairs with short textual content. The framework also retains more visual information and achieves stronger multimodal representation ability than existing methods. The paper concludes that NoteLLM-2 offers a promising solution for enhancing multimodal representation in recommendation scenarios, reducing reliance on pre-trained MLLMs while improving the efficiency and effectiveness of multimodal representation models.