27 May 2024 | Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Yan Gao, Yao Hu, Enhong Chen
This paper introduces NoteLLM-2, a novel multimodal large representation model for recommendation tasks. The authors aim to enhance multimodal representation in item-to-item (I2I) recommendation by leveraging Large Language Models (LLMs) and vision encoders. Existing methods often rely on pre-trained multimodal large language models (MLLMs), which require extensive data and are costly to train. To address this, the authors propose an end-to-end training method that customizes the integration of any existing LLMs and vision encoders to construct efficient multimodal representation models.
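For context on the I2I setting, item-to-item recommendation here amounts to comparing learned item embeddings. The sketch below is not the authors' implementation; the function name `rank_related_items` and the embedding dimensions are illustrative assumptions, and the embeddings would in practice come from the multimodal representation model described in the paper.

```python
# Minimal sketch of I2I retrieval over item embeddings (an assumed workflow,
# not the paper's code): related items are ranked by cosine similarity.
import torch
import torch.nn.functional as F

def rank_related_items(query_emb: torch.Tensor, item_embs: torch.Tensor, k: int = 10):
    """query_emb: (dim,), item_embs: (num_items, dim) -> indices of the top-k items."""
    query = F.normalize(query_emb, dim=-1)
    items = F.normalize(item_embs, dim=-1)
    scores = items @ query                              # cosine similarity per item
    return torch.topk(scores, k=min(k, items.size(0))).indices

# Toy usage with random vectors standing in for real model outputs.
query = torch.randn(768)
catalog = torch.randn(1000, 768)
print(rank_related_items(query, catalog, k=5))
```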
The paper highlights a key challenge: when fine-tuned for multimodal representation tasks, LLMs tend to overlook visual information. To overcome this, the authors propose a novel training framework, NoteLLM-2, built on two mechanisms: multimodal In-Context Learning (mICL) and late fusion. mICL separates multimodal content into visual and textual components, so the model attends to both modalities rather than defaulting to text. Late fusion delays combining visual features with the LLM's output, fusing them into the final representation so that more visual information is retained.
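To make the two mechanisms concrete, here is a minimal PyTorch sketch under stated assumptions: a frozen vision encoder produces patch features, a connector projects them into the LLM embedding space, mICL is represented by reading out separate compressed embeddings for the visual and textual segments, and late fusion is shown as a learned gate that mixes pooled visual features back into the LLM's output embedding. All module and variable names (`VisualProjector`, `GatedLateFusion`, `micl_readout`) are illustrative, not the authors' API, and the exact fusion operator in the paper may differ.

```python
# Hedged sketch of mICL and late fusion (assumed structure, not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_feats):                     # (B, P, vision_dim)
        return self.proj(visual_feats)                   # (B, P, llm_dim)

class GatedLateFusion(nn.Module):
    """Late fusion: mix pooled visual features into the LLM's output embedding
    with a learned gate, so visual information is not diluted by the LLM."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_emb, visual_emb):             # both (B, dim)
        g = torch.sigmoid(self.gate(torch.cat([text_emb, visual_emb], dim=-1)))
        return g * text_emb + (1.0 - g) * visual_emb

def micl_readout(llm_hidden, visual_pos, text_pos):
    """mICL idea: read out separate compressed embeddings for the visual and
    textual segments from the LLM hidden states (B, T, dim)."""
    batch = torch.arange(llm_hidden.size(0))
    return llm_hidden[batch, visual_pos], llm_hidden[batch, text_pos]

# Toy usage with random tensors standing in for real encoder/LLM outputs.
B, P, T, vision_dim, dim = 2, 16, 32, 1024, 768
visual_feats = torch.randn(B, P, vision_dim)             # vision-encoder output
llm_hidden = torch.randn(B, T, dim)                      # LLM hidden states

projector = VisualProjector(vision_dim, dim)
fusion = GatedLateFusion(dim)

visual_tokens = projector(visual_feats)                  # would be fed into the LLM prompt
vis_emb, txt_emb = micl_readout(llm_hidden,
                                visual_pos=torch.tensor([10, 10]),
                                text_pos=torch.tensor([T - 1, T - 1]))
# vis_emb / txt_emb would each receive a contrastive objective during training (mICL);
# the final item embedding fuses pooled visual features with the text-side readout.
item_emb = F.normalize(fusion(txt_emb, visual_tokens.mean(dim=1)), dim=-1)
print(item_emb.shape)                                    # torch.Size([2, 768])
```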
The authors conduct extensive experiments to validate the effectiveness of their method. Results show that NoteLLM-2 significantly improves performance on multimodal representation tasks, particularly on item pairs with short textual content. The framework also retains more visual information and achieves stronger multimodal representation ability than existing methods. The paper concludes that NoteLLM-2 offers a promising solution for enhancing multimodal representation in recommendation scenarios, reducing reliance on pre-trained MLLMs while improving the efficiency and effectiveness of multimodal representation models.