TOWARDS UNIFIED MULTI-MODAL PERSONALIZATION: LARGE VISION-LANGUAGE MODELS FOR GENERATIVE RECOMMENDATION AND BEYOND

27 Mar 2024 | Tianxin Wei, Bowen Jin, Ruirui Li, Hansi Zeng, Zhengyang Wang, Jianhui Sun, Qingyu Yin, Hanqing Lu, Suhang Wang, Jingrui He, Xianfeng Tang
The paper introduces UniMP (Unified Multi-modal Personalization), a novel framework designed to integrate and leverage multi-modal data for personalized recommendation and beyond. UniMP aims to address the limitations of traditional personalized systems, which often struggle with handling heterogeneous data types such as images, texts, and product IDs. The framework proposes a unified data format that seamlessly incorporates various types of user history information, enabling fine-grained multi-modal information extraction and alignment. The architecture includes a vision model for extracting visual elements and a language model for reasoning and generation based on user history. The multi-modal tasks are integrated into a token generation framework, where each task is formulated as a next-token prediction objective. To optimize multi-task learning, the framework introduces token-level re-weighting and context reconstruction techniques. Extensive experiments on real-world datasets demonstrate that UniMP outperforms specialized methods in various personalized tasks, including recommendation, preference prediction, and image generation. The paper also presents a comprehensive benchmark covering a wide range of user requirements, showcasing the effectiveness and generalizability of UniMP.
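To make the "next-token prediction with token-level re-weighting" objective concrete, here is a minimal sketch of what such a loss can look like. This is not the authors' implementation: the function name `reweighted_next_token_loss`, the tensor shapes, and the choice of which tokens to up-weight are illustrative assumptions; the sketch only shows the general idea of scaling the per-token cross-entropy by per-token weights before averaging.

```python
# Minimal sketch (assumed, not the paper's code): next-token prediction loss
# where each target token's cross-entropy is scaled by a per-token weight,
# e.g. up-weighting item-ID or attribute tokens over generic context tokens.

import torch
import torch.nn.functional as F


def reweighted_next_token_loss(logits, targets, token_weights, ignore_index=-100):
    """Token-level re-weighted cross-entropy.

    logits:        (batch, seq_len, vocab_size) model outputs
    targets:       (batch, seq_len) ground-truth next tokens
    token_weights: (batch, seq_len) per-token weights (hypothetical scheme)
    """
    vocab_size = logits.size(-1)
    # Per-token cross-entropy, keeping the (batch, seq_len) shape.
    ce = F.cross_entropy(
        logits.reshape(-1, vocab_size),
        targets.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).reshape(targets.shape)
    # Mask out padding / ignored positions before averaging.
    mask = (targets != ignore_index).float()
    weighted = ce * token_weights * mask
    return weighted.sum() / mask.sum().clamp(min=1.0)


if __name__ == "__main__":
    batch, seq_len, vocab = 2, 5, 100
    logits = torch.randn(batch, seq_len, vocab)
    targets = torch.randint(0, vocab, (batch, seq_len))
    # Hypothetical weighting: double the weight of the last two target tokens
    # (e.g. a generated item identifier) relative to the context tokens.
    weights = torch.ones(batch, seq_len)
    weights[:, -2:] = 2.0
    print(reweighted_next_token_loss(logits, targets, weights).item())
```

Under this formulation, recommendation, preference prediction, and explanation generation all reduce to generating target tokens conditioned on the multi-modal user history; the re-weighting simply biases training toward the tokens that matter most for each task.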