On Softmax Direct Preference Optimization for Recommendation

14 Jun 2024 | Yuxin Chen, Junfei Tan, An Zhang, Zhengyi Yang, Leheng Sheng, Enzhi Zhang, Xiang Wang, Tat-Seng Chua
This paper proposes Softmax-DPO (S-DPO), a loss function for language model (LM)-based recommenders that explicitly instills ranking information from user preference data. Traditional LM-based recommenders rely on the language modeling loss, which is not optimized for personalized ranking and fails to fully exploit preference data. S-DPO addresses this by incorporating multiple negative items from user preference data, yielding a variant of the DPO loss tailored for LM-based recommenders and connected to softmax sampling strategies.

Theoretically, S-DPO is bridged with the softmax loss over negative sampling, which highlights the critical role of multiple negatives and reveals a useful side effect: the loss mines hard negatives, which underpins its strong recommendation performance. Empirically, extensive experiments on three real-world datasets show that S-DPO consistently outperforms both traditional recommenders and state-of-the-art LM-based recommenders, while mitigating the data likelihood decline issue of DPO. S-DPO provides more effective ranking gradients and improves the stability of DPO training, and its hard-negative mining both boosts performance and accelerates training.

The study also examines the effect of explicit ranking optimization and of multiple negative samples, showing that each further boosts performance and that S-DPO achieves the best results among all baseline methods and variants. The paper concludes that S-DPO is a principled loss function specially tailored for LM-based recommenders, using multiple negatives in preference data to explicitly instill ranking information into the LM.
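To make the objective concrete, below is a minimal PyTorch sketch of a multi-negative DPO-style loss of the kind the summary describes: each item contributes a DPO implicit reward beta * log(pi_theta(i|x) / pi_ref(i|x)), and the dispreferred items are pooled with a log-sum-exp before the usual log-sigmoid. The function name, tensor shapes, and pooling details are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def s_dpo_style_loss(policy_pos_logp, ref_pos_logp,
                     policy_neg_logps, ref_neg_logps, beta=0.1):
    """Multi-negative DPO-style loss (illustrative sketch, not the official code).

    policy_pos_logp  : (B,)   log-prob of the preferred item under the policy LM
    ref_pos_logp     : (B,)   log-prob of the preferred item under the frozen reference LM
    policy_neg_logps : (B, N) log-probs of N dispreferred items under the policy LM
    ref_neg_logps    : (B, N) log-probs of N dispreferred items under the reference LM
    """
    # DPO-style implicit reward: beta * log(pi_theta(i|x) / pi_ref(i|x))
    pos_reward = beta * (policy_pos_logp - ref_pos_logp)       # (B,)
    neg_rewards = beta * (policy_neg_logps - ref_neg_logps)    # (B, N)

    # Pool the negatives with a log-sum-exp of (negative reward - positive reward);
    # with N = 1 this reduces to the standard pairwise DPO loss.
    pooled = torch.logsumexp(neg_rewards - pos_reward.unsqueeze(1), dim=1)  # (B,)

    return -F.logsigmoid(-pooled).mean()

# Example with random numbers standing in for sequence log-probabilities.
B, N = 4, 8
loss = s_dpo_style_loss(torch.randn(B), torch.randn(B),
                        torch.randn(B, N), torch.randn(B, N))
```

Pooling the negatives inside a log-sum-exp is what links this objective to the softmax loss over negative sampling mentioned above: negatives with higher implicit reward dominate the pooled term and receive larger gradients, which corresponds to the hard-negative-mining effect the paper highlights.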