On Softmax Direct Preference Optimization for Recommendation

14 Jun 2024 | Yuxin Chen1*, Junfei Tan2*, An Zhang1†, Zhengyi Yang2, Leheng Sheng1, Enzhi Zhang3, Xiang Wang2, Tat-Seng Chua1
This paper addresses the limitations of current language model (LM)-based recommenders, which rely primarily on the language modeling loss for personalized ranking and therefore fail to fully exploit preference data. To improve recommendation performance, the authors propose Softmax Direct Preference Optimization (S-DPO), a loss function tailored for LM-based recommenders. S-DPO incorporates multiple negatives from user preference data and generalizes the pairwise Direct Preference Optimization (DPO) loss to a softmax ranking loss. The key contributions of S-DPO include:

1. **Theoretical Foundation**: The paper connects S-DPO with the softmax loss over negative sampling, highlighting its effectiveness in mining hard negatives and improving recommendation performance.
2. **Empirical Validation**: Extensive experiments on three real-world datasets (MovieLens, Goodreads, and LastFM) show that S-DPO outperforms both traditional and LM-based recommenders, with significant improvements in Hit Ratio@1.
3. **Ablation Study**: The study shows that S-DPO provides more effective gradients and mitigates DPO's data likelihood decline issue, making training more stable and efficient.
4. **Parameter Analysis**: The paper examines the impact of the hyperparameter β and the number of negative samples, identifying settings that yield better performance.

S-DPO is designed to explicitly instill ranking information into LMs, helping them distinguish preferred items from negatives more effectively; a minimal sketch of such a loss is given below. The authors believe that S-DPO has broader implications beyond recommender systems and can benefit other research areas.
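To make the step from pairwise DPO to a multi-negative softmax ranking loss concrete, here is a minimal PyTorch sketch. It assumes each candidate item is scored by the sequence log-probability that the policy LM and a frozen reference LM assign to it; the function names, tensor shapes, and the log-sum-exp formulation are an illustrative reading of the idea, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=1.0):
    """Pairwise DPO: one preferred item against a single negative."""
    pos_reward = beta * (logp_pos - ref_logp_pos)   # implicit reward of the preferred item
    neg_reward = beta * (logp_neg - ref_logp_neg)   # implicit reward of the negative item
    return -F.logsigmoid(pos_reward - neg_reward).mean()

def s_dpo_loss(logp_pos, logp_negs, ref_logp_pos, ref_logp_negs, beta=1.0):
    """Softmax-style generalization (sketch): the single negative is replaced by a
    log-sum-exp over a set of negatives, so negatives whose rewards sit closer to the
    preferred item's reward contribute larger gradients."""
    pos_reward = beta * (logp_pos - ref_logp_pos)        # shape: (batch,)
    neg_rewards = beta * (logp_negs - ref_logp_negs)     # shape: (batch, num_negs)
    lse = torch.logsumexp(neg_rewards - pos_reward.unsqueeze(-1), dim=-1)
    return -F.logsigmoid(-lse).mean()

# Toy usage with random sequence log-probabilities for 4 users and 3 negatives each.
if __name__ == "__main__":
    b, n = 4, 3
    logp_pos, ref_logp_pos = torch.randn(b), torch.randn(b)
    logp_negs, ref_logp_negs = torch.randn(b, n), torch.randn(b, n)
    print("pairwise DPO :", dpo_loss(logp_pos, logp_negs[:, 0],
                                     ref_logp_pos, ref_logp_negs[:, 0]).item())
    print("softmax S-DPO:", s_dpo_loss(logp_pos, logp_negs,
                                       ref_logp_pos, ref_logp_negs).item())
```

The log-sum-exp term lets the hardest negatives dominate the gradient, which matches the hard-negative-mining behavior the paper attributes to S-DPO.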