Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations

2024 | Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, Yinghai Lu, Yu Shi
This paper introduces Generative Recommenders (GRs), a new paradigm for recommendation systems that reformulates ranking and retrieval as sequential transduction tasks. GRs are designed to handle high-cardinality, non-stationary streaming data and outperform existing Deep Learning Recommendation Models (DLRMs) in both quality and efficiency. The key innovation is the Hierarchical Sequential Transduction Unit (HSTU), a new encoder architecture that significantly improves the efficiency of self-attention, enabling faster training and inference; HSTU is up to 15.2x faster than FlashAttention2-based Transformers on sequences of length 8192.

GRs have been deployed on a large internet platform with billions of users, achieving a 12.4% metric improvement in online A/B tests. Model quality scales as a power law of training compute up to GPT-3/LLaMa-2 scale, which reduces the carbon footprint of future model development, and a unified feature space enables the first foundation models for recommendations that can be used across domains. The paper also presents M-FALCON, a new algorithm for efficient inference with complex GR models. Experiments show that GRs outperform DLRMs in both offline and online settings and scale better with compute and parameters, significantly improving recommendation quality while reducing computational cost.
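To make the "sequential transduction" framing concrete, the sketch below treats a user's chronological actions as a token sequence and trains a model to predict the next action, analogous to language modeling. This is only a minimal illustration, not the paper's HSTU architecture: the encoder is a plain causal Transformer stand-in, and all class names, hyperparameters, and shapes are assumptions made for the example.

```python
# Illustrative sketch (assumed, simplified): recommendation as next-item
# prediction over a user's action sequence. Not the paper's HSTU encoder.
import torch
import torch.nn as nn


class NextItemRecommender(nn.Module):
    def __init__(self, num_items: int, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, num_items)  # scores over the item vocabulary

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        # item_ids: (batch, seq_len) of historical item interactions, oldest first.
        x = self.item_emb(item_ids)
        seq_len = item_ids.size(1)
        # Causal mask so position t only attends to positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.encoder(x, mask=mask)
        return self.head(h)  # (batch, seq_len, num_items): next-item logits per step


# Toy usage: predict action t+1 from actions up to t, as in language modeling.
model = NextItemRecommender(num_items=1000)
history = torch.randint(0, 1000, (8, 32))   # 8 users, 32 actions each (synthetic)
logits = model(history[:, :-1])             # predictions for positions 1..31
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), history[:, 1:].reshape(-1))
loss.backward()
```

In the paper, this generic causal self-attention block is where HSTU differs: its encoder is designed for the high-cardinality, non-stationary item vocabularies of recommendation and is substantially faster than softmax-attention Transformers at long sequence lengths.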