LongLaMP: A Benchmark for Personalized Long-form Text Generation

15 Oct 2024 | Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A. Rossi, Franck Dernoncourt, Hanieh Deilamsalehi, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, Nedim Lipka, Chien Van Nguyen, Thien Huu Nguyen, Hamed Zamani
The LongLaMP benchmark is designed to evaluate personalized long-form text generation across four distinct tasks: personalized email generation, personalized abstract generation, personalized review generation, and personalized topic writing.

The benchmark provides a comprehensive evaluation framework with two settings. In the user setting, the model is tested on new users, with no overlap between the training and validation sets, simulating a cold-start scenario. In the temporal setting, the model is tested on previously seen users whose data is split in decreasing chronological order, assessing the model's ability to adapt to user preferences over time.

Personalization is achieved with a retrieval-augmented generation (RAG) framework, which retrieves relevant entries from a user's profile and integrates them into the LLM's input prompt. The framework consists of a query generation function, a retriever, a personalized prompt generation function, and the large language model itself.
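To make that pipeline concrete, the sketch below implements a minimal retrieval-augmented personalization loop of this kind. It is an illustration under assumptions, not the benchmark's released code: the term-overlap retriever, the prompt template, and the function names (query_generation, retrieve, personalized_prompt, generate_personalized) are hypothetical stand-ins for the paper's query generation function, retriever, prompt generation function, and LLM.

```python
# Minimal sketch of a retrieval-augmented personalization pipeline.
# All names are illustrative, not the LongLaMP implementation.
from collections import Counter
from typing import Callable, List


def query_generation(task_input: str) -> str:
    """Query generation function: here, simply reuse the task input as the query."""
    return task_input


def retrieve(query: str, profile: List[str], k: int = 3) -> List[str]:
    """Toy retriever: rank profile entries by term overlap with the query.
    A real setup would use a stronger sparse or dense retriever."""
    q_terms = Counter(query.lower().split())
    scored = [
        (sum(q_terms[t] for t in entry.lower().split() if t in q_terms), entry)
        for entry in profile
    ]
    return [entry for _, entry in sorted(scored, reverse=True)[:k]]


def personalized_prompt(task_input: str, retrieved: List[str]) -> str:
    """Prompt generation function: prepend retrieved user context to the task."""
    context = "\n".join(f"- {item}" for item in retrieved)
    return (
        "Previously written by this user:\n"
        f"{context}\n\n"
        f"Task: {task_input}\nWrite the output in this user's style."
    )


def generate_personalized(task_input: str, profile: List[str],
                          llm: Callable[[str], str]) -> str:
    """End-to-end pipeline: query -> retrieval -> personalized prompt -> LLM output."""
    query = query_generation(task_input)
    retrieved = retrieve(query, profile)
    return llm(personalized_prompt(task_input, retrieved))
```

In this sketch, `llm` is any callable that maps a prompt string to generated text, so the same loop can wrap different models without changing the retrieval or prompt-construction steps.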
Generated outputs are evaluated with ROUGE-1, ROUGE-L, and METEOR, which measure the similarity between the generated text and the target output.

The tasks span a diverse range of domains, audiences, purposes, writing styles, content types, credibility requirements, length constraints, and structural elements, and each dataset is curated with a rigorous filtering process to ensure quality and practical utility. The benchmark is designed to be easily extended with new models, tasks, and evaluation metrics.

The results demonstrate the effectiveness of the retrieval-augmented personalization framework, with improvements ranging from 5.7% to 128% across metrics compared to non-personalized baselines. These findings underscore the importance of personalization across a wide variety of long-form text generation tasks. The benchmark is publicly available for others to use and extend in their own research.
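For reference, the evaluation step described above can be reproduced in a few lines. The sketch below scores a single prediction/reference pair with ROUGE-1, ROUGE-L, and METEOR using the widely available rouge-score and nltk packages; it is not the benchmark's official scoring script, and the tokenized-input call to meteor_score assumes a recent NLTK release.

```python
# Sketch of scoring one generated output against its reference with the
# metrics named above. Assumes the third-party packages rouge-score and nltk.
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score
import nltk

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching


def score_output(prediction: str, reference: str) -> dict:
    """Return ROUGE-1, ROUGE-L, and METEOR F-scores for one prediction/reference pair."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, prediction)  # rouge_scorer signature: score(target, prediction)
    return {
        "rouge-1": rouge["rouge1"].fmeasure,
        "rouge-L": rouge["rougeL"].fmeasure,
        "meteor": meteor_score([reference.split()], prediction.split()),
    }


# Example usage with toy strings (not data from the benchmark).
print(score_output(
    "the model writes a short personalized review",
    "the model generates a short personalized product review",
))
```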