This paper explores the biases in ChatGPT-based recommender systems, focusing on provider fairness (item-side fairness). Through extensive experiments and over a thousand API calls, the study investigates how prompt design strategies (structure, system role, and intent) affect evaluation metrics such as provider fairness, catalog coverage, temporal stability, and recency. The first experiment examines these strategies in classical top-K recommendation, while the second evaluates sequential in-context learning (ICL).
In the first experiment, seven distinct prompt scenarios are assessed for top-K recommendation accuracy and fairness. Accuracy-oriented prompts, such as Simple and Chain-of-Thought (CoT), outperform diversification prompts, which improve temporal freshness but reduce accuracy by up to 50%. Embedding fairness into the system role, e.g., "act as a fair recommender," proves more effective than fairness directives placed within the prompt itself. Diversification prompts lead to recommending newer movies and yield a broader genre distribution than traditional collaborative filtering (CF) models. The system shows high consistency across multiple runs.
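To make the contrast between the two fairness placements concrete, the sketch below builds both prompt variants with the OpenAI chat-completions interface. This is a minimal illustration, not the paper's exact setup: the model name, prompt wording, and movie-history format are assumptions.

```python
# Minimal sketch contrasting a fairness instruction in the system role
# versus inside the user prompt. Model name and prompt wording are
# illustrative assumptions, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

user_request = (
    "The user liked: Toy Story (1995), Heat (1995), GoldenEye (1995). "
    "Recommend 10 movies as a numbered list of titles only."
)

# Variant A: fairness embedded in the system role.
messages_system_fairness = [
    {"role": "system", "content": "Act as a fair movie recommender. "
     "Balance popular and less popular providers and genres."},
    {"role": "user", "content": user_request},
]

# Variant B: neutral system role, fairness directive appended to the prompt.
messages_prompt_fairness = [
    {"role": "system", "content": "You are a movie recommender."},
    {"role": "user", "content": user_request +
     " Make the list fair to less popular providers and genres."},
]

for name, messages in [("system-role fairness", messages_system_fairness),
                       ("in-prompt fairness", messages_prompt_fairness)]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed ChatGPT model
        messages=messages,
        temperature=0,
    )
    print(name, "->", response.choices[0].message.content)
```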
The second experiment explores sequential ICL, comparing zero-shot and few-shot learning scenarios. Results indicate that including user demographic information in prompts affects model biases and stereotypes. However, ICL does not consistently improve item fairness or catalog coverage over zero-shot learning: zero-shot prompting achieves higher NDCG and coverage, while ICL-2 shows slight improvements in hit rate (HR) when age-group context is included.
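A rough sketch of how the zero-shot and few-shot (ICL) prompts might be assembled is shown below; the message format, the reading of ICL-2 as two in-context demonstrations, and the age-group phrasing are all assumptions for illustration.

```python
# Sketch of zero-shot vs. few-shot (ICL) prompt construction for sequential
# recommendation, optionally adding age-group context. Prompt formats and the
# "two demonstrations" reading of ICL-2 are assumptions.
def build_messages(history, target_request, icl_examples=None, age_group=None):
    """Return a chat message list; icl_examples=None gives the zero-shot case."""
    profile = f" The user is in the {age_group} age group." if age_group else ""
    messages = [{"role": "system",
                 "content": "You are a movie recommender." + profile}]
    # Few-shot ICL: replay earlier (history, next item) pairs as demonstrations.
    for past_history, next_item in (icl_examples or []):
        messages.append({"role": "user",
                         "content": f"Watched so far: {past_history}. What comes next?"})
        messages.append({"role": "assistant", "content": next_item})
    messages.append({"role": "user",
                     "content": f"Watched so far: {history}. {target_request}"})
    return messages

# Zero-shot: no demonstrations, no demographic context.
zero_shot = build_messages("Toy Story (1995), Heat (1995)",
                           "Recommend the 10 most likely next movies.")

# ICL-2 style: two in-context demonstrations plus age-group context.
icl_2 = build_messages("Toy Story (1995), Heat (1995)",
                       "Recommend the 10 most likely next movies.",
                       icl_examples=[("GoldenEye (1995)", "Heat (1995)"),
                                     ("Jumanji (1995)", "Toy Story (1995)")],
                       age_group="25-34")
```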
Overall, the study provides insights into biases of RecLLMs, particularly in provider fairness and catalog coverage. By examining prompt design, learning strategies, and system roles, the research highlights the potential and challenges of integrating large language models into recommendation systems, paving the way for future research.