This paper introduces CDQA, a Chinese Dynamic Question Answering benchmark designed to evaluate the ability of Large Language Models (LLMs) to answer dynamic questions, particularly those concerning the latest news on the Chinese Internet. The dataset is constructed through a semi-automatic pipeline that combines human effort with model-generated data to ensure high-quality question-answer pairs. Each question is classified by how frequently its answer changes, enabling a finer-grained evaluation of LLMs. The benchmark contains 1,339 question-answer pairs, categorized as fast-changing, slow-changing, or never-changing.
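For illustration, a single benchmark entry might be represented roughly as follows. This is a minimal sketch: the field names, category labels, and the example question are assumptions made here for clarity, not the paper's released schema (questions and answers in CDQA are in Chinese).

```python
from dataclasses import dataclass
from enum import Enum


class AnswerChangeFrequency(Enum):
    # The three categories described in the paper.
    FAST_CHANGING = "fast-changing"    # answer changes frequently (e.g. latest news)
    SLOW_CHANGING = "slow-changing"    # answer changes occasionally
    NEVER_CHANGING = "never-changing"  # answer is stable over time


@dataclass
class CDQAItem:
    """One question-answer pair; field names are illustrative, not the actual schema."""
    question: str                      # question about events on the Chinese Internet
    answers: list[str]                 # accepted answer strings at collection time
    frequency: AnswerChangeFrequency   # how often the answer is expected to change
    collected_at: str                  # date the answer was verified, e.g. "2024-01-15"


# Hypothetical example (an English gloss; the real data is in Chinese):
item = CDQAItem(
    question="Who is the current head coach of the Chinese men's national football team?",
    answers=["..."],                   # elided; the correct answer depends on the date
    frequency=AnswerChangeFrequency.SLOW_CHANGING,
    collected_at="2024-01-15",
)
```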
The authors evaluate several LLMs, including GPT-4, Deepseek-67B-Chat, and others, on CDQA. Results show that GPT-4 performs strongly, especially when provided with search results, while Deepseek-67B-Chat is competitive in certain settings. The study also explores different prompting methods, such as Chain-of-Thought (CoT) and Rephrase-and-Respond (RaR), and their impact on LLM performance. The findings indicate that CoT and RaR can improve performance but may also increase hallucinations.
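As a rough illustration of how these prompting variants differ, the sketch below assembles a direct, a CoT, and a RaR prompt for a single question, optionally prepending retrieved search snippets. The template wording is an assumption for this sketch, not the exact prompts used in the paper.

```python
def build_prompt(
    question: str,
    style: str = "direct",
    search_results: list[str] | None = None,
) -> str:
    """Assemble an evaluation prompt; the templates here are illustrative only."""
    context = ""
    if search_results:
        # Retrieval-augmented setting: prepend search snippets as context.
        snippets = "\n".join(f"- {snippet}" for snippet in search_results)
        context = f"Search results:\n{snippets}\n\n"

    if style == "direct":
        return f"{context}Question: {question}\nAnswer:"
    if style == "cot":
        # Chain-of-Thought: ask the model to reason step by step before answering.
        return f"{context}Question: {question}\nLet's think step by step, then give the final answer."
    if style == "rar":
        # Rephrase-and-Respond: ask the model to restate the question before answering it.
        return (f"{context}Rephrase and expand the following question, "
                f"then answer the expanded question.\nQuestion: {question}")
    raise ValueError(f"unknown prompting style: {style}")
```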
The dataset is regularly updated to reflect the latest news and is intended as a valuable resource for improving Chinese LLMs. The authors highlight the importance of dynamic question answering in real-world applications and emphasize the need for further research in this area. The study also discusses the limitations of the current work, including its focus on the Chinese language and the need for more comprehensive evaluations. Overall, CDQA provides a new benchmark for evaluating LLMs in the Chinese context and offers insights into improving their ability to handle dynamic questions.