Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question Answering Benchmark

2 Mar 2024 | Zhikun Xu, Yinghui Li*, Ruixue Ding†, Xinyu Wang, Boli Chen, Yong Jiang†, Hai-Tao Zheng, Wenlian Lu, Pengjun Xie, Fei Huang
The paper introduces *CDQA* (Chinese Dynamic Question Answering), a benchmark designed to evaluate how well Chinese Large Language Models (LLMs) handle dynamic questions tied to the latest news on the Chinese Internet. Such questions have answers that evolve over time, and answering them correctly is crucial for the practical deployment of LLMs in real-world scenarios.

*CDQA* is constructed through a semi-automatic data generation pipeline that combines automatic entity extraction and question generation with manual annotation to ensure the quality and relevance of the data. The resulting dataset contains 1,339 question-answer pairs, categorized by how frequently the answer changes (fast-changing, slow-changing, and never-changing) to enable a fine-grained evaluation of LLMs.

The paper evaluates a range of Chinese LLMs, including GPT-4, Deepseek-67B-Chat, and others, in both closed-book and open-book settings. GPT-4 performs best, with an F1-recall score of nearly 10, followed by Deepseek-67B-Chat. The study also examines the impact of different prompting methods (Vanilla, Chain-of-Thought, and Rephrase-and-Respond) and of the search engine supplying evidence (Google or Bing) on LLM performance. The findings underscore the importance of retrieval-augmented question answering and the need for further research to improve LLMs' ability to handle dynamic questions.

The paper concludes by discussing the benchmark's limitations, such as its focus on the Chinese language and the challenge of keeping the dataset up to date, and emphasizes the potential of *CDQA* as a valuable resource for advancing LLMs in Chinese contexts.
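F1-recall is the benchmark's headline metric. As a point of reference, the snippet below is a minimal sketch of the token-overlap F1/recall commonly used in open-domain QA, assuming character-level tokenization for Chinese; the exact scoring script used for *CDQA* may differ.

```python
from collections import Counter

def token_f1_recall(prediction: str, gold: str):
    """Token-overlap F1 and recall between a model answer and a gold answer.

    Minimal sketch: character-level tokenization is a stand-in for Chinese text,
    not necessarily the tokenization used by CDQA's official evaluation code.
    """
    pred_tokens = list(prediction.strip())
    gold_tokens = list(gold.strip())
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, recall

# Example: score a partially correct model answer against the gold answer.
f1, recall = token_f1_recall("2023年11月", "2023年11月20日")
print(f"F1={f1:.2f}, recall={recall:.2f}")
```

The open-book setting feeds search-engine results to the model before it answers, and the evaluation varies the prompting strategy on top of that. The sketch below is a hypothetical illustration of how such prompts could be assembled; the Chinese template wording, snippet format, and function name are assumptions, not the paper's released code.

```python
# Hypothetical prompt templates for the three prompting methods discussed in the paper.
VANILLA = "问题：{question}\n答案："
CHAIN_OF_THOUGHT = "问题：{question}\n请一步一步思考，然后给出最终答案。"
REPHRASE_AND_RESPOND = "问题：{question}\n请先用自己的话改写这个问题，再回答改写后的问题。"

def build_open_book_prompt(question: str, snippets: list[str], template: str = VANILLA) -> str:
    """Combine retrieved snippets (e.g. from Google or Bing) with a prompting template."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return f"请根据以下搜索结果回答问题。\n搜索结果：\n{context}\n" + template.format(question=question)
```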