The paper introduces CaLMQA, a multilingual long-form question answering (LFQA) dataset containing 1,500 culturally specific questions in 23 languages and 51 culturally agnostic questions translated into 22 languages. The dataset includes both naturally occurring questions from community forums and questions written by native speakers for under-resourced languages. The culturally specific questions are defined as those more likely to be asked by people from cultures associated with the question's language. The dataset covers a wide range of topics, including traditions, laws, and news, and reflects the language usage of native speakers.
The authors evaluate seven state-of-the-art models on their ability to answer CaLMQA questions. Because traditional short-form QA metrics correlate poorly with human preferences in LFQA, and metrics do not transfer easily between languages, the automatic evaluation relies on simple surface-level metrics such as language detection and token repetition. The models perform poorly in several low-resource languages, and culturally specific questions are generally rated lower than culturally agnostic ones; human evaluation confirms that models struggle most with culturally specific questions, especially in low-resource languages. To support future research, the authors release human reference answers in 11 languages alongside their evaluation framework. The results highlight the need for further work on non-English LFQA, more robust automatic evaluation metrics, and improved multilingual generation capabilities.
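To make the surface-level metrics concrete, the sketch below shows one way language detection and token repetition could be computed over a generated answer. The library choice (langdetect), function names, and the n-gram repetition heuristic are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of surface-level answer checks in the spirit of the
# paper's metrics (language detection and token repetition). The library,
# thresholds, and helper names are assumptions, not the authors' code.
from collections import Counter

from langdetect import detect  # pip install langdetect


def language_matches(answer: str, expected_lang: str) -> bool:
    """Return True if the detected language code of the answer matches the question's language."""
    try:
        return detect(answer) == expected_lang
    except Exception:
        # langdetect raises on empty or undetectable text; treat that as a mismatch.
        return False


def repetition_ratio(answer: str, n: int = 4) -> float:
    """Fraction of word n-grams that are duplicates; high values signal degenerate repetition."""
    tokens = answer.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


if __name__ == "__main__":
    ans = "Das Oktoberfest ist ein traditionelles Volksfest in München."
    print(language_matches(ans, "de"))  # True if the answer is detected as German
    print(repetition_ratio(ans))        # 0.0 for non-repetitive text
```

In this sketch, an answer would be flagged if it is not in the question's language or if its repetition ratio exceeds some chosen threshold; the actual metric definitions and thresholds would need to follow the paper.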