The paper introduces CALMQA, a dataset for long-form question answering (LFQA) that focuses on culturally specific questions across 23 languages. CALMQA includes 1,500 culturally specific questions and 51 culturally agnostic questions translated into 22 other languages. The culturally specific questions are collected from community web forums or, for under-resourced languages, written by native speakers. The dataset spans a wide range of cultural topics, such as traditions, laws, and news. The authors evaluate seven state-of-the-art models on CALMQA using automatic metrics for language detection and token repetition, finding that models perform poorly on low-resource languages. Human evaluation on a subset of models and languages reveals that model performance is significantly worse for culturally specific questions than for culturally agnostic ones. The findings highlight the need for further research on non-English LFQA, and the dataset provides an evaluation framework for such work.
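
To make the surface-level automatic checks concrete, here is a minimal sketch of how language identification and token repetition could be scored; it assumes the third-party langdetect package and a simple repeated-bigram ratio, and is an illustration rather than the paper's exact implementation.

```python
# Illustrative sketch (not the authors' code) of two surface-level automatic checks:
# does the answer come back in the expected language, and is it degenerately repetitive?
# Assumes: pip install langdetect; the repetition score is a repeated-bigram ratio.
from langdetect import detect


def language_matches(answer: str, expected_lang: str) -> bool:
    """Return True if the detected language code matches the expected one (e.g. 'en', 'de')."""
    try:
        return detect(answer) == expected_lang
    except Exception:
        # Detection can fail on empty or very short strings.
        return False


def repeated_bigram_ratio(answer: str) -> float:
    """Fraction of bigrams that are duplicates; higher values indicate repetitive output."""
    tokens = answer.split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return 1.0 - len(set(bigrams)) / len(bigrams)


# Example: a degenerate answer scores high on repetition.
print(repeated_bigram_ratio("the same the same the same the same"))  # ~0.57
print(language_matches("Ceci est une réponse en français.", "fr"))   # True
```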