June 03-06, 2024 | Mazda Moayeri, Elham Tabassi, Soheil Feizi
WORLDBENCH is a benchmark that quantifies geographic disparities in the factual recall of large language models (LLMs). It uses World Bank data to measure how accurately LLMs recall factual information about individual countries. The study evaluates 20 state-of-the-art LLMs, including open-source models such as Llama-2 and Vicuna and closed-source models such as GPT-4 and Gemini. The results show significant geographic disparities: error rates are higher for countries in Sub-Saharan Africa and low-income regions than for Western and high-income countries. The study also finds that LLMs often produce false citations and that some models may be slightly out of date. Because it is flexible and dynamic, WORLDBENCH supports assessing LLM performance across regions and income levels. The findings highlight the need for further research to address geographic biases in LLMs and improve their fairness and reliability.
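As a rough illustration of how such a disparity could be quantified, the sketch below averages a per-country error over regions. It rests on assumptions: the summary does not state the error metric, so absolute relative error is used here as a plausible stand-in, and the function names, record fields, and numbers are all hypothetical rather than taken from the study.

```python
from collections import defaultdict


def relative_error(predicted: float, actual: float) -> float:
    """Absolute relative error (an assumed metric, not confirmed by the summary)."""
    return abs(predicted - actual) / abs(actual)


def mean_error_by_group(records, key):
    """Average the relative error over all records that share the same `key` value."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(
            relative_error(record["predicted"], record["actual"])
        )
    return {group: sum(errs) / len(errs) for group, errs in grouped.items()}


# Hypothetical toy records: made-up values, not results from the study.
records = [
    {"country": "A", "region": "Region X", "predicted": 110.0, "actual": 100.0},
    {"country": "B", "region": "Region X", "predicted": 90.0, "actual": 100.0},
    {"country": "C", "region": "Region Y", "predicted": 150.0, "actual": 100.0},
    {"country": "D", "region": "Region Y", "predicted": 50.0, "actual": 200.0},
]

errors = mean_error_by_group(records, "region")
for region, err in sorted(errors.items()):
    print(f"{region}: mean relative error = {err:.3f}")
```

Grouping by an income-level field instead of "region" would give the corresponding breakdown by income group; a gap between group means is the kind of disparity the benchmark is described as measuring.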