WORLD BENCH: Quantifying Geographic Disparities in LLM Factual Recall

June 03–06, 2024, Rio de Janeiro, Brazil | Mazda Moayeri, Elham Tabassi, Soheil Feizi
WorldBench is a benchmark designed to assess geographic disparities in the factual recall capabilities of large language models (LLMs). The study uses World Bank data to evaluate how accurately LLMs recall factual information about different countries. It evaluates 20 state-of-the-art LLMs, including both open-source and private models, on 11 global development indicators. The results reveal significant geographic disparities in LLM performance, with error rates 1.5 times higher for countries in Sub-Saharan Africa than for North American countries. These disparities are consistent across all 20 LLMs and 11 indicators: LLMs are most accurate for countries in Western regions and high-income categories. The study also identifies citation hallucination, where models cite the World Bank while providing false statistics, and highlights that some LLMs may be slightly out of date. WorldBench provides a flexible and dynamic benchmark for assessing LLM performance across regions and income levels, with the aim of addressing biases and improving the fairness of LLMs. The findings underscore the need for further research to ensure that LLMs work reliably for all regions and income groups.
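To make the "1.5 times higher" comparison concrete, here is a minimal sketch of how a regional disparity ratio could be computed. It assumes error is measured as absolute relative error against the reference value, which is an assumption: the summary does not state the metric. The function names, region labels, and numbers are hypothetical and are not WorldBench data.

```python
from collections import defaultdict
from statistics import mean


def relative_error(predicted: float, actual: float) -> float:
    """Absolute relative error of a model's answer (assumed metric)."""
    return abs(predicted - actual) / abs(actual)


def error_by_region(records):
    """Average per-country errors within each region.

    `records` is an iterable of (region, predicted, actual) tuples.
    """
    grouped = defaultdict(list)
    for region, predicted, actual in records:
        grouped[region].append(relative_error(predicted, actual))
    return {region: mean(errors) for region, errors in grouped.items()}


def disparity_ratio(errors, worse: str, better: str) -> float:
    """How many times higher the mean error is in `worse` than in `better`."""
    return errors[worse] / errors[better]


# Made-up toy values for illustration only (not WorldBench data).
toy_records = [
    ("Region A", 110.0, 100.0),  # model answer 10% too high
    ("Region A", 45.0, 50.0),    # model answer 10% too low
    ("Region B", 130.0, 100.0),  # model answer 30% too high
    ("Region B", 40.0, 50.0),    # model answer 20% too low
]

errors = error_by_region(toy_records)
ratio = disparity_ratio(errors, worse="Region B", better="Region A")
print(f"Mean error by region: {errors}")
print(f"Disparity ratio (Region B / Region A): {ratio:.2f}")
```

A ratio above 1 means the model is less accurate for the first region. The same grouping could be keyed by income level instead of region.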