The paper "Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers" by Gal Yona addresses the issue of factual errors in large language models (LLMs) by proposing a novel evaluation setting called GRANOLA QA. Standard QA evaluation protocols do not account for different levels of granularity in answers, leading to an underestimation of LLMs' knowledge. The authors introduce GRANOLA QA, which evaluates predicted answers against a set of multi-granularity answers, considering both accuracy and informativeness. They present a methodology to enrich existing datasets with multi-granularity answers and create GRANOLA-EQ, a multi-granularity version of the ENTITYQUESTIONS dataset. The paper evaluates various decoding methods, including a new algorithm called Decoding with Response Aggregation (DRAG), which aligns the answer granularity with the model's uncertainty. Experiments show that standard decoding methods tend to generate specific but incorrect answers, while DRAG improves accuracy by nearly 20 points on average, especially for rare entities. The study highlights the importance of considering answer granularity in evaluating LLMs' knowledge and proposes a practical approach to address the knowledge evaluation gap.The paper "Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers" by Gal Yona addresses the issue of factual errors in large language models (LLMs) by proposing a novel evaluation setting called GRANOLA QA. Standard QA evaluation protocols do not account for different levels of granularity in answers, leading to an underestimation of LLMs' knowledge. The authors introduce GRANOLA QA, which evaluates predicted answers against a set of multi-granularity answers, considering both accuracy and informativeness. They present a methodology to enrich existing datasets with multi-granularity answers and create GRANOLA-EQ, a multi-granularity version of the ENTITYQUESTIONS dataset. The paper evaluates various decoding methods, including a new algorithm called Decoding with Response Aggregation (DRAG), which aligns the answer granularity with the model's uncertainty. Experiments show that standard decoding methods tend to generate specific but incorrect answers, while DRAG improves accuracy by nearly 20 points on average, especially for rare entities. The study highlights the importance of considering answer granularity in evaluating LLMs' knowledge and proposes a practical approach to address the knowledge evaluation gap.