This paper introduces GRANOLA QA, a new evaluation setting for open-domain question answering that considers both the accuracy and informativeness of predicted answers. The key insight is that factual questions can be answered correctly at different levels of granularity (for example, a question about a person's birthplace can be answered with a city, a region, or a country), yet standard evaluation protocols often fail to account for this. To close this gap, the authors propose GRANOLA-EQ, a multi-granularity version of the ENTITYQUESTIONS dataset, and introduce a new decoding strategy called DRAG, which aims to align the granularity of the answer with the model's uncertainty.
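The evaluation setting can be illustrated with a small scoring sketch. The helper names, the containment-based string matching, and the exponential discount for coarser matches below are illustrative assumptions, not the paper's exact formulation; they only convey the idea of crediting coarse answers while rewarding finer ones more.

```python
def normalize(text: str) -> str:
    """Lowercase and keep alphanumerics/spaces for lenient string matching."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

def granola_eval(prediction: str, answers: list[str], discount: float = 0.5):
    """Score one prediction against reference answers ordered finest -> coarsest.

    Returns (accuracy, informativeness):
      accuracy        -- 1.0 if the prediction matches an answer at any granularity level
      informativeness -- discount**level of the finest matching answer (finest = 1.0)
    """
    pred = normalize(prediction)
    for level, ref in enumerate(answers):
        ref_norm = normalize(ref)
        if ref_norm and (ref_norm in pred or pred in ref_norm):
            return 1.0, discount ** level
    return 0.0, 0.0

# "England" counts as correct but less informative than "Shrewsbury".
print(granola_eval("Shrewsbury", ["Shrewsbury", "England", "United Kingdom"]))  # (1.0, 1.0)
print(granola_eval("England",    ["Shrewsbury", "England", "United Kingdom"]))  # (1.0, 0.5)
print(granola_eval("France",     ["Shrewsbury", "England", "United Kingdom"]))  # (0.0, 0.0)
```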
The authors evaluate a range of models on GRANOLA-EQ under several decoding strategies, including standard decoding and DRAG. They find that standard decoding tends to produce highly specific answers that are often incorrect, whereas DRAG substantially improves accuracy by producing coarser answers that still match the multi-granularity labels. Overall, DRAG outperforms the other methods on both accuracy and informativeness, and the results suggest that standard evaluation may underestimate how much factual knowledge large language models hold, especially about rare entities.
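One way to read DRAG's sample-then-aggregate idea is sketched below. The prompt wording, the number of samples, and the `generate(prompt, temperature)` interface are assumptions for illustration; the sketch only shows how disagreement among sampled answers can trigger a retreat to a coarser response.

```python
from typing import Callable

def drag_decode(question: str,
                generate: Callable[[str, float], str],
                num_samples: int = 5,
                temperature: float = 1.0) -> str:
    """Sketch of a sample-then-aggregate decoding step in the spirit of DRAG.

    1. Sample several candidate answers at non-zero temperature, so that
       disagreement between samples reflects the model's uncertainty.
    2. Ask the model to aggregate the candidates into a single answer,
       allowing it to fall back to a coarser answer when they disagree.
    `generate(prompt, temperature) -> str` is an assumed interface to the LLM.
    """
    samples = [generate(f"Answer concisely: {question}", temperature)
               for _ in range(num_samples)]
    aggregation_prompt = (
        f"Question: {question}\n"
        f"Candidate answers: {samples}\n"
        "Aggregate these candidates into one answer. If they disagree, "
        "respond at a coarser level of granularity that they all support."
    )
    # Greedy decoding for the final, aggregated answer.
    return generate(aggregation_prompt, 0.0)
```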
The paper also presents a methodology for enriching existing QA datasets with multi-granularity answers using an external knowledge graph and an LLM: progressively coarser versions of each gold answer are generated from the knowledge-graph properties of the answer entity. The resulting GRANOLA-EQ dataset contains 12,452 examples with an average of 2.9 multi-granularity answers per question.
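A rough sketch of this enrichment step is given below, under the assumption that answer entities are linked to a WikiData-style knowledge graph and that an LLM rewrites retrieved properties into coarser answer strings; the helper names, the prompt, and the specific properties queried are illustrative, not the paper's exact pipeline.

```python
from typing import Callable

# Properties that typically yield coarser descriptions of an entity
# (an illustrative choice; the actual property set is an assumption).
COARSENING_PROPERTIES = ["country", "continent", "instance of", "occupation"]

def enrich_with_multi_granularity(question: str,
                                  gold_answer: str,
                                  get_properties: Callable[[str], dict],
                                  llm: Callable[[str], str]) -> list[str]:
    """Turn a single gold answer into an ordered list of multi-granularity answers.

    `get_properties(entity) -> dict` is an assumed knowledge-graph lookup
    (e.g., a WikiData client) and `llm(prompt) -> str` an assumed text generator.
    """
    props = get_properties(gold_answer)
    facts = "; ".join(f"{p}: {props[p]}" for p in COARSENING_PROPERTIES if p in props)
    prompt = (
        f"Question: {question}\n"
        f"Exact answer: {gold_answer}\n"
        f"Known properties of the answer entity: {facts}\n"
        "List coarser answers that are still correct, one per line, "
        "ordered from most to least specific."
    )
    coarser = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    # Keep the exact answer first, then progressively coarser alternatives.
    return [gold_answer] + coarser
```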
The authors evaluate DRAG against several baselines, including standard decoding methods and IDK (I don't know) strategies. They find that DRAG improves both accuracy and informativeness, and that the knowledge evaluation gap is not observed when using semantic similarity scores against single-granularity reference answers. The results highlight the importance of considering answer granularity when evaluating the performance of large language models on factual questions.
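For comparison, an "I don't know" baseline of the kind referred to above can be sketched as a simple confidence threshold; the threshold value and the `answer_with_confidence` interface are assumptions, not the paper's exact baseline.

```python
from typing import Callable, Tuple

def idk_baseline(question: str,
                 answer_with_confidence: Callable[[str], Tuple[str, float]],
                 threshold: float = 0.5) -> str:
    """Abstain ("I don't know") whenever the model's confidence is below a threshold.

    `answer_with_confidence(question) -> (answer, confidence)` is an assumed
    interface, e.g. the greedy answer together with its sequence probability.
    Unlike DRAG, this baseline either commits to a fine-grained answer or
    abstains entirely; it never retreats to a coarser but still correct answer.
    """
    answer, confidence = answer_with_confidence(question)
    return answer if confidence >= threshold else "I don't know"
```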