3 Apr 2024 | Suzanne Petryk, David M. Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph E. Gonzalez, Trevor Darrell
The paper introduces ALOHa, a modernized open-vocabulary metric for detecting object hallucinations in captioning models. ALOHa uses a large language model (LLM) to extract groundable objects from a candidate caption, measures their semantic similarity to objects drawn from reference captions, and applies Hungarian matching to produce a final hallucination score. Compared to CHAIR, ALOHa correctly identifies 13.6% more hallucinated objects on the HAT dataset and 30.8% more on nocaps, which contains objects beyond the MS COCO categories. ALOHa is reliable, localizable, and generalizable, making it a significant advance in evaluating hallucination in captioning models. The paper also discusses ALOHa's limitations, including non-determinism, its dependence on reference captions, and the cost of LLM queries.
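The matching step can be sketched with SciPy's Hungarian solver. This is a minimal illustration, not the paper's exact implementation: the similarity values below are made up (in practice they would come from an embedding model), and the min-pooling of matched similarities into a caption-level score is an assumption about how per-object scores are aggregated.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def aloha_style_score(sim):
    """Given sim[i, j] = semantic similarity between candidate object i
    and reference object j, find the maximum-similarity one-to-one
    matching (Hungarian algorithm) and return per-object scores plus a
    caption-level score (here: the minimum matched similarity)."""
    n_cand, n_ref = sim.shape
    if n_cand > n_ref:
        # Pad with zero-similarity "dummy" references so every candidate
        # object receives an assignment; unmatched objects score 0.
        sim = np.hstack([sim, np.zeros((n_cand, n_cand - n_ref))])
    # linear_sum_assignment minimizes cost, so negate to maximize similarity.
    rows, cols = linear_sum_assignment(-sim)
    object_scores = sim[rows, cols]
    return object_scores, object_scores.min()


# Toy similarities: candidate objects ["dog", "frisbee", "unicorn"]
# vs. reference objects ["dog", "frisbee"]. Values are illustrative.
sim = np.array([
    [0.95, 0.10],  # "dog" closely matches reference "dog"
    [0.05, 0.90],  # "frisbee" matches reference "frisbee"
    [0.20, 0.15],  # "unicorn" matches nothing well -> likely hallucinated
])
scores, caption_score = aloha_style_score(sim)
print(scores, caption_score)
```

A low per-object score localizes the likely hallucination ("unicorn" above), while the caption-level score summarizes the worst case for the whole caption.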