ALOHa: A New Measure for Hallucination in Captioning Models

3 Apr 2024 | Suzanne Petryk, David M. Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph E. Gonzalez, Trevor Darrell
ALOHa is a new metric for detecting object hallucination in captioning models. It leverages large language models (LLMs) to measure hallucinations by extracting groundable objects from candidate captions, measuring their semantic similarity to reference objects, and using Hungarian matching to produce a final hallucination score.

ALOHa outperforms existing metrics such as CHAIR and CLIPScore at detecting hallucinations, especially when the described objects are not present in the image. It is reliable, localizable, and generalizable, detecting hallucinations across a wide range of input datasets and object categories. ALOHa is evaluated on HAT, a new gold-standard dataset of captions with labeled hallucinations, and also performs well on two further datasets, FOIL and nocaps-FOIL, demonstrating its effectiveness beyond the MS COCO object set. ALOHa correctly identifies more hallucinated objects than CHAIR on HAT and nocaps-FOIL, and it can localize those hallucinations within the caption. The method is also robust to missing detections and performs well with open-source models.

However, ALOHa has limitations, including non-determinism in large language models, the need for reference captions, and the computational and environmental costs of using LLMs. ALOHa is intended for research purposes, and future iterations aim to address these limitations.
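To make the scoring step concrete, below is a minimal, hypothetical sketch of how an ALOHa-style score could be computed once groundable objects have been extracted from the candidate caption and the references. The object extraction and the embedding function are placeholders (any sentence-embedding model could be substituted); this is an illustration of the similarity-plus-Hungarian-matching idea, not the authors' implementation.

```python
# Hypothetical sketch: match candidate objects to reference objects by
# semantic similarity and flag poorly matched objects as likely hallucinations.
import numpy as np
from scipy.optimize import linear_sum_assignment


def hallucination_scores(candidate_objs, reference_objs, embed, threshold=0.5):
    """Return per-object scores, a caption-level score, and flagged objects.

    `embed` is assumed to map a list of strings to unit-normalized vectors
    (e.g., from any sentence-embedding model); it is not part of the paper.
    """
    cand_vecs = embed(candidate_objs)   # shape (n_cand, d)
    ref_vecs = embed(reference_objs)    # shape (n_ref, d)
    sim = cand_vecs @ ref_vecs.T        # cosine similarities (unit vectors)

    # Hungarian matching: assign each candidate object to at most one
    # reference object so that total similarity is maximized.
    rows, cols = linear_sum_assignment(-sim)

    per_object = {obj: 0.0 for obj in candidate_objs}
    for r, c in zip(rows, cols):
        per_object[candidate_objs[r]] = float(sim[r, c])

    # Objects whose best match is below the threshold are treated as
    # hallucinated; the caption score is the worst per-object score.
    flagged = [obj for obj, s in per_object.items() if s < threshold]
    caption_score = min(per_object.values()) if per_object else 1.0
    return per_object, caption_score, flagged
```

In this sketch, a low caption-level score indicates that at least one candidate object could not be matched to any reference object, which is the behavior the summary describes for detecting objects that are not present in the image. The threshold value of 0.5 is an assumption for illustration only.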