18 Mar 2024 | Miriam Wanner*, Seth Ebner*, Zhengping Jiang, Mark Dredze, Benjamin Van Durme
This paper investigates how different methods of claim decomposition affect the evaluation of textual support, particularly in the context of the FACTSCORE metric. The authors find that the decomposition method used significantly influences the results of such evaluations. This is because the metric attributes textual support to the model that generated the text, even though errors can also arise from the decomposition process. To address this, the authors introduce DECOMPSCORE, an adaptation of FACTSCORE that measures decomposition quality. They then propose an LLM-based approach to generating decompositions inspired by Bertrand Russell's theory of logical atomism and neo-Davidsonian semantics, which demonstrates improved decomposition quality over previous methods.
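To make the distinction between the two metrics concrete, here is a minimal sketch of how they could be computed. It is not the authors' implementation: `decompose`, `supported_by_source`, and `entailed_by` are hypothetical callables standing in for the decomposition prompt, the retrieval-backed fact verifier, and the entailment check, and the paper's exact formulation may differ in detail.

```python
from typing import Callable, Iterable, List

def factscore(
    texts: Iterable[str],
    decompose: Callable[[str], List[str]],
    supported_by_source: Callable[[str], bool],
) -> float:
    """FACTSCORE-style factual precision: the fraction of decomposed subclaims
    judged supported by an external knowledge source (e.g., Wikipedia).
    Note that errors introduced by `decompose` are charged to the generator."""
    subclaims = [claim for text in texts for claim in decompose(text)]
    if not subclaims:
        return 0.0
    return sum(supported_by_source(claim) for claim in subclaims) / len(subclaims)

def decompscore(
    sentences: Iterable[str],
    decompose: Callable[[str], List[str]],
    entailed_by: Callable[[str, str], bool],
) -> float:
    """DECOMPSCORE-style decomposition quality: the average number of subclaims
    per sentence that are entailed by the sentence they were decomposed from."""
    sentences = list(sentences)
    if not sentences:
        return 0.0
    supported = sum(
        entailed_by(claim, sentence)
        for sentence in sentences
        for claim in decompose(sentence)
    )
    return supported / len(sentences)
```

The key difference the sketch highlights is the reference used for support: FACTSCORE checks subclaims against an external knowledge source, while DECOMPSCORE checks them against the sentence they were decomposed from, isolating errors made by the decomposition step itself.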
The paper examines three families of claim decomposition methods: LLM prompting, shallow semantic parsing, and LLM prompting informed by a semantic parse. The authors find that their LLM-based method inspired by Russell and neo-Davidsonian semantics produces more subclaims while staying consistent with the original claim, which makes the downstream evaluation more trustworthy. They also show that the choice of decomposition method changes the results of downstream metrics such as FACTSCORE, so decomposition quality is crucial for accurate evaluation.
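As a rough illustration of the prompting-based family, the sketch below shows what a generic decomposition prompt might look like; the neo-Davidsonian flavor comes from asking for one predicate-argument relation per subclaim. The template and `call_llm` are hypothetical and are not the authors' prompt; the parse-informed variant would presumably also include a semantic parse of the sentence in the prompt.

```python
# Illustrative only -- not the prompt used in the paper. `call_llm` is a
# hypothetical wrapper around whatever chat-completion API is available.

DECOMPOSITION_PROMPT = """\
Break the following sentence into minimal, self-contained subclaims.
Each subclaim should express a single predicate-argument relation
(who did what, to whom, when, where), so that each one can be
verified independently. Output one subclaim per line.

Sentence: {sentence}
Subclaims:
"""

def decompose_with_llm(sentence: str, call_llm) -> list[str]:
    """Prompt an LLM for atomic subclaims and return them as a list of strings."""
    response = call_llm(DECOMPOSITION_PROMPT.format(sentence=sentence))
    return [line.lstrip("- ").strip() for line in response.splitlines() if line.strip()]
```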
The authors evaluate the decomposition methods on a dataset of biographies generated by 12 language models. The method inspired by Russell and neo-Davidsonian semantics achieves the highest DECOMPSCORE, meaning it yields the most supported subclaims, although it results in lower factual precision than other methods. The paper concludes that while the choice of decomposition method affects downstream metrics, decomposition quality is essential for accurate evaluation. The authors also note limitations of current evaluation methods, including their reliance on LLM-generated decompositions and the potential for hallucinations in LLMs.