QUANTEMP: A real-world open-domain benchmark for fact-checking numerical claims


July 14-18, 2024 | Venktesh V, Abhijit Anand, Avishek Anand, Vinay Setty
QUANTEMP is a new benchmark for fact-checking numerical claims, consisting of 15,514 real-world claims drawn from a diverse set of fact-checking domains and covering statistical, comparative, interval, and temporal aspects. Each claim comes with detailed metadata and evidence collected from the web. The benchmark addresses a gap left by prior work, which focuses mainly on synthetic claims: real-world numerical claims are complex, often lack precise information, and are therefore hard to verify. QUANTEMP is designed to evaluate fact-checking systems end to end, from retrieving evidence to predicting veracity, and is benchmarked with several methods that combine claim decomposition with natural language inference (NLI) models. The best baseline reaches a macro-F1 of 58.32, showing that QUANTEMP is a challenging evaluation set for numerical claim verification.
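To make the retrieve-then-verify setup concrete, the sketch below runs a claim through a toy evidence retriever and an off-the-shelf NLI model. It is a minimal illustration only: the TF-IDF retriever, the microsoft/deberta-large-mnli checkpoint, and the evidence snippets are assumptions chosen for the example, not the retriever, models, or data used in the paper.

```python
# Minimal sketch of a retrieve-then-verify pipeline for one numerical claim.
# Retriever, model, and evidence below are illustrative stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# Hypothetical snippets standing in for web-collected evidence.
evidence = [
    "The national unemployment rate fell to 3.9 percent in April.",
    "Government figures show GDP grew by 2.1 percent last year.",
    "The benchmark lists 15,514 claims collected from fact-checking sites.",
]
claim = "Unemployment dropped below 4 percent in April."

# Step 1: retrieve the top-k evidence sentences for the claim
# (TF-IDF here; the paper retrieves evidence from the open web).
vectorizer = TfidfVectorizer().fit(evidence + [claim])
scores = cosine_similarity(vectorizer.transform([claim]),
                           vectorizer.transform(evidence))[0]
top_k = [evidence[i] for i in scores.argsort()[::-1][:2]]

# Step 2: score (evidence, claim) pairs with a generic NLI model and keep
# the highest-confidence label as the verdict for this toy example.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")
preds = [nli({"text": ev, "text_pair": claim})[0] for ev in top_k]
verdict = max(preds, key=lambda p: p["score"])
print(verdict)  # e.g. {'label': 'ENTAILMENT', 'score': ...}
```

In the paper's stronger baselines, the claim is first decomposed into simpler sub-claims before this kind of evidence matching and NLI step, and the NLI component is specialized for numerical understanding.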
Experiments on QUANTEMP yield several findings. Claim decomposition improves verification performance, particularly for claims labeled 'Conflicting'. Among NLI models, those pre-trained for numerical understanding outperform generic models, and across claim categories, models with explicit number understanding beat models trained on language tasks alone. Larger models improve performance when fine-tuned, but not necessarily in few-shot or zero-shot settings. Error analysis of the full fact-checking pipeline shows that 'Conflicting' claims remain the most challenging category.
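Since results are reported as macro-F1 over the verdict classes, the short sketch below shows how that number is computed; the three-way True / False / Conflicting label scheme is taken from the benchmark, while the gold and predicted labels are made up for illustration.

```python
# Sketch of the macro-F1 evaluation over verdict labels.
# Gold/predicted labels below are invented for illustration only.
from sklearn.metrics import f1_score

gold = ["True", "False", "Conflicting", "False", "True", "Conflicting"]
pred = ["True", "False", "False",       "False", "True", "Conflicting"]

# Macro-F1 averages the per-class F1 scores, so the harder, rarer
# 'Conflicting' class counts as much as the majority classes.
print(f1_score(gold, pred, average="macro",
               labels=["True", "False", "Conflicting"]))
```

Because macro averaging weights every class equally, weak performance on 'Conflicting' claims pulls the reported score down, which is consistent with the error analysis above.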