18 Dec 2018 | James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal
FEVER is a large-scale dataset for fact extraction and verification, containing 185,445 claims generated from Wikipedia sentences and verified by annotators. The claims are classified as SUPPORTED, REFUTED, or NOTENOUGHINFO, with a Fleiss κ of 0.6841; for the first two classes, annotators also provide the evidence sentences supporting their decision. The dataset is challenging because verification often requires combining evidence from multiple sources: 16.82% of claims need evidence from more than one sentence and 12.15% from more than one page. Annotation guidelines and interfaces were developed to ensure consistency, and the collected evidence achieves 95.42% precision and 72.36% recall.

To evaluate the dataset, a pipeline baseline combining document retrieval, sentence selection, and textual entailment was developed. It achieves 31.87% accuracy when correct evidence is required and 50.91% when it is not, and oracle experiments show that selecting evidence sentences is the most challenging part of the task. These results indicate that the task is challenging but feasible. Beyond verification, the dataset supports related applications such as claim extraction and verification against different textual sources. The dataset, annotation interfaces, and baseline system are publicly available to stimulate further research, and the authors conclude that FEVER provides a valuable resource for advancing fact verification systems.
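To make the two headline numbers concrete, here is a minimal sketch of how accuracy with and without the evidence requirement could be scored. The record layout and field names (label, evidence, evidence_sets) are illustrative assumptions, not the official FEVER scorer's schema.

```python
from typing import List

# Assumed simplified record format (hypothetical, for illustration only):
#   gold:      {"label": str, "evidence_sets": [ {(page, sent_id), ...}, ... ]}
#   predicted: {"label": str, "evidence": {(page, sent_id), ...}}

def label_accuracy(gold: List[dict], predicted: List[dict]) -> float:
    """Accuracy when correct evidence is NOT required (the 50.91% setting)."""
    correct = sum(g["label"] == p["label"] for g, p in zip(gold, predicted))
    return correct / len(gold)

def strict_accuracy(gold: List[dict], predicted: List[dict]) -> float:
    """Accuracy when correct evidence IS required (the 31.87% setting).

    A prediction counts only if the label matches and, for SUPPORTED or
    REFUTED claims, the predicted sentences cover at least one complete
    gold evidence set.
    """
    correct = 0
    for g, p in zip(gold, predicted):
        if g["label"] != p["label"]:
            continue
        if g["label"] == "NOTENOUGHINFO":
            correct += 1  # no evidence is required for this class
            continue
        if any(evidence_set <= p["evidence"] for evidence_set in g["evidence_sets"]):
            correct += 1
    return correct / len(gold)
```

Under this scoring, a system that labels claims correctly but retrieves the wrong sentences is penalised, which is why the evidence-required accuracy (31.87%) sits well below plain label accuracy (50.91%).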