FEVER: a large-scale dataset for Fact Extraction and VERification


18 Dec 2018 | James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal
This paper introduces FEVER, a publicly available dataset for fact extraction and verification against textual sources. The dataset consists of 185,445 claims generated by altering sentences extracted from Wikipedia and verified without knowledge of the original sentence. Each claim is classified as SUPPORTED, REFUTED, or NOTENOUGHINFO, with an inter-annotator agreement of 0.6841 (Fleiss $\kappa$). For the first two classes, annotators also recorded the sentences forming the evidence. To gauge the difficulty of the dataset, a pipeline approach is developed and compared against oracle components. The best accuracy when a claim must be labeled with the correct evidence is 31.87%; when the evidence requirement is ignored, accuracy rises to 50.91%. FEVER is designed to stimulate progress in claim verification against textual sources.

The dataset was constructed in two stages: claim generation and claim labeling. In claim generation, annotators extracted information from Wikipedia and generated claims; in claim labeling, annotators classified each claim as SUPPORTED, REFUTED, or NOTENOUGHINFO and provided the necessary evidence. Annotation guidelines and purpose-built user interfaces ensured consistency, and data validation through super-annotators and manual checks confirmed the quality of the annotations.

The baseline system consists of three components: document retrieval, sentence selection, and textual entailment. Document retrieval uses DrQA to find the documents nearest to the claim, sentence selection ranks sentences by TF-IDF similarity to the claim, and textual entailment uses a decomposable attention model. On the development set, the pipeline achieves 31.87% accuracy when correct evidence is required and 50.91% when it is ignored. Ablation studies show that the sentence selection module is crucial for performance.

FEVER has potential applications in claim extraction and verification systems, and the authors discuss future extensions and use cases.
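The sentence selection step described above, which ranks candidate sentences by TF-IDF similarity to the claim, can be sketched in plain Python. This is an illustrative sketch rather than the baseline's actual code: the paper's system builds on DrQA's TF-IDF machinery, and the whitespace tokenization and unigram weighting used here are simplifying assumptions.

```python
import math
from collections import Counter

def tfidf_rank(claim, sentences, k=5):
    """Rank candidate sentences by TF-IDF cosine similarity to the claim.

    Minimal stdlib sketch of the baseline's sentence-selection step; the
    actual system uses DrQA-style TF-IDF, so weighting details differ.
    """
    # Naive whitespace tokenization (a simplifying assumption).
    docs = [claim.lower().split()] + [s.lower().split() for s in sentences]
    n = len(docs)

    # Inverse document frequency computed over the claim + candidates.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}

    def vec(doc):
        tf = Counter(doc)
        return {t: tf[t] * idf[t] for t in tf}

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    cvec = vec(docs[0])
    scored = [(s, cosine(cvec, vec(d))) for s, d in zip(sentences, docs[1:])]
    return [s for s, _ in sorted(scored, key=lambda p: -p[1])[:k]]
```

The same idea extends to bigrams or to DrQA's hashed feature space; only the term extraction changes, not the ranking logic.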
They believe FEVER will provide a challenging testbed for advancing these systems.
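The two headline numbers (31.87% when correct evidence is required, 50.91% when it is ignored) reflect a strict scoring rule: a prediction counts only if the label is correct and, for verifiable claims, the retrieved sentences cover at least one complete annotated evidence set. A hedged sketch of that per-claim check, assuming evidence is represented as (page, sentence_id) pairs; the official scorer's exact data format may differ:

```python
def strict_score(predicted_label, predicted_evidence,
                 gold_label, gold_evidence_sets):
    """Return 1 if one claim is correct under the strict condition, else 0.

    Sketch of the evidence-aware accuracy described in the paper; the
    (page, sentence_id) representation is an assumption for illustration.
    """
    if predicted_label != gold_label:
        return 0
    # NOTENOUGHINFO claims have no evidence requirement.
    if gold_label == "NOTENOUGHINFO":
        return 1
    predicted = set(predicted_evidence)
    # Correct if any single annotated evidence set is fully retrieved.
    return int(any(set(gold) <= predicted for gold in gold_evidence_sets))
```

Averaging this check over the development set yields the evidence-required accuracy; dropping the evidence condition and comparing labels alone yields the higher, evidence-ignoring figure.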