A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference


19 Feb 2018 | Adina Williams, Nikita Nangia, Samuel R. Bowman
This paper introduces the Multi-Genre Natural Language Inference (MultiNLI) corpus, a large dataset for evaluating machine learning models on sentence understanding. With 433,000 examples, MultiNLI expands on existing resources by covering ten distinct genres of written and spoken English, providing a broader and more challenging testbed for natural language inference (NLI). Unlike the Stanford NLI Corpus (SNLI), whose premises are limited to image captions, MultiNLI draws on diverse text types, making it more representative of real-world language complexity. It also enables evaluation of cross-genre domain adaptation, a critical aspect of NLU research.

The corpus was created by collecting premise sentences from ten diverse text sources, including government reports, letters, travel guides, and fiction. Human annotators then wrote a hypothesis for each premise, ensuring a balanced representation of the three NLI classes: entailment, contradiction, and neutral. The corpus provides both matched (same-genre) and mismatched (cross-genre) evaluation sets, allowing models to be tested on genres seen and unseen during training.

The paper gauges the difficulty of MultiNLI by testing three neural network models: a continuous bag-of-words (CBOW) model, a bidirectional LSTM (BiLSTM) model, and the Enhanced Sequential Inference Model (ESIM). Results show that MultiNLI is significantly more challenging than SNLI, with ESIM achieving the highest accuracy on the test sets. The corpus also demonstrates strong inter-annotator agreement, indicating reliable annotations.

Analysis of the corpus reveals that it contains a wide range of linguistic phenomena, including quantifiers, belief verbs, time terms, discourse markers, and presupposition triggers. These elements make the corpus particularly challenging for models, especially those that struggle with complex syntactic and semantic structures.
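Each MultiNLI item pairs a premise with an annotator-written hypothesis under one of the three labels. A minimal sketch of that structure in Python (the field names and example sentences below are illustrative conventions, not taken from the corpus itself):

```python
from dataclasses import dataclass
from collections import Counter

LABELS = ("entailment", "contradiction", "neutral")

@dataclass
class NLIExample:
    """One premise-hypothesis pair with its gold label."""
    premise: str
    hypothesis: str
    label: str

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"label must be one of {LABELS}")

# Illustrative sentences (not drawn from MultiNLI), one per class.
examples = [
    NLIExample("The tourists visited the museum on Sunday.",
               "People went to a museum.", "entailment"),
    NLIExample("The tourists visited the museum on Sunday.",
               "Nobody visited the museum.", "contradiction"),
    NLIExample("The tourists visited the museum on Sunday.",
               "The tourists enjoyed the exhibits.", "neutral"),
]

# A balanced collection has roughly equal counts per label.
counts = Counter(e.label for e in examples)
```

The balance check mirrors the paper's goal of an even class distribution across the three labels.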
The results suggest that MultiNLI provides a valuable benchmark for evaluating NLU models and their ability to generalize across different domains and linguistic phenomena. The corpus is available for research and development, with a focus on advancing methods for sentence understanding and cross-domain adaptation.
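The CBOW baseline mentioned above represents each sentence as the average of its word embeddings, then combines the premise and hypothesis vectors into features for a classifier. A minimal sketch (toy random embeddings stand in for the pretrained vectors used in the paper, and the concatenation/difference/product feature combination is a common NLI convention, assumed here rather than quoted from the paper):

```python
import random

DIM = 8  # small embedding size for the sketch

def cbow_encode(tokens, embeddings):
    """Average the word vectors of in-vocabulary tokens (CBOW sentence encoding)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def pair_features(p, h):
    """Combine premise and hypothesis vectors: concatenation,
    elementwise difference, and elementwise product."""
    diff = [a - b for a, b in zip(p, h)]
    prod = [a * b for a, b in zip(p, h)]
    return p + h + diff + prod

# Toy random embeddings for illustration only.
rng = random.Random(0)
vocab = {w: [rng.gauss(0, 1) for _ in range(DIM)]
         for w in "the tourists visited a museum people went to".split()}

p = cbow_encode("the tourists visited the museum".split(), vocab)
h = cbow_encode("people went to a museum".split(), vocab)
feats = pair_features(p, h)  # length 4 * DIM; fed to a downstream classifier
```

In the full baseline, these fixed-size features would be passed to a small feed-forward classifier that predicts one of the three NLI labels.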