19 May 2019 | Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi
The paper "HellaSwag: Can a Machine Really Finish Your Sentence?" by Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi explores the limits of natural language inference (NLI) tasks, particularly in the context of commonsense reasoning. The authors introduce HellaSwag, a new benchmark dataset designed to challenge state-of-the-art models in NLI. Despite the advancements in models like BERT, which achieved near-human performance on the SWAG dataset, HellaSwag reveals that even these models struggle with commonsense inference.
The authors use Adversarial Filtering (AF), a data collection method that iteratively selects adversarial wrong answers to build a challenging dataset: wrong endings that a trained discriminator easily rules out are repeatedly replaced with harder machine-generated ones. This yields a dataset that is trivial for humans yet difficult for models. The resulting HellaSwag dataset consists of 70k problems on which human accuracy exceeds 95% while state-of-the-art model accuracy falls below 50%.
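The AF loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: in the paper the generator is a language model (e.g. GPT) producing candidate endings and the discriminator is a trained classifier (e.g. BERT); here both are hypothetical stubs so the filtering loop itself is runnable.

```python
import random

def generate_candidate(context, rng):
    """Stand-in for an LM that proposes a machine-written wrong ending
    (hypothetical; the paper uses a neural text generator)."""
    return f"{context} ... generated ending #{rng.randint(0, 10**6)}"

def train_discriminator(problems):
    """Stand-in for retraining a real/fake classifier on the current
    dataset. Returns a scoring function where higher = 'looks real';
    here it is just a random scorer for illustration."""
    rng = random.Random(0)
    return lambda ending: rng.random()

def adversarial_filter(problems, n_rounds=5, seed=0):
    """Each problem: {'context': str, 'gold': str, 'wrong': [str, ...]}.
    Each round, retrain the discriminator, then replace any wrong ending
    it confidently spots as fake (an 'easy' negative) with a fresh
    machine-generated candidate, so surviving negatives are adversarial."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        score_real = train_discriminator(problems)
        for p in problems:
            p['wrong'] = [
                w if score_real(w) > 0.5          # fools the model: keep it
                else generate_candidate(p['context'], rng)  # easy: replace
                for w in p['wrong']
            ]
    return problems

problems = [{'context': 'A man is sitting on a roof. He',
             'gold': 'starts pulling up roofing on a roof.',
             'wrong': ['obviously fake ending'] * 3}]
filtered = adversarial_filter(problems)
```

The key design point is that only the wrong answers change; the human-written gold ending is untouched, which is why human accuracy stays high while model accuracy collapses.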
The paper also discusses the importance of high-quality generators and discriminators in creating adversarial datasets. It highlights that while BERT performed well on SWAG, it still struggles on HellaSwag, suggesting that the underlying task of commonsense NLI remains unsolved. The authors argue that benchmarks must evolve with the state-of-the-art to remain challenging and that future progress in NLP may require significant computational resources and algorithmic improvements.
Overall, the paper provides insights into the limitations of current deep learning models and suggests a path for future research in NLP, emphasizing the need for benchmarks that co-evolve with the models they are designed to evaluate.