Adversarial Examples for Evaluating Reading Comprehension Systems


23 Jul 2017 | Robin Jia, Percy Liang
The paper "Adversarial Examples for Evaluating Reading Comprehension Systems" by Robin Jia and Percy Liang from Stanford University proposes a new evaluation method for assessing the true language understanding capabilities of reading comprehension systems. Standard accuracy metrics often fail to capture the deeper understanding required in language processing tasks. To address this, the authors introduce adversarial evaluation, which tests systems on paragraphs containing adversarially inserted sentences designed to distract the system without changing the correct answer or misleading humans.

The evaluation scheme is applied to the Stanford Question Answering Dataset (SQuAD), a dataset of reading comprehension questions about Wikipedia articles. The authors generate adversarial examples by adding distracting sentences to the input paragraphs, ensuring that these sentences do not contradict the correct answer while remaining semantically similar to the question. Experiments on a range of published models show that accuracy drops significantly when these adversarially modified inputs are used. Key findings include:

- The average F1 score of 16 published models drops from 75% to 36% when adversarial sentences are added.
- Allowing the adversary to add ungrammatical sequences of words further reduces average accuracy to 7%.
- Human evaluation shows that humans are far less affected by the adversarial examples, suffering only a minor drop in accuracy.

The authors argue that these results highlight the need for more sophisticated models that understand language more precisely, and they release their code and data to encourage further research in this area.
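To make the evaluation procedure concrete, below is a minimal sketch (not the authors' released code) of the adversarial evaluation loop described above: append a distractor sentence to each paragraph and compare the model's F1 score before and after. The `predict(paragraph, question)` call and the `make_distractor` helper are hypothetical placeholders for a SQuAD model and a distractor generator, and the token-overlap F1 here is a simplified version of the official SQuAD metric (it skips article removal).

```python
# Sketch of adversarial evaluation for a reading comprehension model.
# Assumes a hypothetical `predict` function and `make_distractor` helper.
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens (simplified SQuAD-style)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def f1_score(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer span and a gold answer span."""
    pred_toks, gold_toks = normalize(prediction), normalize(gold)
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def adversarial_eval(examples, predict, make_distractor):
    """Average F1 on original vs. distractor-augmented paragraphs.

    examples        -- iterable of (paragraph, question, gold_answer) triples
    predict         -- hypothetical model call: (paragraph, question) -> answer string
    make_distractor -- returns a sentence that resembles the question but does
                       not contradict the gold answer (AddSent-style distractor)
    """
    clean_scores, adv_scores = [], []
    for paragraph, question, gold in examples:
        # Score on the original paragraph.
        clean_scores.append(f1_score(predict(paragraph, question), gold))
        # Append the distractor sentence and score again; the gold answer is unchanged.
        attacked = paragraph + " " + make_distractor(question, gold)
        adv_scores.append(f1_score(predict(attacked, question), gold))
    n = max(len(clean_scores), 1)
    return sum(clean_scores) / n, sum(adv_scores) / n
```

In the paper's AddSent adversary, the distractor is built by perturbing the question (for example, swapping in nearby entities or antonyms) and converting it into a declarative sentence, with crowdworkers filtering out ungrammatical or answer-changing candidates; the AddAny variant instead searches over arbitrary word sequences, which is the setting that drives average accuracy down to 7%.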