25 Sep 2020 | Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
The paper introduces REALTOXICITYPROMPTS, a dataset of 100,000 naturally occurring prompts drawn from English web text, each paired with toxicity scores from a widely used toxicity classifier. The study investigates how pretrained language models (LMs) can produce toxic text even from seemingly innocuous prompts, and evaluates how well controllable text generation methods prevent such toxic degeneration.

The results show that pretrained LMs can generate highly toxic text even when given non-toxic prompts. Data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) steer generations away from toxicity more effectively than simpler solutions (e.g., banning "bad" words), but no current method, whether data-based or decoding-based, is completely effective: detoxification reduces toxic behavior without eliminating it.

The study also analyzes two web text corpora used to pretrain several LMs and finds non-negligible amounts of offensive, factually unreliable, and otherwise toxic content, concluding that the choice of pretraining data is crucial for avoiding toxic degeneration. The findings highlight the difficulty of avoiding toxicity in natural language generation and the need for better data selection processes for pretraining. The paper calls for more research into toxicity detection and control, and for more transparent and ethical practices in the collection and use of pretraining data.
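To make the "banning bad words" baseline concrete, below is a minimal sketch of decoding-based word filtering using Hugging Face `transformers` with GPT-2. This is not the authors' exact implementation; the banned-word list and the prompt are placeholders, and the paper's setup uses a standard profanity lexicon and nucleus sampling.

```python
# Sketch of a decoding-based "word banning" detoxification baseline (assumed setup,
# not the paper's exact code): suppress banned tokens at every decoding step.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

banned_words = ["damn", "idiot"]  # placeholder list; swap in a full profanity lexicon
# Encode each banned word with and without a leading space,
# since GPT-2's BPE tokenization is whitespace-sensitive.
bad_words_ids = [
    tokenizer(variant, add_special_tokens=False).input_ids
    for word in banned_words
    for variant in (word, " " + word)
]

prompt = "The crowd started yelling because"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Nucleus sampling with the banned token sequences blocked during generation.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    max_new_tokens=20,
    bad_words_ids=bad_words_ids,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As the paper notes, this kind of surface-level filtering only blocks an explicit list of strings; the model can still produce toxic content through words outside the list, which is why the authors find it less effective than data-intensive approaches such as adaptive pretraining on non-toxic text.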