The paper addresses the challenge of detecting Audio Language Model (ALM)-based deepfake audio, which spreads widely, is highly deceptive, and is versatile across tasks. To tackle this issue, the authors propose the Codefake dataset, an open-source, large-scale dataset containing over 1 million audio samples in two languages, designed specifically for ALM-based audio detection. The dataset includes a range of test conditions for evaluating deepfake detection models. Additionally, the authors introduce the Co-training Sharpness Aware Minimization (CSAM) strategy to improve the generalizability of deepfake detection models. Experiments show that models trained on the Codefake dataset achieve a significantly lower average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models. The paper also discusses limitations and future directions, including the need for more diverse audio types and further improvements on non-speech tasks.
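For context on the reported metric: the Equal Error Rate is the operating point at which the false acceptance rate (fake audio accepted as real) equals the false rejection rate (real audio rejected as fake). The following is a minimal illustrative sketch of how EER can be computed from detector scores; it is not the paper's evaluation code, and the score convention (higher score = more likely real) is an assumption.

```python
def compute_eer(real_scores, fake_scores):
    """Estimate the Equal Error Rate from detector scores.

    Assumes higher scores indicate "real" audio. Sweeps every observed
    score as a decision threshold and returns the mean of FAR and FRR
    at the threshold where their gap is smallest.
    """
    thresholds = sorted(set(real_scores) | set(fake_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # FAR: fraction of fakes scoring at or above the threshold (accepted as real)
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        # FRR: fraction of real samples scoring below the threshold (rejected)
        frr = sum(s < t for s in real_scores) / len(real_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With perfectly separable scores the EER is 0; an average EER of 0.616%, as reported in the paper, means the detector's FAR and FRR cross at roughly 0.6% error.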