[slides] CNN architectures for large-scale audio classification

This paper applies Convolutional Neural Networks (CNNs) to large-scale audio classification. The authors use a dataset of 70 million training videos, each tagged with labels from a set of 30,871 categories, to compare a baseline fully connected Deep Neural Network (DNN) against several CNN architectures: AlexNet, VGG, Inception, and ResNet. They investigate how training set size and label vocabulary size affect performance, and they compare the effectiveness of these architectures on the Audio Set Acoustic Event Detection (AED) task.

Key findings include:

1. **Architecture Performance**: All CNNs outperform the baseline fully connected DNN, with Inception and ResNet achieving the best results.
2. **Label Set Size**: Training on larger label sets can improve performance, though the improvement is modest.
3. **Training Set Size**: Increasing the number of videos up to 7 million improves performance for the best-performing ResNet-50 architecture.
4. **AED Task**: Models trained on embeddings from the best-performing CNNs significantly outperform models trained on raw features on the Audio Set dataset, demonstrating the utility of these architectures for AED.

The study highlights the potential of large-scale audio classification with advanced CNN architectures, suggesting that they can capture complex audio patterns and improve performance on tasks like AED.

CNN ARCHITECTURES FOR LARGE-SCALE AUDIO CLASSIFICATION

10 Jan 2017 | Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson