10 Jan 2017 | Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson
This paper presents the use of Convolutional Neural Networks (CNNs) for large-scale audio classification. The authors evaluated several CNN architectures, including AlexNet, VGG, Inception, and ResNet, on a large dataset of 70 million training videos with 30,871 video-level labels. They found that CNNs perform well on audio classification tasks, and that larger training and label sets improve performance up to a point. A model using embeddings from these classifiers outperformed raw features on the Audio Set Acoustic Event Detection (AED) task.
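The networks are fed log-mel spectrogram patches rather than raw audio. Below is a minimal sketch of such a front end using librosa; the 25 ms window, 10 ms hop, 64 mel bands, and 96-frame (~1 s) patches follow the paper's description, while the file name and log offset are placeholders:

```python
import numpy as np
import librosa

# load a mono waveform at 16 kHz, resampling if necessary
y, sr = librosa.load("clip.wav", sr=16000, mono=True)

# 25 ms analysis windows every 10 ms, 64 mel bands
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=int(0.025 * sr),       # 400-sample (25 ms) window
    hop_length=int(0.010 * sr),  # 160-sample (10 ms) hop
    n_mels=64,
)
log_mel = np.log(mel + 1e-3)     # log compression with a small stabilizing offset

# slice into non-overlapping 96-frame (~1 s) patches for the CNN
patches = [log_mel[:, i:i + 96] for i in range(0, log_mel.shape[1] - 95, 96)]
```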
The dataset, called YouTube-100M, consists of 70 million training videos, 10 million evaluation videos, and 20 million validation videos. Each video is labeled with one or more topic identifiers drawn from a set of 30,871 labels. The authors evaluated several DNN architectures, including fully connected DNNs, AlexNet, VGG, Inception V3, and ResNet-50. Inception V3 and ResNet-50 achieved the best performance, which the authors attribute to their high model capacity and to convolutional units that efficiently capture structure common to images and audio spectrograms.
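As a concrete illustration of how an image architecture gets repurposed for audio, here is a sketch in PyTorch/torchvision (the original work used TensorFlow, so this is only an assumed translation): the single-channel stem, the 96x64 patch shape, and the 30,871-way multi-label head follow the paper; everything else is illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_LABELS = 30871  # video-level label vocabulary size from the paper

model = resnet50(weights=None)
# swap the 3-channel RGB stem for a single-channel spectrogram input
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# multi-label head: one independent logit per label
model.fc = nn.Linear(model.fc.in_features, NUM_LABELS)

# a batch of 96-frame x 64-band log-mel patches
patches = torch.randn(8, 1, 96, 64)
probs = torch.sigmoid(model(patches))  # per-label probabilities
```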
The authors also investigated how the sizes of the training set and label vocabulary affect performance. For the best-performing ResNet-50 architecture, performance kept improving as the number of training videos grew, up to 7 million. They also found that training with the full, much broader label vocabulary can act as a regularizer, improving results even when evaluating on a 400-class subset.
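The regularizing effect of the broad vocabulary enters through the multi-label objective: every example contributes a gradient for all labels at once, not just for the classes of interest. A minimal sketch of that objective, reusing `model` and `patches` from the block above; the positive label indices are hypothetical:

```python
import torch
import torch.nn as nn

# independent binary cross-entropy per label: each video activates
# only a handful of positives out of the full 30,871-label vocabulary
criterion = nn.BCEWithLogitsLoss()

logits = model(patches)            # (batch, 30871)
targets = torch.zeros_like(logits)
targets[0, [17, 4242]] = 1.0       # hypothetical positive labels for video 0
loss = criterion(logits, targets)
loss.backward()
```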
Finally, the authors evaluated the performance of their models on the Audio Set dataset, which contains over 1 million 10-second audio excerpts labeled with a vocabulary of acoustic events. They found that a model using embeddings from their best ResNet model achieved significantly better performance than a baseline model using log-mel patches. This improvement reflects the benefit of the larger YouTube-100M training set embodied in the ResNet classifier outputs.
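In code terms, the embedding comparison amounts to freezing the trained network and fitting a small classifier on its pooled penultimate activations instead of on raw log-mel patches. A sketch under the same PyTorch assumptions as above; the event-class count is a placeholder, not the paper's figure:

```python
import torch
import torch.nn as nn

def embed(model, patches):
    """Pooled penultimate ResNet activations as a fixed-length embedding."""
    trunk = nn.Sequential(*list(model.children())[:-1])  # everything but the fc head
    with torch.no_grad():
        return torch.flatten(trunk(patches), 1)          # (batch, 2048)

NUM_EVENTS = 527  # placeholder acoustic-event class count
aed_head = nn.Linear(2048, NUM_EVENTS)  # shallow classifier over embeddings

embeddings = embed(model, torch.randn(4, 1, 96, 64))
event_probs = torch.sigmoid(aed_head(embeddings))
```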