Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification

ACCEPTED NOVEMBER 2016 | Justin Salamon and Juan Pablo Bello
This paper presents a deep convolutional neural network (CNN) architecture for environmental sound classification and explores audio data augmentation as a way to address the challenge of limited labeled data. The proposed CNN, combined with data augmentation, achieves state-of-the-art performance. The study shows that the improvement comes from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN trained without augmentation and a "shallow" dictionary learning model trained with augmentation.

The CNN architecture consists of three convolutional layers, two pooling operations, and two fully connected layers. The input to the network is time-frequency patches (TF-patches) extracted from the log-scaled mel-spectrogram of the audio signal. The model is trained with cross-entropy loss using mini-batch stochastic gradient descent, with dropout and L2 regularization applied to prevent overfitting.

The study also investigates the impact of different audio data augmentations on the model's performance. Four types of augmentation are tested: time stretching, pitch shifting, dynamic range compression, and background noise addition. The results show that the proposed CNN with data augmentation significantly improves classification accuracy compared to training on the original dataset alone. However, some augmentations, such as dynamic range compression and background noise, degrade performance on certain classes, particularly those characterized by continuous sounds.

The study further examines the influence of each augmentation on the model's classification accuracy for each class. Each sound class is affected differently by each augmentation set, which suggests that performance could be further improved by applying class-conditional data augmentation during training.
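The TF-patch input pipeline can be sketched as follows. The paper uses 128-band log-scaled mel-spectrograms sliced into fixed-length patches of roughly three seconds; the exact patch length and hop used below are illustrative assumptions, and the spectrogram itself is stood in for by a random array rather than a real audio feature.

```python
import numpy as np

def extract_tf_patches(log_mel, patch_frames=128, hop_frames=64):
    """Slice fixed-size time-frequency patches from a log-mel spectrogram.

    log_mel: array of shape (n_mels, n_frames), e.g. computed with librosa.
    Returns an array of shape (n_patches, n_mels, patch_frames).
    patch_frames and hop_frames are illustrative, not the paper's exact values.
    """
    n_mels, n_frames = log_mel.shape
    starts = range(0, n_frames - patch_frames + 1, hop_frames)
    patches = [log_mel[:, s:s + patch_frames] for s in starts]
    if not patches:
        return np.empty((0, n_mels, patch_frames))
    return np.stack(patches)

# Example: a dummy 128-band "spectrogram" with 300 frames.
spec = np.random.randn(128, 300)
patches = extract_tf_patches(spec)
print(patches.shape)  # (3, 128, 128): patches start at frames 0, 64, and 128
```

At training time each patch is treated as an independent example with the label of its source clip; at test time, per-patch predictions from one clip are typically aggregated (e.g. averaged) into a clip-level prediction.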
The results indicate that pitch augmentations have the greatest positive impact on performance and are the only augmentation sets that do not have a negative impact on any of the classes. The study concludes that the combination of a deep, high-capacity model and an augmented training set is key to achieving state-of-the-art results in environmental sound classification.
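As a concrete illustration of one of the four augmentations, the sketch below mixes a background-noise recording into a clip at a chosen signal-to-noise ratio. The pure-NumPy mixing and the SNR value are assumptions for illustration only; the other augmentations (time stretching, pitch shifting, dynamic range compression) need dedicated DSP tools such as a phase vocoder and are not reproduced here.

```python
import numpy as np

def mix_background(signal, noise, snr_db):
    """Mix `noise` into `signal` at a target signal-to-noise ratio in dB.

    Both inputs are 1-D float arrays at the same sample rate; the noise is
    tiled or truncated to match the signal length. The SNR is illustrative.
    """
    reps = int(np.ceil(len(signal) / len(noise)))
    noise = np.tile(noise, reps)[: len(signal)]
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain so that 10*log10(sig_power / (gain**2 * noise_power)) == snr_db.
    gain = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + gain * noise

rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
street = rng.standard_normal(8000)                          # shorter noise clip
augmented = mix_background(clip, street, snr_db=6.0)
print(augmented.shape)  # (16000,)
```

Applying such transformations to the raw audio (rather than the spectrogram) and then recomputing features lets the same labeled clip yield several distinct training examples, which is the mechanism behind the accuracy gains the paper reports.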