23 Aug 2020 | Qiuqiang Kong, Student Member, IEEE, Yin Cao, Member, IEEE, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Senior Member, IEEE and Mark D. Plumbley, Fellow, IEEE
The paper introduces Pretrained Audio Neural Networks (PANNs) for audio pattern recognition, trained on the large-scale AudioSet dataset. The authors train several convolutional neural network (CNN) architectures and evaluate them on AudioSet tagging, achieving a mean average precision (mAP) of 0.439, which surpasses previous state-of-the-art systems. They also propose a Wavegram-Logmel-CNN architecture that combines log-mel spectrogram and raw-waveform inputs to further improve performance. PANNs are then transferred to other audio tasks, including acoustic scene classification, music classification, and speech emotion classification, reaching state-of-the-art performance in several cases. The paper analyzes the trade-off between performance and computational complexity, and provides detailed experimental results with comparisons to previous methods. The source code and pretrained models are released for further research.
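For context, the mAP metric used to evaluate AudioSet tagging is the per-class average precision, averaged over all classes. The following is a minimal illustrative sketch in pure Python, not the authors' evaluation code; the toy scores and labels are made up.

```python
# Hedged sketch: mean average precision (mAP) for multi-label audio
# tagging. Each clip gets one score per class; AP is computed per
# class over all clips, then averaged across classes.

def average_precision(scores, labels):
    """Non-interpolated average precision for one class.
    scores: predicted confidences; labels: 0/1 ground truth."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Average precision at each positive's rank; 0 if no positives.
    return sum(precisions) / max(hits, 1)

def mean_average_precision(score_matrix, label_matrix):
    """mAP = mean of per-class APs (inputs: clips x classes)."""
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        col_scores = [row[c] for row in score_matrix]
        col_labels = [row[c] for row in label_matrix]
        aps.append(average_precision(col_scores, col_labels))
    return sum(aps) / num_classes

# Toy example: 4 clips, 2 classes (values are illustrative only)
scores = [[0.9, 0.2], [0.6, 0.8], [0.3, 0.4], [0.1, 0.7]]
labels = [[1, 0], [1, 1], [0, 1], [0, 0]]
print(round(mean_average_precision(scores, labels), 3))  # → 0.917
```

In the paper's setting the matrices would span the AudioSet evaluation set and its 527 classes; the reported 0.439 mAP is the mean over those classes.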