27 Oct 2016 | Yusuf Aytar*, Carl Vondrick*, Antonio Torralba
The paper "SoundNet: Learning Sound Representations from Unlabeled Video" by Yusuf Aytar, Carl Vondrick, and Antonio Torralba introduces a method to learn rich sound representations using large amounts of unlabeled video data. The authors leverage the natural synchronization between vision and sound to transfer discriminative visual knowledge from established visual recognition models into the sound modality. They propose a deep convolutional network that processes raw audio waveforms and is trained using a student-teacher training procedure. The network is trained on over 2 million unlabeled videos, which are collected from Flickr and cover various natural sounds in everyday situations. The authors demonstrate that their method achieves state-of-the-art accuracy on standard acoustic scene classification datasets, outperforming existing methods by around 10%. Visualizations suggest that the network learns high-level detectors, such as recognizing bird chirps or crowds cheering, even without ground truth labels. The paper also includes an ablation study and multi-modal recognition experiments, further validating the effectiveness of the proposed approach.The paper "SoundNet: Learning Sound Representations from Unlabeled Video" by Yusuf Aytar, Carl Vondrick, and Antonio Torralba introduces a method to learn rich sound representations using large amounts of unlabeled video data. The authors leverage the natural synchronization between vision and sound to transfer discriminative visual knowledge from established visual recognition models into the sound modality. They propose a deep convolutional network that processes raw audio waveforms and is trained using a student-teacher training procedure. The network is trained on over 2 million unlabeled videos, which are collected from Flickr and cover various natural sounds in everyday situations. The authors demonstrate that their method achieves state-of-the-art accuracy on standard acoustic scene classification datasets, outperforming existing methods by around 10%. Visualizations suggest that the network learns high-level detectors, such as recognizing bird chirps or crowds cheering, even without ground truth labels. The paper also includes an ablation study and multi-modal recognition experiments, further validating the effectiveness of the proposed approach.