Look, Listen and Learn

1 Aug 2017 | Relja Arandjelović† and Andrew Zisserman†,*
The paper introduces a novel "Audio-Visual Correspondence (AVC)" learning task, which aims to train visual and audio networks simultaneously to predict whether a video frame corresponds to a sound clip. The AVC task leverages the co-occurrence of visual and audio events in unlabelled videos, without requiring explicit supervision. The authors train the networks from scratch using only raw, unconstrained videos, and demonstrate that this approach successfully solves the AVC task and results in high-quality visual and audio representations. These representations achieve state-of-the-art performance on sound classification benchmarks and comparable performance to self-supervised approaches on ImageNet classification. The network also shows the ability to localize objects in both modalities and perform fine-grained recognition tasks. The paper discusses the architecture of the network, training details, and evaluates its performance on various datasets, including Flickr-SoundNet and Kinetics-Sounds. Qualitative analysis reveals that the network learns semantic concepts in both visual and audio modalities, such as distinguishing between different musical instruments and recognizing human-related concepts. The results highlight the potential of using unlabelled videos for learning visual and audio representations.
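
To make the AVC setup concrete, below is a minimal sketch of the two-stream idea in PyTorch: a vision subnetwork embeds a single frame, an audio subnetwork embeds a short log-spectrogram, the two embeddings are fused, and a binary head predicts "correspond" vs. "mismatch". The class and function names (`AVCNet`, `avc_loss`), layer sizes, and the negative-sampling trick (rolling the audio batch) are illustrative assumptions, not the paper's exact L3-Net configuration, which uses deeper VGG-style subnetworks on 224x224 frames and 1-second spectrograms.

```python
import torch
import torch.nn as nn


class AVCNet(nn.Module):
    """Sketch of an audio-visual correspondence network (not the exact L3-Net)."""

    def __init__(self, embed_dim=128):
        super().__init__()
        # Vision subnetwork: small conv stack over an RGB frame (3 x H x W).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Audio subnetwork: small conv stack over a 1-channel log-spectrogram.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Fusion head: concatenated embeddings -> 2-way correspondence logits.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, frame, spectrogram):
        v = self.vision(frame)
        a = self.audio(spectrogram)
        return self.head(torch.cat([v, a], dim=1))


def avc_loss(model, frames, audio):
    """Correspondence loss for one batch of (frame, audio) pairs.

    Positives pair a frame with the audio from the same moment of the same
    video; negatives pair it with audio from a different video (here simply
    the batch rolled by one). No labels beyond this automatic pairing are used.
    """
    criterion = nn.CrossEntropyLoss()
    pos_logits = model(frames, audio)                        # matching pairs
    neg_logits = model(frames, audio.roll(1, dims=0))        # mismatched pairs
    labels_pos = torch.ones(len(frames), dtype=torch.long)   # 1 = correspond
    labels_neg = torch.zeros(len(frames), dtype=torch.long)  # 0 = mismatch
    return criterion(pos_logits, labels_pos) + criterion(neg_logits, labels_neg)
```

After training on this proxy task, the vision and audio subnetworks can be used on their own as feature extractors, which is how the paper obtains its sound-classification and ImageNet results.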