Look, Listen and Learn


1 Aug 2017 | Relja Arandjelović† and Andrew Zisserman†,*
This paper introduces the Audio-Visual Correspondence (AVC) learning task, which exploits the natural correlation between the visual and audio streams of unlabelled videos to train both a visual and an audio network from scratch. The goal is to learn semantic representations in an unsupervised manner, simply by watching and listening to a large number of unlabelled videos. The AVC task itself is binary classification: given a pair consisting of a video frame and a short audio clip, the network must decide whether the two correspond, i.e. whether they were taken from the same video at the same time.

The proposed network, called L³-Net, is trained on two video datasets, Flickr-SoundNet and Kinetics-Sounds. It outperforms supervised baselines on the AVC task, and the representations it learns transfer well: the audio features achieve state-of-the-art performance on the ESC-50 and DCASE sound classification benchmarks, outperforming features trained for audio recognition with visual supervision, while the visual features perform on par with the best self-supervised approaches on ImageNet classification.

A qualitative analysis of what the network has learned shows that it acquires fine-grained visual and audio concepts, such as distinguishing between different musical instruments or recognizing specific actions, and that it can localize the relevant objects in both modalities. The visual features capture semantic concepts such as scenes, objects, and human-related actions, and the audio features recognize and categorize a wide range of audio events. The paper concludes that the L³-Net is a promising approach for learning meaningful visual and audio representations from unlabelled videos without any supervision.
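To make the AVC setup concrete, the sketch below shows a simplified two-stream network and a toy training step in PyTorch. The layer sizes, spectrogram dimensions, and the random "pair" data are illustrative placeholders rather than the paper's exact VGG-style architecture; only the overall structure follows the paper: a vision subnetwork over a single frame, an audio subnetwork over a short log-spectrogram, feature fusion, and a 2-way correspond/mismatch classifier, with negatives formed by pairing a frame with audio from a different video.

```python
# Minimal sketch of the Audio-Visual Correspondence (AVC) setup.
# Layer sizes are illustrative, not the paper's exact VGG-style design.
import torch
import torch.nn as nn

class AVCNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Vision subnetwork: a single video frame -> embedding.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 128),
        )
        # Audio subnetwork: a ~1 s log-spectrogram -> embedding.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 128),
        )
        # Fusion layers: concatenated embeddings -> 2-way correspond/mismatch logits.
        self.fusion = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, frame, spectrogram):
        v = self.vision(frame)
        a = self.audio(spectrogram)
        return self.fusion(torch.cat([v, a], dim=1))

# Toy training step on randomly generated "pairs": in the real task, label 1
# means the frame and audio come from the same video at the same time, and
# label 0 means the audio is taken from a different, randomly chosen video.
model = AVCNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

frames = torch.randn(8, 3, 224, 224)        # batch of video frames
spectrograms = torch.randn(8, 1, 257, 200)  # batch of log-spectrograms
labels = torch.randint(0, 2, (8,))          # 1 = corresponding, 0 = mismatched

logits = model(frames, spectrograms)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The key point is that the correspondence label comes for free from the videos themselves, so the whole system can be trained end-to-end without any manual annotation.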
[slides and audio] Look, Listen and Learn