SYNCHFORMER: EFFICIENT SYNCHRONIZATION FROM SPARSE CUES

29 Jan 2024 | Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
This paper introduces Synchformer, an audio-visual synchronization model that achieves state-of-the-art performance in both dense and sparse settings. The model targets 'in-the-wild' videos, such as those found on YouTube, where synchronization cues are sparse. The key contributions are a new synchronization model and a training approach that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. The approach is further scaled to AudioSet, a million-scale 'in-the-wild' dataset, and is augmented with evidence attribution techniques for interpretability and with the ability to predict synchronizability, i.e. whether the audio and visual streams can be synchronized at all.

Synchformer is trained in two stages. In the first stage, the audio and visual feature extractors are pre-trained with a segment-level contrastive objective: the input video is split into shorter temporal segments, features are extracted per segment and aggregated along the spatial and frequency dimensions, and corresponding audio and visual segments are aligned in a shared embedding space. In the second stage, a lightweight synchronization module is trained on top of the frozen feature extractors to predict the temporal offset between the audio and visual streams, formulated as classification over 21 offset classes with a cross-entropy loss. A hedged sketch of this second stage is given below.
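To make the second stage concrete, the sketch below shows one way a lightweight synchronization module could be implemented and trained on frozen, segment-level features. This is a minimal PyTorch sketch based on the description above, not the authors' code: the layer sizes, the zero-initialized tokens, and the names `SyncModule` and `sync_training_step` are illustrative assumptions.

```python
# Minimal sketch (PyTorch) of the second training stage described above.
# Class and function names are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class SyncModule(nn.Module):
    """Lightweight transformer that predicts the audio-visual offset class
    from pre-extracted, segment-level audio and visual features."""

    def __init__(self, dim=768, num_offset_classes=21, depth=3, heads=8):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))        # classification token
        self.modality_emb = nn.Parameter(torch.zeros(2, dim))  # audio vs. visual embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_offset_classes)         # logits over 21 offset bins

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (B, Sa, dim) - one token per audio segment
        # visual_feats: (B, Sv, dim) - one token per visual segment
        B = audio_feats.size(0)
        a = audio_feats + self.modality_emb[0]
        v = visual_feats + self.modality_emb[1]
        tokens = torch.cat([self.cls.expand(B, -1, -1), a, v], dim=1)
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                         # (B, 21) offset logits


def sync_training_step(sync_module, audio_feats, visual_feats, offset_labels, optimizer):
    """One stage-2 step: feature extractors stay frozen, only the sync module is
    updated with cross-entropy over the 21 offset classes."""
    logits = sync_module(audio_feats, visual_feats)
    loss = nn.functional.cross_entropy(logits, offset_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```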
On the synchronization task, Synchformer outperforms previous methods in both dense and sparse settings, with results reported on LRS3, VGGSound, and AudioSet. Performance is measured as top-1 accuracy over the 21 offset classes, with and without a ±1 class tolerance (a sketch of this metric is given after this paragraph). Notably, the model can be trained on sparse 'in-the-wild' data alone and still surpass methods that rely on dense cues. Ablation studies examine the contribution of individual components, including initialization, training strategy, segment overlap, and the choice of feature extractors; they show that segment-level contrastive pre-training and freezing the feature extractors during synchronization training yield significant gains. Finally, the model performs strongly on synchronizability prediction, a capability introduced in this work, making it a practical tool for audio-visual synchronization in the wild.
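The evaluation metric described above is straightforward to reproduce. The following is a small sketch, assuming offset predictions are given as logits over the 21 classes; the function name `offset_accuracy` is ours, not from the paper.

```python
# Hedged sketch of the evaluation protocol: top-1 accuracy over the 21 offset classes,
# optionally counting a prediction as correct if it lands within ±1 class of the
# ground-truth offset bin.
import torch


def offset_accuracy(logits: torch.Tensor, labels: torch.Tensor, tolerance: int = 0) -> float:
    """logits: (N, 21) offset-class scores; labels: (N,) ground-truth class indices."""
    preds = logits.argmax(dim=1)
    correct = (preds - labels).abs() <= tolerance
    return correct.float().mean().item()


# Usage: exact accuracy and accuracy with the ±1 class tolerance.
# acc_exact = offset_accuracy(logits, labels, tolerance=0)
# acc_tol1  = offset_accuracy(logits, labels, tolerance=1)
```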