2024-03-05 | Yuxin Guo, Shijie Ma, Hu Su, Zhiqing Wang, Yuhao Zhao, Wei Zou, Siyang Sun, Yun Zheng
This paper proposes Dual Mean-Teacher (DMT), a novel semi-supervised learning framework for Audio-Visual Source Localization (AVSL). The framework addresses inaccurate localization, blurry boundaries, and false positives in AVSL by employing two teacher-student structures to mitigate confirmation bias. DMT pre-trains two teachers on limited labeled data, then uses the consensus between their predictions to filter out noisy samples and generate high-quality pseudo-labels. These pseudo-labels are used to train the students, and the teachers are in turn updated as an exponential moving average (EMA) of the students. DMT achieves a CIoU of 90.4% on Flickr-SoundNet and 48.8% on VGG-Sound Source, outperforming existing methods by 8.9%, 9.6% and 4.6%, 6.4%, respectively. The framework also extends to existing AVSL methods, consistently boosting their performance. By reducing confirmation bias, DMT makes better use of both labeled and unlabeled data, making it a robust solution for AVSL tasks.
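
The two mechanisms named in the summary, EMA weight updates and consensus-based pseudo-labeling, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the IoU agreement test, the 0.5 and 0.7 thresholds, and the choice to intersect the two teachers' masks are all assumptions made for illustration. The EMA follows the usual mean-teacher convention, in which the teacher weights track the student weights.

```python
from typing import Dict, List, Optional

Heatmap = List[List[float]]  # per-pixel localization scores in [0, 1]
Mask = List[List[int]]       # binary localization map, 1 = sounding region


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               decay: float = 0.999) -> Dict[str, float]:
    """Move each teacher weight toward the student weight by an EMA step."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


def binarize(heatmap: Heatmap, threshold: float = 0.5) -> Mask:
    """Threshold a score map into a binary mask (threshold is an assumption)."""
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = sum(x & y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    union = sum(x | y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    return inter / union if union else 0.0


def consensus_pseudo_label(map_a: Heatmap, map_b: Heatmap,
                           agree_threshold: float = 0.7) -> Optional[Mask]:
    """Return a pseudo-label only when the two teachers agree.

    Hypothetical rule: if the teachers' masks overlap with IoU below
    agree_threshold, the sample is treated as noisy and skipped (None).
    Otherwise the pseudo-label is the intersection of the two masks.
    """
    mask_a, mask_b = binarize(map_a), binarize(map_b)
    if iou(mask_a, mask_b) < agree_threshold:
        return None
    return [[x & y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]
```

In a training loop under these assumptions, `consensus_pseudo_label` would be called on the two teachers' outputs for each unlabeled sample, the students would be trained on the samples that return a mask, and `ema_update` would then refresh each teacher from its own student.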