Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling

3 Jun 2024 | Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
The paper addresses Audio-Visual Video Parsing (AVVP), the task of identifying and temporally localizing events in audible videos. The task is typically performed in a weakly-supervised manner: only video-level event labels are provided, with no indication of which modality an event belongs to or when it occurs. To strengthen supervision, the paper proposes a pseudo label generation strategy that explicitly assigns labels to each video segment using pre-trained models such as CLIP and CLAP. These models estimate the events in each segment and produce segment-level visual and audio pseudo labels. A new loss function exploits these pseudo labels by considering their category-richness and segment-richness. Additionally, a label denoising strategy refines the visual pseudo labels by flipping the labels of segments with abnormally large forward losses. Extensive experiments on the LLP dataset demonstrate the effectiveness of the proposed methods, which achieve state-of-the-art performance in all types of event parsing. The method can also be integrated into existing AVVP frameworks and extended to related tasks such as audio-visual event localization.
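To make the segment-wise idea concrete, here is a minimal Python sketch of the two steps described above: turning per-segment category scores into segment-level pseudo labels, and flipping labels whose loss is abnormally large. It is an illustration under stated assumptions, not the paper's implementation. The function names, the fixed `threshold`, and the "loss greater than `ratio` times the category's mean loss" criterion are all hypothetical; the scores are assumed to come from a pre-trained model such as CLIP or CLAP and to lie in [0, 1].

```python
from typing import List


def segment_pseudo_labels(
    scores: List[List[float]],
    video_labels: List[int],
    threshold: float = 0.5,
) -> List[List[int]]:
    """Assign a binary pseudo label to every (segment, category) pair.

    scores[t][c]    : score of category c in segment t (assumed in [0, 1],
                      e.g. derived from CLIP or CLAP similarities).
    video_labels[c] : 1 if category c appears in the weak video-level label.
    threshold       : hypothetical cut-off, not a value taken from the paper.
    """
    return [
        # A category is kept for a segment only if the video-level label
        # contains it AND the segment's score reaches the threshold.
        [int(video_labels[c] == 1 and s >= threshold) for c, s in enumerate(row)]
        for row in scores
    ]


def flip_high_loss_segments(
    labels: List[List[int]],
    losses: List[List[float]],
    ratio: float = 2.0,
) -> List[List[int]]:
    """Flip pseudo labels whose loss is abnormally large.

    "Abnormally large" is defined here (as an assumption) as a loss greater
    than `ratio` times the mean loss of that category over all segments.
    """
    num_segments = len(labels)
    num_categories = len(labels[0])
    refined = [row[:] for row in labels]  # copy so the input stays unchanged
    for c in range(num_categories):
        mean_loss = sum(losses[t][c] for t in range(num_segments)) / num_segments
        for t in range(num_segments):
            if losses[t][c] > ratio * mean_loss:
                refined[t][c] = 1 - labels[t][c]  # flip 0 <-> 1
    return refined
```

In this sketch, masking by `video_labels` keeps the pseudo labels consistent with the weak video-level supervision, so a segment can never be assigned an event that the video label does not contain. The denoising step is shown for a generic label matrix; the summary applies it to the visual pseudo labels.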