Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling

2024 | Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
This paper proposes a new approach to weakly-supervised audio-visual video parsing (AVVP) built on a segment-wise pseudo labeling strategy. The AVVP task is to identify and temporally localize the events occurring in the audio and visual streams of audible videos. Traditional methods rely only on video-level labels, which are insufficient for precise temporal localization.

The proposed method generates segment-level pseudo labels with the pre-trained models CLIP and CLAP, which estimate the events present in each video segment and yield visual and audio pseudo labels, respectively. A new loss function exploits these pseudo labels by accounting for their category-richness and segment-richness. A label denoising strategy further refines the visual pseudo labels by flipping them when abnormally large forward losses occur. (Illustrative sketches of these three steps are given below.)

Evaluated on the LLP dataset, the method achieves state-of-the-art performance across all types of event parsing: audio, visual, and audio-visual events. The pseudo labels are flexible and can be combined with other video parsing backbones to improve their performance. Applying the method to the related weakly-supervised audio-visual event localization task further verifies its benefits and generalization.

The key contributions are: (1) a new pseudo label generation strategy, (2) a pseudo label exploitation strategy with a richness-aware loss, and (3) a pseudo label denoising strategy. Together these yield high-quality segment-level pseudo labels that support effective video parsing.
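To make the pseudo label generation step concrete, the following is a minimal sketch (not the authors' released code) of how segment-level labels can be derived by matching CLIP visual or CLAP audio segment embeddings against text embeddings of the event class names. The threshold `tau`, the video-label masking, and the embedding shapes are illustrative assumptions.

```python
# Illustrative sketch of segment-wise pseudo label generation.
# Visual segments would be scored with CLIP and audio segments with CLAP
# against text embeddings of the event class names; classes whose
# similarity exceeds a threshold become positive labels for that segment.
import torch
import torch.nn.functional as F

def segment_pseudo_labels(segment_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          video_label: torch.Tensor,
                          tau: float = 0.25) -> torch.Tensor:
    """
    segment_embeds: (T, D) CLIP (visual) or CLAP (audio) segment features.
    text_embeds:    (C, D) text embeddings of class-name prompts.
    video_label:    (C,)   binary video-level label, used to mask classes
                           that cannot appear in this video.
    Returns a (T, C) binary pseudo-label matrix. `tau` is a hypothetical
    threshold, not a value from the paper.
    """
    seg = F.normalize(segment_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    sim = seg @ txt.t()                    # (T, C) cosine similarities
    pseudo = (sim > tau).float()           # threshold per segment/class
    return pseudo * video_label            # weak labels constrain classes
```

Masking with the video-level label keeps the pseudo labels consistent with the weak supervision: a class absent from the video's label can never be assigned to any of its segments.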
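The richness-aware loss can be pictured as a binary cross-entropy in which each pseudo-label entry is re-weighted by how many of the video's categories a segment contains (category-richness) and how many segments a category spans (segment-richness). The specific weighting `1 + cat_rich + seg_rich` below is a hypothetical choice for illustration; the paper defines its own richness terms.

```python
# Sketch of a richness-aware loss over segment-level pseudo labels.
import torch
import torch.nn.functional as F

def richness_aware_loss(logits: torch.Tensor,
                        pseudo: torch.Tensor) -> torch.Tensor:
    """
    logits: (T, C) per-segment event logits from the parsing model.
    pseudo: (T, C) binary segment-level pseudo labels.
    BCE re-weighted by category-richness (per segment) and
    segment-richness (per category), both computed from the pseudo labels.
    """
    T, C = pseudo.shape
    num_video_classes = (pseudo > 0).any(dim=0).float().sum().clamp(min=1.0)
    # Category-richness: fraction of the video's classes active in a segment.
    cat_rich = pseudo.sum(dim=1, keepdim=True) / num_video_classes  # (T, 1)
    # Segment-richness: fraction of segments in which each class is active.
    seg_rich = pseudo.sum(dim=0, keepdim=True) / T                  # (1, C)
    weight = 1.0 + cat_rich + seg_rich      # emphasize richer labels (assumed form)
    bce = F.binary_cross_entropy_with_logits(logits, pseudo, reduction="none")
    return (weight * bce).mean()
```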
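The denoising step can be sketched as a large-loss criterion: compute the per-entry forward loss of the model's predictions against the visual pseudo labels and flip the entries responsible for abnormally large losses, treating them as label noise. The `flip_ratio` hyperparameter is a hypothetical stand-in for whatever criterion the paper uses to decide "abnormally large".

```python
# Sketch of pseudo label denoising via a large-loss flipping rule.
import torch
import torch.nn.functional as F

@torch.no_grad()
def denoise_visual_labels(logits: torch.Tensor,
                          pseudo: torch.Tensor,
                          flip_ratio: float = 0.05) -> torch.Tensor:
    """
    Flip the pseudo-label entries with the largest forward losses,
    treating them as likely noise (illustrative sketch).
    logits, pseudo: (T, C); flip_ratio is an assumed hyperparameter.
    """
    loss = F.binary_cross_entropy_with_logits(logits, pseudo, reduction="none")
    k = max(1, int(flip_ratio * loss.numel()))
    idx = torch.topk(loss.flatten(), k).indices   # positions of abnormal losses
    cleaned = pseudo.clone().flatten()
    cleaned[idx] = 1.0 - cleaned[idx]             # flip 0 <-> 1
    return cleaned.view_as(pseudo)
```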