Understanding Multi-Granularity Correspondence Learning from Long-Term Noisy Videos

The paper "Multi-Granularity Correspondence Learning from Long-Term Noisy Videos" addresses the challenge of learning temporal dependencies in long videos, a problem that is often overlooked because of its high computational cost. The authors propose NOise Robust Temporal Optimal traNsport (Norton), a method that uses optimal transport (OT) to address multi-granularity noisy correspondence (MNC) in video-paragraph and clip-caption alignment. MNC comprises coarse-grained misalignment (clip-caption) and fine-grained misalignment (frame-word). Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies, filters out irrelevant clips and captions, and identifies crucial words and key frames. It also leverages OT to handle faulty negative samples in clip-caption contrastive learning. Extensive experiments on video retrieval, videoQA, and action segmentation demonstrate the effectiveness of Norton. The method is efficient and robust, making it suitable for real-world applications.
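
To make the role of optimal transport in clip-caption alignment more concrete, the following is a minimal sketch of generic entropy-regularized OT solved with Sinkhorn iterations. It is not the authors' implementation: the function name `sinkhorn_plan`, the uniform marginals, the regularization strength `epsilon`, and the synthetic embeddings are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_plan(similarity, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT plan between rows (clips) and columns (captions).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    n, m = similarity.shape
    # Assumption: every clip and every caption carries equal mass.
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    # Gibbs kernel: higher similarity corresponds to lower transport cost.
    # Subtracting the max keeps the exponentials numerically stable.
    kernel = np.exp((similarity - similarity.max()) / epsilon)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)    # rescale rows toward marginal a
        v = b / (kernel.T @ u)  # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]


# Synthetic, L2-normalized embeddings: caption i is a noisy copy of clip i.
rng = np.random.default_rng(0)
clips = rng.normal(size=(4, 64))
captions = clips + 0.1 * rng.normal(size=(4, 64))
clips /= np.linalg.norm(clips, axis=1, keepdims=True)
captions /= np.linalg.norm(captions, axis=1, keepdims=True)

similarity = clips @ captions.T   # cosine similarity, shape (4, 4)
plan = sinkhorn_plan(similarity)  # soft clip-to-caption assignment
best_caption = plan.argmax(axis=1)
```

In this sketch, each row of `plan` acts as a soft assignment of one clip over all captions. Under the stated assumptions, such a plan could serve as a soft alignment target in place of hard one-to-one labels, which is one way to avoid treating semantically related pairs as pure negatives.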

MULTI-GRANULARITY CORRESPONDENCE LEARNING FROM LONG-TERM NOISY VIDEOS

30 Jan 2024 | Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng