2024 | Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng
This paper introduces Norton, a noise-robust temporal optimal transport (OT) framework for multi-granularity correspondence learning in long-term video understanding. The central challenge is the multi-granularity noisy correspondence (MNC) problem, which comprises coarse-grained misalignment between clips and captions and fine-grained misalignment between frames and words. Norton employs OT to capture long-term temporal dependencies by learning correspondences at both the clip-caption and video-paragraph levels. It handles coarse-grained misalignment by filtering out irrelevant clips and captions with an alignable prompt bucket and realigning asynchronous clip-caption pairs according to their transport distance. For fine-grained misalignment, it applies a soft-maximum operator to identify crucial words and key frames. In addition, Norton accounts for potential faulty negative samples in clip-caption contrast by rectifying the alignment target with the OT assignment, which supports precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation demonstrate the effectiveness of Norton.
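To make the alignment mechanism concrete, below is a minimal PyTorch sketch of the two ingredients summarized above: a log-sum-exp soft maximum over frame-word similarities for fine-grained alignment, and entropic (Sinkhorn) optimal transport with an appended prompt-bucket bin for coarse-grained realignment. This is not the official Norton implementation; the function names, shapes, and hyperparameters (temp, eps, n_iters, bucket_bias) are illustrative assumptions.

```python
# Hedged sketch (not the official Norton code): log-sum-exp soft maximum for
# fine-grained frame-word alignment, and Sinkhorn OT with an extra "alignable
# prompt bucket" bin that can absorb clips/captions lacking a real counterpart.
import torch
import torch.nn.functional as F

def soft_max_similarity(frame_feats, word_feats, temp=0.1):
    """Fine-grained clip-caption similarity via a log-sum-exp soft maximum.

    frame_feats: (n_frames, d), word_feats: (n_words, d), both L2-normalized.
    The soft maximum emphasizes the most relevant frame-word pairs instead of
    averaging over irrelevant ones.
    """
    sim = frame_feats @ word_feats.t() / temp           # (n_frames, n_words)
    return temp * torch.logsumexp(sim.flatten(), dim=0)

def sinkhorn_with_bucket(cost, eps=0.05, n_iters=50, bucket_bias=0.0):
    """Entropic OT over a clip-caption cost matrix with a prompt-bucket bin.

    cost: (n_clips, n_captions), e.g. negative similarity. The bucket lets
    clearly unalignable clips or captions be assigned to a dummy bin rather
    than forced onto a real counterpart. bucket_bias is an assumed constant.
    """
    n, m = cost.shape
    cost = F.pad(cost, (0, 1, 0, 1), value=bucket_bias)  # add bucket row/column
    K = torch.exp(-cost / eps)
    a = torch.ones(n + 1) / (n + 1)    # uniform marginals over clips + bucket
    b = torch.ones(m + 1) / (m + 1)    # uniform marginals over captions + bucket
    u, v = a.clone(), b.clone()
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    plan = torch.diag(u) @ K @ torch.diag(v)             # (n+1, m+1) transport plan
    return plan[:n, :m]                                   # drop the bucket entries

# Usage sketch: build a clip-caption similarity matrix with the soft maximum,
# run OT on its negation, and read a video-paragraph distance off the plan.
clips = [F.normalize(torch.randn(8, 64), dim=-1) for _ in range(4)]   # 4 clips x 8 frames
caps = [F.normalize(torch.randn(5, 64), dim=-1) for _ in range(4)]    # 4 captions x 5 words
sim = torch.stack([torch.stack([soft_max_similarity(c, t) for t in caps]) for c in clips])
plan = sinkhorn_with_bucket(-sim)              # realigned clip-caption assignment
video_paragraph_dist = -(plan * sim).sum()     # lower = better aligned
```

The transport plan is what permits realignment: asynchronous clip-caption pairs can exchange mass with their true counterparts, while unalignable content drains into the bucket instead of corrupting the video-paragraph distance.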
The method is computationally efficient and robust to noisy correspondence, making it suitable for real-world applications. The framework builds on optimal transport, which has been widely used in fields such as domain adaptation, clustering, and sequence alignment. Norton's approach addresses both coarse- and fine-grained misalignment, improving temporal learning and video understanding. It is evaluated on multiple tasks, including video-paragraph retrieval, text-to-video retrieval, action segmentation, and videoQA, showing significant improvements over existing methods. The results indicate that Norton not only captures long-term temporal dependencies but also facilitates clip-level representation learning, and the framework scales with minimal additional computational cost. The paper also discusses the limitations of existing methods and proposes remedies such as the alignable prompt bucket and the rectification of faulty negative samples. Overall, Norton provides a robust and efficient solution for multi-granularity correspondence learning in long-term video understanding.
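As a complement, the sketch below illustrates the faulty-negative rectification idea: rather than a one-hot contrastive target, the target is blended with an OT assignment over in-batch clip-caption similarities, so likely faulty negatives on the off-diagonal are not pushed apart. Again, this is an assumption-laden illustration rather than the paper's exact loss; alpha, temp, eps, and n_iters are assumed values.

```python
# Hedged sketch of rectifying faulty negatives in clip-caption contrast:
# the usual identity target is softened with an entropic-OT assignment over
# the in-batch similarity matrix. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def ot_assignment(sim, eps=0.05, n_iters=50):
    """Entropic OT (Sinkhorn) assignment with uniform marginals; sim is a similarity matrix."""
    K = torch.exp(sim / eps)
    a = torch.ones(sim.size(0)) / sim.size(0)
    b = torch.ones(sim.size(1)) / sim.size(1)
    u, v = a.clone(), b.clone()
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return torch.diag(u) @ K @ torch.diag(v)

def rectified_contrastive_loss(clip_feats, cap_feats, temp=0.07, alpha=0.5):
    """clip_feats, cap_feats: (batch, d), L2-normalized and index-paired."""
    logits = clip_feats @ cap_feats.t() / temp
    with torch.no_grad():
        plan = ot_assignment(clip_feats @ cap_feats.t())
        plan = plan / plan.sum(dim=1, keepdim=True)        # row-normalize to a distribution
        # Blend the one-hot target with the OT assignment so potential faulty
        # negatives on the off-diagonal receive some positive mass.
        target = alpha * torch.eye(logits.size(0)) + (1 - alpha) * plan
    log_prob = F.log_softmax(logits, dim=1)
    return -(target * log_prob).sum(dim=1).mean()
```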