TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection


5 Jan 2024 | Hao Sun, Mingyao Zhou, Wenjing Chen, Wei Xie
The paper "TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection" addresses the challenges of video moment retrieval (MR) and highlight detection (HD) based on natural language queries. It proposes a task-reciprocal transformer (TR-DETR) that leverages the inherent reciprocity between MR and HD to improve performance. The key contributions of the paper include: 1. **Task Reciprocity**: The paper highlights the reciprocal relationship between MR and HD, where highlight scores from HD can assist in MR, and MR results can enhance HD. 2. **Local-Global Multi-Modal Alignment**: A module is introduced to align visual and textual features, ensuring semantic alignment before modal interaction. This module includes local and global regularization components to distinguish irrelevant clips and ensure unified semantic spaces. 3. **Visual Feature Refinement**: A module is designed to refine visual features using textual features, eliminating query-irrelevant information and improving the discrimination of joint features. 4. **Task Cooperation**: A task cooperation module is proposed to enhance prediction outcomes by exploiting the complementarity between MR and HD. This module includes HD2MR and MR2HD components, where HD2MR injects highlight scores into the MR pipeline, and MR2HD uses retrieved moments to refine highlight scores. 5. **Experiments**: Extensive experiments on datasets such as QVHighlights, Charades-STA, and TV-Sum demonstrate that TR-DETR outperforms existing state-of-the-art methods. The paper also discusses related works, method details, and ablation studies to validate the effectiveness of each component. The results show that TR-DETR significantly improves performance in both MR and HD tasks, particularly in handling stringent metrics and high IOU thresholds.The paper "TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection" addresses the challenges of video moment retrieval (MR) and highlight detection (HD) based on natural language queries. It proposes a task-reciprocal transformer (TR-DETR) that leverages the inherent reciprocity between MR and HD to improve performance. The key contributions of the paper include: 1. **Task Reciprocity**: The paper highlights the reciprocal relationship between MR and HD, where highlight scores from HD can assist in MR, and MR results can enhance HD. 2. **Local-Global Multi-Modal Alignment**: A module is introduced to align visual and textual features, ensuring semantic alignment before modal interaction. This module includes local and global regularization components to distinguish irrelevant clips and ensure unified semantic spaces. 3. **Visual Feature Refinement**: A module is designed to refine visual features using textual features, eliminating query-irrelevant information and improving the discrimination of joint features. 4. **Task Cooperation**: A task cooperation module is proposed to enhance prediction outcomes by exploiting the complementarity between MR and HD. This module includes HD2MR and MR2HD components, where HD2MR injects highlight scores into the MR pipeline, and MR2HD uses retrieved moments to refine highlight scores. 5. **Experiments**: Extensive experiments on datasets such as QVHighlights, Charades-STA, and TV-Sum demonstrate that TR-DETR outperforms existing state-of-the-art methods. The paper also discusses related works, method details, and ablation studies to validate the effectiveness of each component. 
The results show that TR-DETR significantly improves performance on both MR and HD, particularly under stringent metrics and high IoU thresholds.