SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding


23 May 2024 | Wenrui Li, Xiaopeng Hong, Ruiqin Xiong, Xiaopeng Fan
**Authors:** Wenrui Li, Xiaopeng Hong, Ruiqin Xiong, Xiaopeng Fan
**Affiliations:** Harbin Institute of Technology, Peking University
**Contact:** liwr618@163.com, hongxiaopeng@ieee.org, rxiong@pku.edu.cn, fxp@hit.edu.cn

**Abstract:** Temporal video grounding (TVG) is a critical task in video content understanding, requiring precise alignment between video content and natural language instructions. Existing methods often struggle with confidence bias towards salient objects and with capturing long-term dependencies in video sequences. To address these issues, the authors introduce SpikeMba, a multi-modal spiking saliency Mamba for temporal video grounding. SpikeMba integrates spiking neural networks (SNNs) with state space models (SSMs) to exploit their complementary strengths: SNNs drive a spiking saliency detector that generates a dynamic, binary set of saliency proposals; relevant slots (learnable tensors that encode prior knowledge) strengthen the model's ability to retain and infer contextual information; and SSMs enable selective information propagation, addressing long-term dependencies in video content. Experiments show that SpikeMba consistently outperforms state-of-the-art methods on mainstream benchmarks, with clear gains in capturing fine-grained multimodal relationships.

**Contributions:**
1. A novel spiking saliency detector that uses the threshold mechanism of SNNs to generate a binary sequence of potential saliency proposals.
2. Relevant slots that selectively encode prior knowledge, deepening the model's understanding of video content.
3. SSMs that selectively propagate or forget information, effectively addressing long-term dependency in video content.

**Methods:** (illustrative sketches of the components follow this list)
- **State Space Model (SSM):** Describes the evolution of a linear system's state over time, enabling efficient sequence processing.
- **Contextual Moment Reasoner (CMR):** Uses relevant slots to balance the context of current moments with their semantic relevance.
- **Spiking Saliency Detector (SSD):** Converts continuous feature sequences into discrete spiking sequences, identifying salient moments in videos.
- **Multi-modal Relevant Mamba (MRM):** Integrates processed video and text features using linear transformations and convolutional layers.
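For reference, the standard continuous-time linear SSM and its discretization (the formulation underlying Mamba-style selective SSMs; SpikeMba's exact parameterization may differ) is

$$h'(t) = \mathbf{A}\,h(t) + \mathbf{B}\,x(t), \qquad y(t) = \mathbf{C}\,h(t),$$

which, discretized with step size $\Delta$ under a zero-order hold, yields the recurrence

$$\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\bigl(\exp(\Delta\mathbf{A}) - \mathbf{I}\bigr)\,\Delta\mathbf{B}, \qquad h_t = \bar{\mathbf{A}}\,h_{t-1} + \bar{\mathbf{B}}\,x_t, \qquad y_t = \mathbf{C}\,h_t.$$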
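The SSD's threshold mechanism can be pictured as a leaky integrate-and-fire (LIF) style neuron run over per-clip saliency scores. The sketch below is only illustrative; the class and parameter names (`LIFSaliency`, `tau`, `v_threshold`) are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LIFSaliency(nn.Module):
    """Leaky integrate-and-fire style thresholding over a score sequence.

    Accumulates a membrane potential across time steps and emits a binary
    spike (saliency proposal) whenever the potential crosses the threshold,
    then resets the potential for the positions that fired.
    """

    def __init__(self, tau: float = 2.0, v_threshold: float = 1.0):
        super().__init__()
        self.tau = tau
        self.v_threshold = v_threshold

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, time) continuous saliency scores per video clip
        v = torch.zeros_like(scores[:, 0])
        spikes = []
        for t in range(scores.shape[1]):
            # leaky integration of the incoming score
            v = v + (scores[:, t] - v) / self.tau
            spike = (v >= self.v_threshold).float()
            spikes.append(spike)
            # hard reset wherever a spike was emitted
            v = v * (1.0 - spike)
        return torch.stack(spikes, dim=1)  # binary proposal set, (batch, time)


# Example: 2 videos, 8 clips each
proposals = LIFSaliency()(torch.rand(2, 8))
```

Note that the hard threshold is non-differentiable; SNN frameworks typically train through it with a surrogate gradient, whereas this sketch covers the forward pass only.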
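For the CMR and MRM, the summary above only gives a high-level description. A minimal sketch of the general idea, learnable relevant slots attended against video-text features that have been linearly projected and fused with a temporal convolution, is given below. All module and parameter names are hypothetical, and the Mamba selective scan itself is omitted for brevity.

```python
import torch
import torch.nn as nn


class RelevantSlotFusion(nn.Module):
    """Illustrative fusion of video/text features with learnable relevant slots.

    Video and text tokens are projected to a shared width, fused along time
    with a 1D convolution, and then cross-attended against a small bank of
    learnable slot vectors that act as prior knowledge.
    """

    def __init__(self, d_video: int, d_text: int, d_model: int = 256, n_slots: int = 8):
        super().__init__()
        self.proj_v = nn.Linear(d_video, d_model)
        self.proj_t = nn.Linear(d_text, d_model)
        self.fuse = nn.Conv1d(2 * d_model, d_model, kernel_size=3, padding=1)
        self.slots = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, Tv, d_video); text: (B, Tt, d_text)
        v = self.proj_v(video)                               # (B, Tv, d_model)
        t = self.proj_t(text).mean(dim=1, keepdim=True)      # pooled text query
        x = torch.cat([v, t.expand(-1, v.shape[1], -1)], dim=-1)
        x = self.fuse(x.transpose(1, 2)).transpose(1, 2)     # temporal fusion
        slots = self.slots.unsqueeze(0).expand(x.shape[0], -1, -1)
        ctx, _ = self.attn(query=x, key=slots, value=slots)  # read prior knowledge
        return x + ctx                                        # slot-enriched moment features


# Example: 2 videos of 75 clips (dim 512) and queries of 20 tokens (dim 768)
feats = RelevantSlotFusion(d_video=512, d_text=768)(torch.rand(2, 75, 512), torch.rand(2, 20, 768))
```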
**Training Strategy:**
- Optimized with the Adam optimizer and a weight decay of 1e-4.
- The loss function combines a contrastive loss, a saliency proposal loss, and an entropy loss (a minimal sketch of this objective follows the Experiments section).

**Experiments:**
- Compared against state-of-the-art methods on mainstream benchmarks including QVHighlights, Charades-STA, TACoS, TVSum, and YouTube Highlights.
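A minimal sketch of how the stated training recipe could be wired up: Adam with weight decay 1e-4 and a weighted sum of a contrastive loss, a saliency proposal loss, and an entropy term. The loss weights, temperature, learning rate, and the concrete loss implementations are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def total_loss(video_emb, text_emb, saliency_logits, saliency_labels,
               w_contrast=1.0, w_saliency=1.0, w_entropy=0.1, temperature=0.07):
    """Weighted sum of contrastive, saliency-proposal, and entropy losses (illustrative)."""
    # InfoNCE-style contrastive loss between pooled video and text embeddings
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    l_contrast = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

    # Binary saliency-proposal loss over clips
    l_saliency = F.binary_cross_entropy_with_logits(saliency_logits, saliency_labels)

    # Entropy term over the predicted saliency distribution
    p = torch.sigmoid(saliency_logits)
    l_entropy = -(p * (p + 1e-8).log() + (1 - p) * (1 - p + 1e-8).log()).mean()

    return w_contrast * l_contrast + w_saliency * l_saliency + w_entropy * l_entropy


# Optimizer as described in the summary (the learning rate here is an assumption):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
```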