23 May 2024 | Wenrui Li, Xiaopeng Hong, Ruiqin Xiong, Xiaopeng Fan
SpikeMba is a multi-modal spiking saliency Mamba for temporal video grounding. It integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to address two challenges: confidence bias toward salient objects and the modeling of long-term dependencies in video sequences.

The SNN-based spiking saliency detector produces a dynamic, binary saliency proposal set by emitting a spike whenever its input signal exceeds a predefined threshold. Relevant slots, learnable tensors that encode prior knowledge, help the model maintain contextual information, while the SSMs propagate information selectively, enabling long-term dependencies to be captured. On top of these components, a contextual moment reasoner dynamically leverages the relevant slots for semantic association and inference, and a multi-modal relevant Mamba block strengthens long-range dependency modeling. Illustrative sketches of the spiking mechanism, the selective SSM, and the training objective follow below.

The training strategy combines a contrastive loss, a saliency proposal loss, and an entropy loss to improve feature representation and saliency detection.

Experiments show that SpikeMba outperforms state-of-the-art methods on benchmark datasets, achieving high accuracy and efficiency in video grounding and demonstrating its effectiveness at capturing fine-grained multimodal relationships. Its ability to handle complex video content while maintaining contextual information makes it a promising approach for temporal video grounding. A remaining limitation is the integration of heterogeneous outputs from the SNNs and the Mamba framework, which calls for more effective system-integration strategies.
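To make the spiking mechanism concrete, here is a minimal PyTorch sketch of a leaky integrate-and-fire style detector that emits a binary spike when its accumulated input crosses a threshold, as the summary describes. The class name, the linear scoring layer, and the leak and threshold values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SpikingSaliencyDetector(nn.Module):
    """Minimal leaky integrate-and-fire (LIF) style detector (illustrative).

    Integrates a per-frame relevance score over time and emits a binary
    spike whenever the membrane potential crosses `threshold`, yielding
    the dynamic, binary saliency proposal set described in the summary.
    """

    def __init__(self, dim: int, threshold: float = 1.0, decay: float = 0.9):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-frame relevance score (assumed)
        self.threshold = threshold      # firing threshold (assumed value)
        self.decay = decay              # membrane leak factor (assumed value)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) fused video-text features
        b, t, _ = frames.shape
        membrane = frames.new_zeros(b)
        spikes = []
        for step in range(t):
            # leaky integration of the input signal
            membrane = self.decay * membrane + self.score(frames[:, step]).squeeze(-1)
            fired = (membrane > self.threshold).float()  # binary spike
            membrane = membrane * (1.0 - fired)          # hard reset after firing
            spikes.append(fired)
        return torch.stack(spikes, dim=1)  # (batch, time) binary proposal mask
```

Note that the hard threshold is non-differentiable, so SNNs of this kind are typically trained with surrogate gradients; that detail is omitted here for brevity.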
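The selective propagation attributed to the SSMs can be illustrated with a toy Mamba-style recurrence whose transition, input, and readout parameters all depend on the current input, so the model can decide at each step how much history to carry forward. The parameterization below is a readable stand-in under assumed shapes; real Mamba implementations use a fused parallel scan rather than a Python loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Toy selective state-space recurrence (Mamba-style); illustration only.

    The transition is input-dependent ("selective"), which is the mechanism
    the summary credits for long-range dependency modeling.
    """

    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.log_A = nn.Parameter(0.1 * torch.randn(dim, state))  # base decay rates
        self.to_dt = nn.Linear(dim, dim)    # input-dependent step size (assumed)
        self.to_B = nn.Linear(dim, state)   # input-dependent input gate
        self.to_C = nn.Linear(dim, state)   # input-dependent readout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> (batch, time, dim)
        b, t, d = x.shape
        h = x.new_zeros(b, d, self.log_A.shape[1])  # per-channel hidden state
        ys = []
        for step in range(t):
            xt = x[:, step]                                # (b, d)
            dt = F.softplus(self.to_dt(xt)).unsqueeze(-1)  # (b, d, 1) step size
            A = torch.exp(-dt * torch.exp(self.log_A))     # (b, d, state), in (0, 1)
            h = A * h + self.to_B(xt).unsqueeze(1) * xt.unsqueeze(-1)  # selective update
            ys.append((h * self.to_C(xt).unsqueeze(1)).sum(-1))        # (b, d) readout
        return torch.stack(ys, dim=1)
```

Because the decay `A` is computed from the input, irrelevant frames can be forgotten quickly while informative ones persist in the state, which is what distinguishes this from a fixed linear recurrence.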
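The three training terms could be combined as sketched below. The exact loss forms and weights are assumptions (InfoNCE for the contrastive term, binary cross-entropy for the saliency proposals, and a binary entropy regularizer); the summary only states that a contrastive loss, a saliency proposal loss, and an entropy loss are combined.

```python
import torch
import torch.nn.functional as F

def training_loss(video_emb, text_emb, spike_logits, saliency_target,
                  w_contrast=1.0, w_sal=1.0, w_ent=0.1, tau=0.07):
    """Illustrative combination of the three losses named in the summary.

    All loss forms, weights, and the temperature `tau` are assumptions made
    for this sketch, not values taken from the paper.
    """
    # Contrastive (InfoNCE) term: matched video/text pairs are positives.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau
    labels = torch.arange(len(v), device=v.device)
    l_contrast = F.cross_entropy(logits, labels)

    # Saliency proposal term: binary supervision on per-frame proposals.
    l_sal = F.binary_cross_entropy_with_logits(spike_logits, saliency_target)

    # Entropy term: penalizing entropy sharpens per-frame saliency predictions.
    p = torch.sigmoid(spike_logits).clamp(1e-6, 1 - 1e-6)
    l_ent = -(p * p.log() + (1 - p) * (1 - p).log()).mean()

    return w_contrast * l_contrast + w_sal * l_sal + w_ent * l_ent
```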