State Space Models for Event Cameras

Seattle, 2024 | Nikola Zubić, Mathias Gehrig, Davide Scaramuzza
The paper introduces State-Space Models (SSMs) for event cameras to address the challenges of training efficiency and performance degradation at different inference frequencies. Traditional recurrent neural networks (RNNs) used in event-based vision suffer from slow training and poor generalization when deployed at frequencies different from those used during training. SSMs, with learnable timescale parameters, enable faster training and better adaptability to varying frequencies without retraining. The authors evaluate their approach on the Gen1 and 1 Mpx event camera datasets, demonstrating a 33% increase in training speed and an average performance drop of only 3.76 mAP between training and testing frequencies, compared to 21.25 mAP for traditional RNNs and 24.53 mAP for Transformer-based models. They also introduce two strategies, frequency-selective masking and the $H_2$ norm, to mitigate aliasing effects at higher frequencies. The proposed SSM-ViT model outperforms existing methods in terms of both performance and efficiency, making it a valuable contribution to the field of event-based vision.
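
To make the core mechanism concrete, the following is a minimal, hedged sketch in NumPy rather than the authors' implementation: a diagonal linear state-space layer discretized with a learnable timescale delta. Under this reading, adapting to a new inference frequency amounts to rescaling delta by the ratio of training to testing frequency, and one plausible form of frequency-selective masking is to zero out modes whose oscillation frequency exceeds the Nyquist limit of the training step. All names (discretize_zoh, adapt_delta, frequency_selective_mask, f_train, f_test) are illustrative assumptions, and the $H_2$-norm strategy is not sketched.

```python
import numpy as np

# Hypothetical sketch (not the authors' code). All function and variable
# names are illustrative assumptions; the paper's H2-norm strategy is
# not shown here.

def discretize_zoh(a, b, delta):
    """Zero-order-hold discretization of x'(t) = a*x(t) + b*u(t), diagonal a."""
    a_bar = np.exp(delta * a)          # discrete state transition
    b_bar = (a_bar - 1.0) / a * b      # discrete input matrix
    return a_bar, b_bar

def ssm_scan(u, a, b, c, delta, mode_mask=None):
    """Run the discretized SSM over a 1-D input sequence u of length T."""
    a_bar, b_bar = discretize_zoh(a, b, delta)
    if mode_mask is not None:          # suppress masked modes entirely
        b_bar = b_bar * mode_mask
    x = np.zeros_like(a)               # hidden state, one entry per mode
    ys = []
    for u_t in u:
        x = a_bar * x + b_bar * u_t    # linear recurrence
        ys.append(np.real(np.dot(c, x)))
    return np.array(ys)

def adapt_delta(delta_train, f_train, f_test):
    """Rescale the learned timescale when inputs arrive at a new frequency."""
    return delta_train * (f_train / f_test)

def frequency_selective_mask(a, delta_train):
    """One plausible form of frequency-selective masking (an assumption, not
    necessarily the paper's exact rule): keep only modes whose oscillation
    frequency lies below the Nyquist limit of the training step, since
    faster modes were aliased during training."""
    nyquist = np.pi / delta_train
    return (np.abs(np.imag(a)) < nyquist).astype(float)

# Toy usage: a 4-mode SSM trained at 20 Hz, deployed at 100 Hz.
rng = np.random.default_rng(0)
a = -0.5 + 1j * rng.uniform(0.0, 200.0, size=4)   # stable complex modes
b = rng.standard_normal(4) + 0j
c = rng.standard_normal(4) + 0j
delta_train, f_train, f_test = 0.05, 20.0, 100.0  # step = 1 / frequency

delta_test = adapt_delta(delta_train, f_train, f_test)   # 0.01 s
mask = frequency_selective_mask(a, delta_train)          # rule: drop modes above ~63 rad/s
y = ssm_scan(np.ones(32), a, b, c, delta_test, mode_mask=mask)
print(y.shape)  # (32,)
```

The property this sketch relies on is that the continuous-time parameters (a, b, c) stay fixed; only the discretization step changes with the input frequency, which is why no retraining is required.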