State Space Models for Event Cameras

2024 | Nikola Zubić, Mathias Gehrig, Davide Scaramuzza
This paper introduces state-space models (SSMs) for event cameras, addressing two challenges: training efficiency and generalization across varying temporal frequencies. The proposed SSM-ViT architecture integrates SSMs with Vision Transformers (ViTs) for efficient and accurate event-based vision. SSMs such as S4, S4D, and S5 model the temporal dynamics, and the key innovation is their learnable timescale parameter, which lets a trained model adapt to different inference frequencies without retraining. The SSM-ViT block processes event data through a hierarchical backbone, incorporating SSMs to capture temporal information while maintaining computational efficiency. Two strategies are introduced to mitigate the aliasing effects that arise at higher inference frequencies: frequency-selective masking and H2-norm regularization.

Experiments show that the SSM-based models train 33% faster and suffer a much smaller performance drop (3.76 mAP) when evaluated at frequencies higher than the training frequency, compared to existing methods such as RVT and GET. Evaluations on the Gen1 and 1 Mpx event camera datasets demonstrate superior performance and generalization, and ablation studies confirm both the contribution of the SSM layers to detection performance and the effectiveness of the two anti-aliasing strategies. Overall, SSM-ViT outperforms prior methods in training efficiency and in generalization across inference frequencies, suggesting that SSMs are a promising direction for event-based vision in high-speed, dynamic environments.
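The frequency adaptation hinges on how the continuous-time SSM is discretized, which the summary only names. As a minimal sketch (plain NumPy, with shapes, initializations, and sampling rates chosen for illustration rather than taken from the paper), the diagonal S4D/S5-style layer below is discretized with a zero-order hold using a learnable timescale Delta; running the same trained weights at a higher inference frequency then amounts to shrinking Delta by the ratio of the training and inference rates.

```python
import numpy as np

# Minimal sketch of a diagonal (S4D/S5-style) linear state-space layer.
# Shapes, initializations, and rates are illustrative, not the authors' code.
class DiagonalSSM:
    def __init__(self, state_dim, dt):
        # Continuous-time diagonal dynamics: x'(t) = A x(t) + B u(t), y(t) = C x(t)
        rng = np.random.default_rng(0)
        self.A = -np.abs(rng.standard_normal(state_dim))  # stable real poles
        self.B = rng.standard_normal(state_dim)
        self.C = rng.standard_normal(state_dim)
        self.dt = dt                                      # learnable timescale (Delta)

    def discretize(self, dt):
        # Zero-order-hold discretization, elementwise because A is diagonal:
        #   A_bar = exp(dt * A),   B_bar = (A_bar - 1) / A * B
        A_bar = np.exp(dt * self.A)
        B_bar = (A_bar - 1.0) / self.A * self.B
        return A_bar, B_bar

    def forward(self, u, dt=None):
        # Run the linear recurrence over a 1-D input sequence u.
        A_bar, B_bar = self.discretize(self.dt if dt is None else dt)
        x = np.zeros_like(self.A)
        outputs = []
        for u_k in u:
            x = A_bar * x + B_bar * u_k
            outputs.append(np.dot(self.C, x))
        return np.array(outputs)

# Train at 20 Hz, then run inference at 60 Hz by rescaling the timescale only.
ssm = DiagonalSSM(state_dim=16, dt=1.0 / 20)
signal_20hz = np.sin(np.linspace(0.0, 10.0, 200))   # 10 s sampled at 20 Hz
signal_60hz = np.sin(np.linspace(0.0, 10.0, 600))   # same signal at 60 Hz
y_train_rate = ssm.forward(signal_20hz)
y_fast_rate = ssm.forward(signal_60hz, dt=ssm.dt * (20 / 60))
```

Because the timescale is the only quantity tied to the sampling rate, the continuous-time parameters A, B, and C stay fixed, which is why no retraining is needed when the inference frequency changes.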
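The summary mentions frequency-selective masking without detail. As a rough illustration of the general idea only (the cutoff, the kernel, and the helper function below are assumptions for demonstration, not the authors' exact procedure), one way to realize such masking is to bandlimit an SSM's impulse-response kernel by zeroing FFT bins above a chosen fraction of the Nyquist frequency, so that spectral content the model never saw at the training rate cannot alias when inference runs faster.

```python
import numpy as np

def bandlimit_kernel(kernel, keep_fraction=0.5):
    """Zero out spectral content above `keep_fraction` of the Nyquist frequency.

    `kernel` is the real-valued impulse response of one discretized SSM channel;
    `keep_fraction` is a hypothetical hyperparameter used here only to
    illustrate frequency-selective masking.
    """
    spectrum = np.fft.rfft(kernel)
    cutoff = int(keep_fraction * len(spectrum))
    spectrum[cutoff:] = 0.0                      # mask the high-frequency bins
    return np.fft.irfft(spectrum, n=len(kernel))

# Example: bandlimit a decaying kernel that oscillates near the Nyquist rate.
t = np.arange(256)
kernel = np.exp(-0.02 * t) * np.cos(0.9 * np.pi * t)
masked_kernel = bandlimit_kernel(kernel, keep_fraction=0.5)
```

In practice the cutoff would be tied to the ratio between training and inference frequencies; the fixed value used here is purely for demonstration.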