State Space Models for Event Cameras

2024 | Nikola Zubić, Mathias Gehrig, Davide Scaramuzza
This paper introduces state-space models (SSMs) for event cameras, addressing two challenges: training efficiency and generalization across varying temporal frequencies. The proposed SSM-ViT architecture integrates SSMs with Vision Transformers (ViTs) for efficient and accurate event-based vision. SSMs such as S4, S4D, and S5 model the temporal dynamics, and the key innovation is their learnable timescale parameter, which lets a trained model adapt to different inference frequencies without retraining. The SSM-ViT block processes event data through a hierarchical backbone, incorporating SSMs to capture temporal information while maintaining computational efficiency. Two strategies are introduced to mitigate the aliasing effects that arise at higher inference frequencies: frequency-selective masking and H2-norm regularization.

Experiments show that the SSM-based models train 33% faster and suffer a much smaller performance drop (3.76 mAP) when evaluated at frequencies higher than the training frequency, compared to existing methods such as RVT and GET. Evaluations on the Gen1 and 1 Mpx event camera datasets demonstrate superior performance and generalization, and ablation studies confirm both the contribution of the SSM layers to detection performance and the effectiveness of the two anti-aliasing strategies. Overall, SSM-ViT outperforms prior methods in training efficiency and in generalization across inference frequencies, suggesting that SSMs are a promising direction for event-based vision in high-speed, dynamic environments.
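The frequency adaptation hinges on how the continuous-time SSM is discretized, which the summary only names. As a minimal sketch (plain NumPy, with shapes, initializations, and sampling rates chosen for illustration rather than taken from the paper), the diagonal S4D/S5-style layer below is discretized with a zero-order hold using a learnable timescale Delta; running the same trained weights at a higher inference frequency then amounts to shrinking Delta by the ratio of the training and inference rates.

```python
import numpy as np

# Minimal sketch of a diagonal (S4D/S5-style) linear state-space layer.
# Shapes, initializations, and rates are illustrative, not the authors' code.
class DiagonalSSM:
    def __init__(self, state_dim, dt):
        # Continuous-time diagonal dynamics: x'(t) = A x(t) + B u(t), y(t) = C x(t)
        rng = np.random.default_rng(0)
        self.A = -np.abs(rng.standard_normal(state_dim))  # stable real poles
        self.B = rng.standard_normal(state_dim)
        self.C = rng.standard_normal(state_dim)
        self.dt = dt                                      # learnable timescale (Delta)

    def discretize(self, dt):
        # Zero-order-hold discretization, elementwise because A is diagonal:
        #   A_bar = exp(dt * A),   B_bar = (A_bar - 1) / A * B
        A_bar = np.exp(dt * self.A)
        B_bar = (A_bar - 1.0) / self.A * self.B
        return A_bar, B_bar

    def forward(self, u, dt=None):
        # Run the linear recurrence over a 1-D input sequence u.
        A_bar, B_bar = self.discretize(self.dt if dt is None else dt)
        x = np.zeros_like(self.A)
        outputs = []
        for u_k in u:
            x = A_bar * x + B_bar * u_k
            outputs.append(np.dot(self.C, x))
        return np.array(outputs)

# Train at 20 Hz, then run inference at 60 Hz by rescaling the timescale only.
ssm = DiagonalSSM(state_dim=16, dt=1.0 / 20)
signal_20hz = np.sin(np.linspace(0.0, 10.0, 200))   # 10 s sampled at 20 Hz
signal_60hz = np.sin(np.linspace(0.0, 10.0, 600))   # same signal at 60 Hz
y_train_rate = ssm.forward(signal_20hz)
y_fast_rate = ssm.forward(signal_60hz, dt=ssm.dt * (20 / 60))
```

Because the timescale is the only quantity tied to the sampling rate, the continuous-time parameters A, B, and C stay fixed, which is why no retraining is needed when the inference frequency changes.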
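The summary mentions frequency-selective masking without detail. As a rough illustration of the general idea only (the cutoff, the kernel, and the helper function below are assumptions for demonstration, not the authors' exact procedure), one way to realize such masking is to bandlimit an SSM's impulse-response kernel by zeroing FFT bins above a chosen fraction of the Nyquist frequency, so that spectral content the model never saw at the training rate cannot alias when inference runs faster.

```python
import numpy as np

def bandlimit_kernel(kernel, keep_fraction=0.5):
    """Zero out spectral content above `keep_fraction` of the Nyquist frequency.

    `kernel` is the real-valued impulse response of one discretized SSM channel;
    `keep_fraction` is a hypothetical hyperparameter used here only to
    illustrate frequency-selective masking.
    """
    spectrum = np.fft.rfft(kernel)
    cutoff = int(keep_fraction * len(spectrum))
    spectrum[cutoff:] = 0.0                      # mask the high-frequency bins
    return np.fft.irfft(spectrum, n=len(kernel))

# Example: bandlimit a decaying kernel that oscillates near the Nyquist rate.
t = np.arange(256)
kernel = np.exp(-0.02 * t) * np.cos(0.9 * np.pi * t)
masked_kernel = bandlimit_kernel(kernel, keep_fraction=0.5)
```

In practice the cutoff would be tied to the ratio between training and inference frequencies; the fixed value used here is purely for demonstration.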