Spiking-PhysFormer: Camera-Based Remote Photoplethysmography with Parallel Spike-driven Transformer

February 12, 2024 | Mingxuan Liu, Jiankai Tang, Haoxiang Li, Jiahao Qi, Siwei Li, Kegang Wang, Yuntao Wang, Hong Chen
This paper introduces Spiking-PhysFormer, a hybrid neural network (HNN) that combines spiking neural networks (SNNs) with a transformer architecture to provide efficient global spatio-temporal attention for camera-based remote photoplethysmography (rPPG). The model consists of an ANN-based patch embedding (PE) block, SNN-based transformer blocks, and an ANN-based predictor head: the PE block and predictor head use artificial neural networks (ANNs), while the transformer blocks are implemented with SNNs. Within the transformer blocks, a parallel spike-driven transformer combines temporal difference convolution (TDC) with spike-driven self-attention (SDSA), and a simplified spiking self-attention (S3A) mechanism omits the value parameter to reduce computational complexity. Experiments on four datasets (PURE, UBFC-rPPG, UBFC-Phys, and MMPD) show that Spiking-PhysFormer reduces power consumption by 12.4% compared with PhysFormer, with the transformer block requiring 12.2 times less computational energy, while maintaining performance on par with PhysFormer and other ANN-based models.
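To make the S3A idea concrete, below is a minimal, hypothetical PyTorch sketch of a query/key-only spiking attention block. It is not the paper's implementation: the spike neuron is a simple hard-threshold stand-in for the LIF neurons typically used in spiking transformers, and the exact way the key spikes are aggregated may differ from the paper's formulation.

```python
# Hedged sketch of simplified spiking self-attention (S3A): no value projection,
# binary spikes for queries and keys, and a column-summed key acting as the
# attention signal. All names below are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a rectangular surrogate gradient (illustrative)."""

    @staticmethod
    def forward(ctx, x, threshold=1.0):
        ctx.save_for_backward(x)
        ctx.threshold = threshold
        return (x >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        surrogate = ((x - ctx.threshold).abs() < 0.5).float()
        return grad_output * surrogate, None


class S3A(nn.Module):
    """Query/key-only spiking attention: value parameters are omitted,
    so the mixing step reduces to sparse, addition-friendly operations."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                           # x: (batch, tokens, dim)
        q = SpikeFn.apply(self.q_proj(x))           # binary query spikes
        k = SpikeFn.apply(self.k_proj(x))           # binary key spikes
        attn = k.sum(dim=1, keepdim=True)           # sum keys over tokens: (batch, 1, dim)
        out = q * SpikeFn.apply(attn)               # gate queries with spiking key summary
        return self.out_proj(out)


if __name__ == "__main__":
    tokens = torch.randn(2, 16, 64)                 # e.g. 16 spatio-temporal patch tokens
    print(S3A(64)(tokens).shape)                    # torch.Size([2, 16, 64])
```

Dropping the value projection removes one full matrix multiply per attention layer, which is where much of the claimed reduction in per-block computation would come from under this reading.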
Spiking-PhysFormer is the first HNN-based model for camera-based rPPG to be comprehensively evaluated on multiple public datasets. Its spatio-temporal attention map shows that the model attends to facial regions in the spatial dimension and to pulse-wave peaks in the temporal dimension, confirming that it captures long-range spatio-temporal rPPG features from facial videos. The energy-efficiency claim is further supported by an analysis of this attention map based on spike firing rate (SFR). Cross-dataset testing demonstrates adaptability to videos with diverse facial features, backgrounds, and illumination, and the overall results show that the model balances efficiency and accuracy, outperforming existing models on the PURE and MMPD datasets.
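For readers wanting to see how SFR relates to the energy comparison, the sketch below follows the convention commonly used in SNN papers: ANN layers are costed as multiply-accumulate (MAC) operations, spiking layers as accumulate (AC) operations scaled by the measured firing rate and number of timesteps. The per-operation energies and the example FLOP/SFR numbers are assumptions for illustration, not values reported in the paper.

```python
# Hedged sketch of SFR-based energy estimation for ANN vs. SNN blocks.
E_MAC_PJ = 4.6   # energy per multiply-accumulate, picojoules (commonly cited 45 nm estimate; assumed)
E_AC_PJ = 0.9    # energy per accumulate, picojoules (assumed)


def ann_energy_pj(flops: float) -> float:
    """Energy of a conventional (ANN) block from its FLOP count."""
    return flops * E_MAC_PJ


def snn_energy_pj(flops: float, firing_rate: float, timesteps: int) -> float:
    """Energy of a spiking block: synaptic ops = FLOPs * SFR * timesteps."""
    return flops * firing_rate * timesteps * E_AC_PJ


if __name__ == "__main__":
    block_flops = 1.2e9  # hypothetical FLOPs for one transformer block
    print(f"ANN block: {ann_energy_pj(block_flops) / 1e9:.2f} mJ")
    print(f"SNN block: {snn_energy_pj(block_flops, firing_rate=0.15, timesteps=4) / 1e9:.2f} mJ")
```

Under this accounting, a low firing rate directly translates into fewer synaptic operations, which is the mechanism behind the kind of per-block energy gap the paper reports.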