Event-Based Eye Tracking. AIS 2024 Challenge Survey

17 Apr 2024 | Zuowen Wang, Chang Gao, Zongwei Wu, Marcos V. Conde, Radu Timofte, Shih-Chii Liu, Qinyu Chen, Zheng-jun Zha, Wei Zhai, Han Han, Bohao Liao, Yuliang Wu, Zengyu Wan, Zhong Wang, Yang Cao, Ganchao Tan, Jinze Chen, Yan Ru Pei, Sasskia Brüers, Sébastien Crouzet, Douglas McLelland, Oliver Coenen, Baocheng Zhang, Yizhao Gao, Jingyuan Li, Hayden Kwok-Hay So, Philippe Bich, Chiara Boretti, Luciano Prono, Mircea Lică, David Dinucu-Jianu, Cătălin Griu, Xiaopeng Lin, Hongwei Ren, Bojun Cheng, Xinan Zhang, Valentin Vial, Anthony Yeazzi, James Tsai
This survey reviews the AIS 2024 Event-Based Eye Tracking (EET) Challenge, which focuses on processing eye movements recorded with event cameras and predicting the pupil center. The challenge emphasizes efficient eye tracking, seeking a good balance between accuracy and computational cost. During the challenge, 38 participants registered for the Kaggle competition, and 8 teams submitted detailed fact sheets. The survey reviews and analyzes the novel and diverse methods submitted in order to advance future event-based eye tracking research.

The development of augmented reality (AR) and virtual reality (VR) technologies has increased the demand for precise and efficient eye-tracking systems. Eye tracking and related tasks also have significant potential in wearable healthcare technology, offering new approaches for diagnosing and monitoring conditions such as Parkinson’s and Alzheimer’s disease through eye movement patterns. Event cameras, also known as Dynamic Vision Sensors (DVS), provide a unique sensory modality for eye-tracking applications on mobile devices. Unlike traditional cameras, event cameras asynchronously record intensity changes that exceed a threshold, producing sparse spatiotemporal event streams. This sparsity can significantly reduce computation and energy demands, making event cameras well suited to mobile platforms.

The challenge explores algorithms for event-based eye tracking on the 3ET+ dataset, which contains real events recorded with a DVXplorer Mini event camera. The dataset includes 13 subjects performing five classes of activities: random movements, saccades, reading text, smooth pursuit, and blinks. The primary evaluation metric is p-accuracy, the fraction of predictions whose estimated pupil center lies within a tolerance of p = 10 pixels of the ground truth. The challenge was divided into three phases: preparation, the Kaggle competition, and submission of fact sheets.
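As a minimal illustration of the tolerance-based metric, the sketch below computes the fraction of frames whose predicted pupil center falls within p pixels of the ground truth, assuming predictions and labels are given as (x, y) pixel coordinates; the official evaluation script may handle additional details (e.g., label resolution or excluded frames) differently.

```python
import numpy as np

def p_accuracy(pred, target, tolerance=10):
    """Fraction of samples whose predicted pupil center lies within
    `tolerance` pixels (Euclidean distance) of the ground truth.

    pred, target: arrays of shape (N, 2) holding (x, y) pixel coordinates.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    dist = np.linalg.norm(pred - target, axis=1)  # per-sample Euclidean error
    return float(np.mean(dist <= tolerance))

# Example: three predictions, two of which fall inside the 10-pixel tolerance.
pred = [[30.0, 40.0], [100.0, 120.0], [55.0, 60.0]]
target = [[32.0, 43.0], [150.0, 120.0], [50.0, 58.0]]
print(p_accuracy(pred, target, tolerance=10))  # -> 0.666...
```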
The survey reviews the methods proposed by the participating teams, highlighting stateful models, spatial-temporal processing, computation and parameter efficiency, and event representations. Teams used various architectures, including GRU, ConvLSTM, BiLSTM, and Mamba, to handle the event data. Computation and parameter efficiency were emphasized throughout, with some teams implementing sparse convolution and temporal causal layers for efficient inference.

The best challenge solutions are described in detail, including the MambaPupil method by USTCEventGroup, the CETM by FreeEvs, the lightweight spatio-temporal network by bigBrains, the FPGA-based system by Go Sparse, the memory-channel-based approach by MeMo, the Efficient Recurrent Vision Transformer (ERVT) by team ERVT, and the efficient point-based eye tracking method by EFFICIENT, covering each team's methodology, implementation details, and results.

The survey concludes with insights from the challenge, emphasizing the emerging nature of event-based visual processing, the importance of hardware considerations, and the feasibility of using event cameras for eye-tracking tasks. It also highlights the need for prototyping and for more realistic settings to advance event-based eye-tracking systems.
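To make the notion of an event representation discussed above concrete, the sketch below accumulates a raw (t, x, y, polarity) event stream into per-polarity count frames, one common dense representation for CNN/RNN pipelines. This is an assumed, simplified variant for illustration only and is not tied to any particular team's pipeline.

```python
import numpy as np

def events_to_frames(events, num_bins, height, width):
    """Accumulate a raw event stream into `num_bins` two-channel count frames
    (one channel per polarity).

    events: array of shape (N, 4) with columns (t, x, y, p), p in {0, 1}.
    Returns an array of shape (num_bins, 2, height, width).
    """
    events = np.asarray(events, dtype=np.float64)
    frames = np.zeros((num_bins, 2, height, width), dtype=np.float32)
    if len(events) == 0:
        return frames
    t, x, y, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3]
    # Map each timestamp to a temporal bin index over the stream's duration.
    t0, t1 = t.min(), t.max()
    bins = np.clip(((t - t0) / max(t1 - t0, 1e-9) * num_bins).astype(int),
                   0, num_bins - 1)
    # Scatter-add each event into its (bin, polarity, y, x) cell.
    np.add.at(frames, (bins, p.astype(int), y.astype(int), x.astype(int)), 1.0)
    return frames

# Toy stream: four events (t, x, y, polarity) on a 4x4 sensor, split into 2 bins.
ev = [[0.00, 1, 1, 1], [0.01, 2, 1, 0], [0.05, 3, 2, 1], [0.09, 0, 3, 1]]
print(events_to_frames(ev, num_bins=2, height=4, width=4).sum(axis=(1, 2, 3)))  # [2. 2.]
```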
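Likewise, the stateful models highlighted above can be sketched generically: a small CNN encodes each binned event frame, a GRU carries temporal state across frames, and a linear head regresses the pupil center. This PyTorch sketch is not any team's actual architecture; the layer sizes and input resolution are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class TinyEventTracker(nn.Module):
    """Toy stateful tracker: CNN per event frame + GRU over time + (x, y) head."""

    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 8, 3, stride=2, padding=1), nn.ReLU(),   # 2 polarity channels
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),                              # -> (16, 4, 4)
            nn.Flatten(),                                         # -> 256 features
        )
        self.rnn = nn.GRU(input_size=256, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, frames, state=None):
        # frames: (batch, time, 2, H, W) binned event count frames
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, state = self.rnn(feats, state)   # carried state allows step-by-step inference
        return self.head(out), state          # (batch, time, 2) pupil-center estimates

# Smoke test with random "event frames" at 64x64 resolution.
model = TinyEventTracker()
coords, h = model(torch.rand(1, 8, 2, 64, 64))
print(coords.shape)  # torch.Size([1, 8, 2])
```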