8 Mar 2024 | Pengwei Yin*, Guanzhong Zeng*, Jingjing Wang, Di Xie
The paper "CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model" addresses the challenge of domain generalization in gaze estimation, which is often hindered by the domain gap between training and testing data. The authors propose a novel framework called CLIP-Gaze that leverages a pre-trained vision-language model (VLM) to enhance the generalization capability of gaze estimation models. Specifically, CLIP-Gaze extracts gaze-relevant features by separating them from gaze-irrelevant features, which are constructed using language descriptions. The framework introduces a personalized context optimization method for text prompt tuning and a feature rank loss to refine the distribution of gaze-relevant features. Extensive experiments demonstrate that CLIP-Gaze outperforms existing methods on four cross-domain evaluations, showcasing its effectiveness in handling diverse data types and improving the robustness of gaze estimation models.The paper "CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model" addresses the challenge of domain generalization in gaze estimation, which is often hindered by the domain gap between training and testing data. The authors propose a novel framework called CLIP-Gaze that leverages a pre-trained vision-language model (VLM) to enhance the generalization capability of gaze estimation models. Specifically, CLIP-Gaze extracts gaze-relevant features by separating them from gaze-irrelevant features, which are constructed using language descriptions. The framework introduces a personalized context optimization method for text prompt tuning and a feature rank loss to refine the distribution of gaze-relevant features. Extensive experiments demonstrate that CLIP-Gaze outperforms existing methods on four cross-domain evaluations, showcasing its effectiveness in handling diverse data types and improving the robustness of gaze estimation models.