CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model

8 Mar 2024 | Pengwei Yin*, Guanzhong Zeng*, Jingjing Wang, Di Xie
CLIP-Gaze is a framework for general gaze estimation that leverages a pre-trained vision-language model (CLIP) to improve the generalization ability of gaze estimation models. It takes a visual-linguistic cross-modality approach to the domain gap between training and testing data, a major challenge in gaze estimation: gaze-irrelevant features are constructed from diverse language descriptions, and gaze-relevant features are pushed away from them in the feature space. To avoid manual prompt engineering and to better adapt CLIP to the gaze estimation task, the framework introduces a personalized context optimization method for text prompt tuning, and a feature rank loss further refines the distribution of gaze-relevant features. According to the authors, this is the first use of a vision-language model for gaze estimation, and it handles a wide range of gaze-irrelevant factors.

Extensive experiments on four cross-domain evaluations (ETH-XGaze to MPIIFaceGaze, ETH-XGaze to EyeDiap, Gaze360 to MPIIFaceGaze, and Gaze360 to EyeDiap), measured by angular error, show a clear improvement over the baseline model and over state-of-the-art domain generalization approaches: CLIP-Gaze reaches the best reported performance on three of the four tasks and performs on par with the best method on the fourth. It is also competitive in unsupervised domain adaptation, surpassing other methods in some cases. An ablation study confirms the contribution of the proposed features and loss functions, and visualizations of the extracted features further support the design.

Overall, CLIP-Gaze offers a novel and effective approach to gaze estimation that exploits the power of vision-language models to handle diverse target domains.
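To make the cross-modal idea above concrete, here is a minimal PyTorch sketch, not the authors' implementation. It assumes gaze-irrelevant text embeddings have already been obtained from CLIP's text encoder for a set of descriptions of gaze-irrelevant factors; the function name `push_away_loss` and the simple cosine-similarity penalty are illustrative choices only.

```python
import torch
import torch.nn.functional as F

def push_away_loss(gaze_features: torch.Tensor,
                   irrelevant_text_features: torch.Tensor) -> torch.Tensor:
    """Penalize similarity between gaze-relevant image features and
    gaze-irrelevant text embeddings (a sketch, not the paper's exact loss).

    gaze_features:            (B, D) features from the gaze backbone
    irrelevant_text_features: (K, D) CLIP text embeddings of K descriptions
                              of gaze-irrelevant factors
    """
    # Work on the unit hypersphere, as CLIP features usually are.
    g = F.normalize(gaze_features, dim=-1)             # (B, D)
    t = F.normalize(irrelevant_text_features, dim=-1)  # (K, D)

    # Cosine similarity between every gaze feature and every irrelevant
    # text feature; driving these toward zero separates the two sets.
    sim = g @ t.T                                       # (B, K)
    return sim.abs().mean()

# Toy usage with random stand-ins for real CLIP embeddings (D = 512).
if __name__ == "__main__":
    gaze_feat = torch.randn(8, 512)    # from the gaze image encoder
    irrelevant = torch.randn(20, 512)  # from CLIP's text encoder
    print(push_away_loss(gaze_feat, irrelevant))
```

In the full method, a term of this kind would be combined with the usual gaze regression loss, the personalized prompt tuning, and the proposed feature rank loss, whose exact formulations are not reproduced here.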
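For reference, the angular error used in these cross-domain evaluations is the angle between the predicted and ground-truth 3D gaze directions. The sketch below converts (pitch, yaw) predictions to unit vectors under one common convention (axis order and signs vary between dataset toolkits) and is provided only as an illustration of the metric.

```python
import torch

def pitchyaw_to_vector(pitchyaw: torch.Tensor) -> torch.Tensor:
    """Convert (pitch, yaw) in radians, shape (N, 2), to unit 3D gaze vectors."""
    pitch, yaw = pitchyaw[:, 0], pitchyaw[:, 1]
    x = -torch.cos(pitch) * torch.sin(yaw)
    y = -torch.sin(pitch)
    z = -torch.cos(pitch) * torch.cos(yaw)
    return torch.stack([x, y, z], dim=-1)

def angular_error_deg(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean angle (degrees) between predicted and ground-truth gaze directions."""
    a = pitchyaw_to_vector(pred)
    b = pitchyaw_to_vector(gt)
    cos = (a * b).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean()

# Example: a pure 3-degree yaw error yields roughly 3 degrees of angular error.
pred = torch.tensor([[0.0, 0.0]])
gt = torch.deg2rad(torch.tensor([[0.0, 3.0]]))
print(angular_error_deg(pred, gt))  # ~3.0
```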