Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

2024-04-12 | Zhiwei Yang, Jing Liu*, Peng Wu
This paper proposes Text Prompt with Normality Guidance (TPWNG), a novel framework for weakly supervised video anomaly detection (WSVAD). TPWNG leverages the contrastive language-image pre-training (CLIP) model to generate frame-level pseudo-labels by aligning textual descriptions of video events with the corresponding video frames. CLIP is fine-tuned with a ranking loss and a distributional inconsistency loss to adapt it to the WSVAD task, and a learnable text prompt mechanism together with a normality visual prompt improves the alignment accuracy between event descriptions and frames. A pseudo-label generation module based on normality guidance then infers reliable frame-level pseudo-labels, while a temporal context self-adaptive learning module captures the temporal dependencies of different video events more flexibly and accurately. By combining visual and textual modalities for pseudo-label generation with adaptive temporal modeling, the method outperforms existing approaches. Extensive experiments on two benchmark datasets, UCF-Crime and XD-Violence, show that TPWNG achieves state-of-the-art performance, demonstrating its effectiveness for detecting anomalies in videos under weak supervision.
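The core alignment idea described above — scoring each frame against textual descriptions of normal and anomalous events to derive frame-level pseudo-labels — can be illustrated with a minimal sketch. This is not the paper's actual TPWNG implementation; it assumes precomputed CLIP-style embeddings (the `frame_emb`, `normal_text_emb`, and `anomaly_text_emb` arrays are hypothetical inputs) and uses a simple two-way softmax over cosine similarities:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def pseudo_labels(frame_emb, normal_text_emb, anomaly_text_emb, temperature=0.07):
    """Per-frame anomaly pseudo-labels from text-frame alignment (illustrative only).

    frame_emb:        (N, D) frame embeddings
    normal_text_emb:  (D,)   embedding of a normal-event description
    anomaly_text_emb: (D,)   embedding of an anomalous-event description
    Returns an (N,) array of anomaly probabilities in [0, 1].
    """
    # Similarity of every frame to the normal and anomaly descriptions.
    text_emb = np.stack([normal_text_emb, anomaly_text_emb])
    sims = cosine_sim(frame_emb, text_emb)          # (N, 2)
    # Softmax over the two descriptions gives a per-frame anomaly probability.
    logits = sims / temperature
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs[:, 1]  # probability that each frame matches the anomaly text
```

Frames whose embeddings sit closer to the anomaly description receive scores near 1, which is the intuition behind using normality (alignment with normal-event text) as guidance for filtering out unreliable pseudo-labels.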