The paper presents Text Prompt with Normality Guidance (TPWNG), a novel framework for weakly supervised video anomaly detection (WSVAD). The central challenge in WSVAD is generating fine-grained (frame-level) pseudo-labels from video-level weak labels, which is typically addressed by self-training a classifier. However, existing methods rely almost exclusively on the RGB visual modality and neglect category text information, leading to less accurate pseudo-labels and suboptimal detection performance.
To address this, TPWNG leverages the rich vision-language knowledge of the pre-trained CLIP model to align textual descriptions of video events with the corresponding video frames, yielding more accurate pseudo-labels. The key contributions of TPWNG include:
1. **Text Prompt Mechanism**: A learnable text prompt mechanism improves the alignment accuracy between video event descriptions and video frames (see the sketch after this list).
2. **Normality Visual Prompt (NVP)**: An NVP mechanism reduces the interference of normal frames within anomalous videos, enhancing the accuracy of pseudo-labels.
3. **Pseudo-Label Generation (PLG) Module**: The PLG module infers frame-level pseudo-labels from the matching similarities between video event descriptions and video frames, with the NVP supplying normality guidance.
4. **Temporal Context Self-Adaptive Learning (TCSAL)**: A TCSAL module learns the temporal dependencies of different video events more flexibly and accurately.
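The summary does not give the paper's exact formulations for the text prompt or the PLG module, so the following is a minimal PyTorch sketch under two stated assumptions: the text prompt is a CoOp-style learnable context prepended to frozen class-name token embeddings, and the PLG scores each frame by a softmax over its CLIP similarities to the anomaly description and a "normal" description. All names, shapes, and the temperature value are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePrompt(nn.Module):
    """Hypothetical CoOp-style prompt: learnable context vectors shared across
    classes, prepended to the frozen token embeddings of each category name."""
    def __init__(self, n_ctx: int = 8, dim: int = 512):
        super().__init__()
        # Learnable context tokens, small random init.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_tok: torch.Tensor) -> torch.Tensor:
        # class_tok: (n_classes, n_tok, dim) frozen class-name token embeddings.
        ctx = self.ctx.unsqueeze(0).expand(class_tok.size(0), -1, -1)
        # The concatenated sequence would be fed to the frozen CLIP text encoder.
        return torch.cat([ctx, class_tok], dim=1)

def frame_pseudo_labels(frame_emb: torch.Tensor,
                        anom_text_emb: torch.Tensor,
                        normal_text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Toy PLG: infer soft frame-level pseudo-labels from frame-text similarities.

    frame_emb:       (T, dim) frame features from the frozen CLIP image encoder
    anom_text_emb:   (dim,)   embedding of the video-level anomaly description
    normal_text_emb: (dim,)   embedding of the "normal" description (NVP-guided)
    """
    f = F.normalize(frame_emb, dim=-1)
    a = F.normalize(anom_text_emb, dim=-1)
    n = F.normalize(normal_text_emb, dim=-1)
    sim_anom = f @ a  # (T,) cosine similarity to the anomaly description
    sim_norm = f @ n  # (T,) cosine similarity to the normality description
    # Assumed form of normality guidance: frames that match the normal
    # description better than the anomaly one are pushed toward label 0.
    logits = torch.stack([sim_norm, sim_anom], dim=-1) / temperature
    return logits.softmax(dim=-1)[:, 1]  # (T,) soft frame-level pseudo-labels

# Usage with random stand-ins for CLIP features:
T, dim = 32, 512
labels = frame_pseudo_labels(torch.randn(T, dim), torch.randn(dim), torch.randn(dim))
```

In TPWNG the frame and text embeddings would come from the frozen CLIP encoders; random tensors stand in here only to keep the sketch self-contained and runnable.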
The method is evaluated on two benchmark datasets, UCF-Crime and XD-Violence, where extensive experiments show that TPWNG achieves state-of-the-art performance, validating the effectiveness of the proposed framework.