The paper presents Text Prompt with Normality Guidance (TPWNG), a novel framework for weakly supervised video anomaly detection (WSVAD). The central challenge in WSVAD is generating fine-grained (frame-level) pseudo-labels from video-level weak labels, which is typically addressed by self-training a classifier. However, existing methods rely almost exclusively on the RGB visual modality and neglect category text information, leading to less accurate pseudo-labels and suboptimal detection performance.
To address this, TPWNG leverages the rich vision-language knowledge of the pre-trained CLIP model to align textual descriptions of video events with the corresponding video frames, yielding more accurate pseudo-labels. The key contributions of TPWNG include:
1. **Text Prompt Mechanism**: A learnable text prompt mechanism improves the alignment accuracy between video event descriptions and video frames (see the sketch after this list).
2. **Normality Visual Prompt (NVP)**: An NVP mechanism reduces the interference of normal frames within anomalous videos, enhancing the accuracy of pseudo-labels.
3. **Pseudo-Label Generation (PLG) Module**: The PLG module infers frame-level pseudo-labels from the matching similarities between video event descriptions and video frames, with the NVP supplying normality guidance.
4. **Temporal Context Self-Adaptive Learning (TCSAL)**: A TCSAL module learns the temporal dependencies of different video events more flexibly and accurately.
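The summary does not give the paper's exact formulations for the text prompt or the PLG module, so the following is a minimal PyTorch sketch under two stated assumptions: the text prompt is a CoOp-style learnable context prepended to frozen class-name token embeddings, and the PLG scores each frame by a softmax over its CLIP similarities to the anomaly description and a "normal" description. All names, shapes, and the temperature value are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePrompt(nn.Module):
    """Hypothetical CoOp-style prompt: learnable context vectors shared across
    classes, prepended to the frozen token embeddings of each category name."""
    def __init__(self, n_ctx: int = 8, dim: int = 512):
        super().__init__()
        # Learnable context tokens, small random init.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_tok: torch.Tensor) -> torch.Tensor:
        # class_tok: (n_classes, n_tok, dim) frozen class-name token embeddings.
        ctx = self.ctx.unsqueeze(0).expand(class_tok.size(0), -1, -1)
        # The concatenated sequence would be fed to the frozen CLIP text encoder.
        return torch.cat([ctx, class_tok], dim=1)

def frame_pseudo_labels(frame_emb: torch.Tensor,
                        anom_text_emb: torch.Tensor,
                        normal_text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Toy PLG: infer soft frame-level pseudo-labels from frame-text similarities.

    frame_emb:       (T, dim) frame features from the frozen CLIP image encoder
    anom_text_emb:   (dim,)   embedding of the video-level anomaly description
    normal_text_emb: (dim,)   embedding of the "normal" description (NVP-guided)
    """
    f = F.normalize(frame_emb, dim=-1)
    a = F.normalize(anom_text_emb, dim=-1)
    n = F.normalize(normal_text_emb, dim=-1)
    sim_anom = f @ a  # (T,) cosine similarity to the anomaly description
    sim_norm = f @ n  # (T,) cosine similarity to the normality description
    # Assumed form of normality guidance: frames that match the normal
    # description better than the anomaly one are pushed toward label 0.
    logits = torch.stack([sim_norm, sim_anom], dim=-1) / temperature
    return logits.softmax(dim=-1)[:, 1]  # (T,) soft frame-level pseudo-labels

# Usage with random stand-ins for CLIP features:
T, dim = 32, 512
labels = frame_pseudo_labels(torch.randn(T, dim), torch.randn(dim), torch.randn(dim))
```

In TPWNG the frame and text embeddings would come from the frozen CLIP encoders; random tensors stand in here only to keep the sketch self-contained and runnable.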
The method is evaluated on two benchmark datasets, UCF-Crime and XD-Violence, where extensive experiments show that TPWNG achieves state-of-the-art performance, validating the effectiveness of the proposed framework.