This paper proposes VAD-LLaMA, a novel video anomaly detection (VAD) approach that integrates video-based large language models (VLLMs) into the VAD framework, making the detector threshold-free and enabling it to explain the detected anomalies. The key contributions are: 1) VAD-LLaMA, a new approach that introduces VLLMs to the task of VAD; 2) a novel Long-Term Context (LTC) module that enhances the long-video representation ability of existing VLLMs; 3) a novel three-phase training method for VAD-LLaMA that addresses the scarcity of VAD data and instruction-tuning data.
The LTC module is designed to address the limitations of VLLMs in long-range context modeling. It collects and stacks clip-level features according to their anomaly scores, and integrates these features into the video representation through cross-attention and weighted-sum operations. The three-phase training method improves the efficiency of fine-tuning VLLMs by minimizing the requirement for VAD data and lowering the cost of annotating instruction-tuning data. The first phase trains a baseline video anomaly detector (VADor), the second phase co-trains the VADor with the LTC module, and the third phase fine-tunes Video-LLaMA using instruction-tuning data.
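To make the score-guided feature collection and cross-attention fusion more concrete, the following is a minimal PyTorch sketch of an LTC-style module. The class name, feature dimension, memory-bank size, selection rule, and fusion weights are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongTermContextSketch(nn.Module):
    """Hypothetical LTC-style module (sizes and names are assumptions).

    Keeps two small banks of clip features selected by anomaly score
    (lowest-score "normal" clips and highest-score "anomalous" clips),
    then fuses them into the current clip feature via cross-attention
    followed by a learnable weighted sum.
    """

    def __init__(self, dim: int = 512, bank_size: int = 8, num_heads: int = 4):
        super().__init__()
        self.bank_size = bank_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable fusion weights for [current clip, normal context, anomaly context].
        self.fusion = nn.Parameter(torch.ones(3) / 3)

    def forward(self, clip_feat, past_feats, past_scores):
        # clip_feat:   (B, D)    feature of the current clip
        # past_feats:  (B, T, D) features of clips seen so far in the video
        # past_scores: (B, T)    anomaly scores predicted for those clips
        B, T, D = past_feats.shape
        k = min(self.bank_size, T)

        # Collect the k lowest-score (normal) and k highest-score (anomalous) clips.
        lo_idx = past_scores.topk(k, dim=1, largest=False).indices
        hi_idx = past_scores.topk(k, dim=1, largest=True).indices
        normal_bank = torch.gather(past_feats, 1, lo_idx.unsqueeze(-1).expand(-1, -1, D))
        anomaly_bank = torch.gather(past_feats, 1, hi_idx.unsqueeze(-1).expand(-1, -1, D))

        # Cross-attention: the current clip queries each memory bank.
        q = clip_feat.unsqueeze(1)                            # (B, 1, D)
        normal_ctx, _ = self.attn(q, normal_bank, normal_bank)
        anomaly_ctx, _ = self.attn(q, anomaly_bank, anomaly_bank)

        # Weighted sum of the raw clip feature and the two context vectors.
        w = F.softmax(self.fusion, dim=0)
        fused = w[0] * q + w[1] * normal_ctx + w[2] * anomaly_ctx
        return fused.squeeze(1)                               # (B, D)
```

In this reading, the enriched clip feature would be what gets passed on to the anomaly scorer and, via the projection layer, to the language model; the exact interface is an assumption here.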
The proposed method achieves top performance on the two standard WS-VAD benchmarks, UCF-Crime and TAD, with AUC improvements of +3.86% and +4.96%, respectively, and can additionally provide textual explanations for the detected anomalies. The results demonstrate the effectiveness of the LTC module in enhancing long-range video representation and of the three-phase training method in improving the efficiency of training VLLMs for VAD. The method also performs well on anomaly localization and description, and supports multi-turn dialogue grounded in the video content.