This paper presents VAD-LLaMA, a novel approach that integrates video-based large language models (VLLMs) into the framework of Video Anomaly Detection (VAD). The goal is to free VAD models from manual threshold selection and to enable textual explanations for detected anomalies. The key contributions include:
1. **VAD-LLaMA**: A new approach that leverages VLLMs to enhance VAD performance.
2. **Long-Term Context (LTC) Module**: A module that improves the long-range context modeling ability of VLLMs.
3. **Three-Phase Training Method**: A training scheme that improves the efficiency of fine-tuning VLLMs by minimizing the amount of VAD data required and reducing the cost of annotating instruction-tuning data.
The LTC module addresses the challenge of long-range context modeling by collecting and integrating lists of normal and abnormal clip features, thereby enriching the video representation; a minimal sketch of this mechanism is given after the list below. The three-phase training method involves:
- **Phase 1**: Training a baseline video anomaly detector (VADor) on clip-level features.
- **Phase 2**: Co-training the VADor and the LTC module to incorporate long-term contextual information.
- **Phase 3**: Fine-tuning the projection layer of the VLLM on instruction-tuning data generated from anomaly scores.
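The paper's exact LTC architecture is not reproduced here; the following is a minimal PyTorch sketch of the general idea, assuming the normal and abnormal lists are the K historical clips with the lowest and highest anomaly scores, respectively, and that they are merged into the current clip feature via cross-attention. The class name `LTCModule` and all hyperparameters are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


class LTCModule(nn.Module):
    """Minimal sketch of a Long-Term Context (LTC) module (illustrative only).

    Maintains two feature lists -- the K historical clips judged most normal
    (lowest anomaly scores) and the K judged most abnormal (highest scores) --
    and fuses them into the current clip feature via cross-attention.
    """

    def __init__(self, dim: int, k: int = 8, num_heads: int = 4):
        super().__init__()
        self.k = k
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, clip_feats: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # clip_feats: (T, D) features of the T clips observed so far
        # scores:     (T,)   anomaly scores predicted by the VADor for those clips
        k = min(self.k, clip_feats.size(0))
        normal = clip_feats[scores.topk(k, largest=False).indices]   # most-normal clips
        abnormal = clip_feats[scores.topk(k, largest=True).indices]  # most-abnormal clips
        context = torch.cat([normal, abnormal], dim=0).unsqueeze(0)  # (1, 2k, D)

        query = clip_feats[-1:].unsqueeze(0)                     # current clip as query, (1, 1, D)
        attended, _ = self.cross_attn(query, context, context)   # attended long-term context
        # Fuse the attended long-term context with the current clip feature.
        return self.fuse(torch.cat([query, attended], dim=-1)).squeeze(0)  # (1, D)


# Example usage with random features for 32 clips of dimension 768.
ltc = LTCModule(dim=768)
feats, scores = torch.randn(32, 768), torch.rand(32)
enhanced = ltc(feats, scores)  # enhanced representation of the latest clip
```

The enhanced feature can then be passed to both the anomaly scorer and the VLLM projection layer, which is consistent with the co-training described in Phase 2, though the exact interfaces are assumptions here.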
The proposed method achieves top performance on the UCF-Crime and TAD benchmarks, with significant improvements in AUC. Additionally, it can provide textual explanations for detected anomalies, making it a comprehensive solution for VAD tasks.