Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM

29 Jun 2024 | Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, Nong Sang
Holmes-VAD is a novel framework for video anomaly detection (VAD) that addresses two shortcomings of existing methods: bias and lack of interpretability. It leverages precise temporal supervision and rich multimodal instructions to enable accurate anomaly localization and comprehensive explanations.

To support this, a large-scale multimodal VAD instruction-tuning benchmark, VAD-Instruct50k, is constructed; it includes single-frame annotations for untrimmed videos and instruction data for trimmed video clips. The dataset is created with a semi-automated labeling paradigm that combines efficient single-frame annotations with a robust video captioner and a large language model (LLM) to generate high-quality analyses of both abnormal and normal video clips.

Building on VAD-Instruct50k, a customized solution for interpretable video anomaly detection is developed. It comprises a lightweight temporal sampler that selects frames with high anomaly response and a fine-tuned multimodal LLM that generates explanatory content. Extensive experiments validate the generality and interpretability of Holmes-VAD, establishing it as a novel interpretable technique for real-world video anomaly analysis. The benchmark and model are publicly available at https://holmesvad.github.io/.

The framework targets two main challenges in VAD: a biased anomaly space and a lack of explainability. The biased anomaly space arises from the absence of reliable frame-level abnormal supervision, which leads to a prevalent bias toward unseen or easily confused normality. The lack of explainability is addressed by constructing a large amount of anomaly-aware instruction conversation data for fine-tuning multimodal LLMs. This data is generated with a semi-automated data engine consisting of data collection, annotation enhancement, and instruction construction.

Architecturally, Holmes-VAD combines a Video Encoder, a Temporal Sampler, and a Multi-modal LLM with tunable LoRA modules. The Video Encoder and Multi-modal LLM encode the input video and generate text responses to input text prompts, while the Temporal Sampler predicts frame-level abnormal scores and samples the high-response segments as input to the Multi-modal LLM. The framework is trained on VAD-Instruct50k, which provides single-frame annotations and explanatory text descriptions.

Extensive experiments demonstrate that Holmes-VAD achieves outstanding performance in video anomaly detection and provides detailed explanations for detected abnormal events, identifying anomalies and offering insightful explanations even in hour-long videos. The proposed method is thus a valuable tool for real-world applications, addressing the bias and lack of interpretability of existing anomaly detection methods.
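The inference flow described above (Video Encoder → Temporal Sampler → Multi-modal LLM with LoRA) can be pictured with a short sketch. The class below is only a schematic of that flow under assumed interfaces: the name HolmesVADSketch, the injected encoder/sampler/LLM modules, and the 0.8 score threshold are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class HolmesVADSketch(nn.Module):
    """Schematic of the described pipeline; module wiring and threshold are assumptions."""

    def __init__(self, encoder, sampler, mllm, threshold=0.8):
        super().__init__()
        self.encoder = encoder      # video encoder producing frame-level features
        self.sampler = sampler      # lightweight temporal sampler -> per-frame anomaly logits
        self.mllm = mllm            # multimodal LLM with tunable LoRA modules
        self.threshold = threshold  # illustrative cut-off for "high anomaly response"

    @torch.no_grad()
    def forward(self, frames, prompt):
        # frames: (T, C, H, W) untrimmed video; prompt: user question as text
        feats = self.encoder(frames)                       # (T, D) frame-level features
        scores = self.sampler(feats).sigmoid().flatten()   # (T,) anomaly scores in [0, 1]
        keep = scores > self.threshold                     # select high-response frames
        selected = feats[keep] if keep.any() else feats    # fall back to all frames if none pass
        # The selected visual tokens and the text prompt go to the multimodal LLM,
        # which returns a judgement plus an explanation of the detected event.
        explanation = self.mllm.generate(selected, prompt)
        return scores, explanation
```

The point of the sketch is the division of labor: the sampler provides dense frame-level scores for localization, and only the high-scoring segments reach the LLM, which keeps long (even hour-long) videos tractable for explanation.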
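The instruction-construction step of the semi-automated data engine, which pairs an LLM-generated analysis of each trimmed clip with a question to form conversation data, can be sketched similarly. The field names, the default question, and the example values below are assumptions for illustration, not the actual VAD-Instruct50k schema.

```python
def build_instruction_sample(clip_id, llm_analysis, question=None):
    """Package one trimmed clip's LLM-generated analysis into an
    instruction-tuning conversation. Field names are illustrative."""
    question = question or "Are there any abnormal events in the video? If so, describe them."
    return {
        "video": clip_id,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "assistant", "value": llm_analysis},  # explanatory answer from the LLM
        ],
    }


# Hypothetical example of one generated sample (values are made up for illustration).
sample = build_instruction_sample(
    clip_id="example_abnormal_clip_001.mp4",
    llm_analysis="Abnormal: the clip shows a physical altercation between two people near a doorway.",
)
print(sample["conversations"][1]["value"])
```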