Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM

29 Jun 2024 | Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, Nong Sang
Holmes-VAD is a novel framework for video anomaly detection (VAD) that addresses two shortcomings of existing methods: bias and lack of interpretability. It leverages precise temporal supervision and rich multimodal instructions to enable accurate anomaly localization and comprehensive explanations.

To support this, a large-scale multimodal VAD instruction-tuning benchmark, VAD-Instruct50k, is constructed; it includes single-frame annotations for untrimmed videos and instruction data for trimmed video clips. The dataset is created with a semi-automated labeling paradigm that combines efficient single-frame annotations with a robust video captioner and a large language model (LLM) to generate high-quality analyses of both abnormal and normal video clips.

Building on VAD-Instruct50k, a customized solution for interpretable video anomaly detection is developed. It comprises a lightweight temporal sampler that selects frames with high anomaly response and a fine-tuned multimodal LLM that generates explanatory content. Extensive experiments validate the generality and interpretability of Holmes-VAD, establishing it as a novel interpretable technique for real-world video anomaly analysis. The benchmark and model are publicly available at https://holmesvad.github.io/.

The framework targets two main challenges in VAD: a biased anomaly space and a lack of explainability. The biased anomaly space arises from the absence of reliable frame-level abnormal supervision, which leads to a prevalent bias toward unseen or easily confused normality. The lack of explainability is addressed by constructing a large amount of anomaly-aware instruction conversation data for fine-tuning multimodal LLMs. This data is generated with a semi-automated data engine consisting of data collection, annotation enhancement, and instruction construction.

Architecturally, Holmes-VAD combines a Video Encoder, a Temporal Sampler, and a Multi-modal LLM with tunable LoRA modules. The Video Encoder and Multi-modal LLM encode the input video and generate text responses to input text prompts, while the Temporal Sampler predicts frame-level abnormal scores and samples the high-response segments as input to the Multi-modal LLM. The framework is trained on VAD-Instruct50k, which provides single-frame annotations and explanatory text descriptions.

Extensive experiments demonstrate that Holmes-VAD achieves outstanding performance in video anomaly detection and provides detailed explanations for detected abnormal events, identifying anomalies and offering insightful explanations even in hour-long videos. The proposed method is thus a valuable tool for real-world applications, addressing the bias and lack of interpretability of existing anomaly detection methods.
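The inference flow described above (Video Encoder → Temporal Sampler → Multi-modal LLM with LoRA) can be pictured with a short sketch. The class below is only a schematic of that flow under assumed interfaces: the name HolmesVADSketch, the injected encoder/sampler/LLM modules, and the 0.8 score threshold are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class HolmesVADSketch(nn.Module):
    """Schematic of the described pipeline; module wiring and threshold are assumptions."""

    def __init__(self, encoder, sampler, mllm, threshold=0.8):
        super().__init__()
        self.encoder = encoder      # video encoder producing frame-level features
        self.sampler = sampler      # lightweight temporal sampler -> per-frame anomaly logits
        self.mllm = mllm            # multimodal LLM with tunable LoRA modules
        self.threshold = threshold  # illustrative cut-off for "high anomaly response"

    @torch.no_grad()
    def forward(self, frames, prompt):
        # frames: (T, C, H, W) untrimmed video; prompt: user question as text
        feats = self.encoder(frames)                       # (T, D) frame-level features
        scores = self.sampler(feats).sigmoid().flatten()   # (T,) anomaly scores in [0, 1]
        keep = scores > self.threshold                     # select high-response frames
        selected = feats[keep] if keep.any() else feats    # fall back to all frames if none pass
        # The selected visual tokens and the text prompt go to the multimodal LLM,
        # which returns a judgement plus an explanation of the detected event.
        explanation = self.mllm.generate(selected, prompt)
        return scores, explanation
```

The point of the sketch is the division of labor: the sampler provides dense frame-level scores for localization, and only the high-scoring segments reach the LLM, which keeps long (even hour-long) videos tractable for explanation.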
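The instruction-construction step of the semi-automated data engine, which pairs an LLM-generated analysis of each trimmed clip with a question to form conversation data, can be sketched similarly. The field names, the default question, and the example values below are assumptions for illustration, not the actual VAD-Instruct50k schema.

```python
def build_instruction_sample(clip_id, llm_analysis, question=None):
    """Package one trimmed clip's LLM-generated analysis into an
    instruction-tuning conversation. Field names are illustrative."""
    question = question or "Are there any abnormal events in the video? If so, describe them."
    return {
        "video": clip_id,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "assistant", "value": llm_analysis},  # explanatory answer from the LLM
        ],
    }


# Hypothetical example of one generated sample (values are made up for illustration).
sample = build_instruction_sample(
    clip_id="example_abnormal_clip_001.mp4",
    llm_analysis="Abnormal: the clip shows a physical altercation between two people near a doorway.",
)
print(sample["conversations"][1]["value"])
```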