Harnessing Large Language Models for Training-free Video Anomaly Detection

1 Apr 2024 | Luca Zanella¹, Willi Menapace¹, Massimiliano Mancini¹, Yiming Wang², Elisa Ricci¹,²
This paper introduces LAVAD, a training-free video anomaly detection (VAD) method that leverages pre-trained large language models (LLMs) and vision-language models (VLMs) to detect anomalies without any training or data collection. Unlike traditional VAD methods that rely on supervised or unsupervised learning, LAVAD takes a language-based approach: it generates textual descriptions of video frames and then employs an LLM to estimate anomaly scores from those descriptions. The method first generates a caption for each frame using a VLM, then cleans the captions by aligning them with the corresponding video frames via cross-modal similarity. An LLM then summarizes the captions within a temporal window around each frame and estimates an anomaly score from the summary. Finally, the scores are refined by aggregating them across semantically similar frames. Evaluated on two large benchmarks, UCF-Crime and XD-Violence, LAVAD outperforms both unsupervised and one-class VAD methods without any training. These results demonstrate the potential of pre-trained LLMs and VLMs for training-free VAD, offering a promising approach for real-world applications where data collection is challenging.
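To make the pipeline concrete, the sketch below walks through the four stages in Python. The `vlm` and `llm` objects and their methods (`caption`, `embed_image`, `embed_text`, `ask`), the prompts, the window size, and the neighbor count are all illustrative assumptions for this sketch, not the authors' released implementation.

```python
import numpy as np


def lavad_scores(frames, vlm, llm, window=10, top_k=5):
    """Score every frame of a video for anomaly, following the four
    LAVAD stages: caption, clean, summarize + score, refine."""
    # 1) Caption every frame with the VLM.
    captions = [vlm.caption(f) for f in frames]

    # 2) Clean captions: swap each frame's caption for the caption (from
    #    anywhere in the video) closest to that frame in the VLM's joint
    #    image-text embedding space.
    img_emb = np.stack([vlm.embed_image(f) for f in frames])   # (N, d)
    txt_emb = np.stack([vlm.embed_text(c) for c in captions])  # (N, d)
    sim = img_emb @ txt_emb.T  # cosine similarity if embeddings are L2-normalized
    cleaned = [captions[j] for j in sim.argmax(axis=1)]

    # 3) For each frame, summarize the captions in a temporal window
    #    around it, then ask the LLM for an anomaly score in [0, 1].
    scores = []
    for i in range(len(frames)):
        lo, hi = max(0, i - window), min(len(frames), i + window + 1)
        summary = llm.ask(
            "Summarize the activity in this scene:\n" + "\n".join(cleaned[lo:hi])
        )
        answer = llm.ask(
            "Rate how anomalous this scene is from 0 (normal) to 1 "
            "(anomalous). Reply with a single number.\n" + summary
        )
        scores.append(float(answer))

    # 4) Refine: average each frame's score over its top-k semantic
    #    neighbors (frames with the most similar image embeddings).
    scores = np.asarray(scores)
    neighbors = np.argsort(-(img_emb @ img_emb.T), axis=1)[:, :top_k]
    return scores[neighbors].mean(axis=1)
```

Note that the LLM only ever sees language, never pixels, so the caption-cleaning step is what keeps the textual descriptions faithful to the frames being scored.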