Harnessing Large Language Models for Training-free Video Anomaly Detection

1 Apr 2024 | Luca Zanella¹, Willi Menapace¹, Massimiliano Mancini¹, Yiming Wang², Elisa Ricci¹,²
This paper introduces LAVAD, a training-free video anomaly detection (VAD) method that leverages pre-trained large language models (LLMs) and vision-language models (VLMs) to detect anomalies without any training or data collection. Unlike traditional VAD methods that rely on supervised or unsupervised learning, LAVAD takes a language-based approach: it generates textual descriptions of video frames and then employs an LLM to estimate anomaly scores from those descriptions. The method first generates a caption for each frame using a VLM, then cleans the captions by aligning them with the corresponding video frames via cross-modal similarity. An LLM then summarizes the captions within a temporal window around each frame and estimates an anomaly score from the summary. Finally, the scores are refined by aggregating them across semantically similar frames. Evaluated on two large benchmarks, UCF-Crime and XD-Violence, LAVAD outperforms both unsupervised and one-class VAD methods without any training. These results demonstrate the potential of pre-trained LLMs and VLMs for training-free VAD, offering a promising approach for real-world applications where data collection is challenging.
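To make the pipeline concrete, the sketch below walks through the four stages in Python. The `vlm` and `llm` objects and their methods (`caption`, `embed_image`, `embed_text`, `ask`), the prompts, the window size, and the neighbor count are all illustrative assumptions for this sketch, not the authors' released implementation.

```python
import numpy as np


def lavad_scores(frames, vlm, llm, window=10, top_k=5):
    """Score every frame of a video for anomaly, following the four
    LAVAD stages: caption, clean, summarize + score, refine."""
    # 1) Caption every frame with the VLM.
    captions = [vlm.caption(f) for f in frames]

    # 2) Clean captions: swap each frame's caption for the caption (from
    #    anywhere in the video) closest to that frame in the VLM's joint
    #    image-text embedding space.
    img_emb = np.stack([vlm.embed_image(f) for f in frames])   # (N, d)
    txt_emb = np.stack([vlm.embed_text(c) for c in captions])  # (N, d)
    sim = img_emb @ txt_emb.T  # cosine similarity if embeddings are L2-normalized
    cleaned = [captions[j] for j in sim.argmax(axis=1)]

    # 3) For each frame, summarize the captions in a temporal window
    #    around it, then ask the LLM for an anomaly score in [0, 1].
    scores = []
    for i in range(len(frames)):
        lo, hi = max(0, i - window), min(len(frames), i + window + 1)
        summary = llm.ask(
            "Summarize the activity in this scene:\n" + "\n".join(cleaned[lo:hi])
        )
        answer = llm.ask(
            "Rate how anomalous this scene is from 0 (normal) to 1 "
            "(anomalous). Reply with a single number.\n" + summary
        )
        scores.append(float(answer))

    # 4) Refine: average each frame's score over its top-k semantic
    #    neighbors (frames with the most similar image embeddings).
    scores = np.asarray(scores)
    neighbors = np.argsort(-(img_emb @ img_emb.T), axis=1)[:, :top_k]
    return scores[neighbors].mean(axis=1)
```

Note that the LLM only ever sees language, never pixels, so the caption-cleaning step is what keeps the textual descriptions faithful to the frames being scored.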