Abductive Ego-View Accident Video Understanding for Safe Driving Perception

1 Mar 2024 | Jianwu Fang, Lei-lei Li, Junfei Zhou, Junbin Xiao, Hongkai Yu, Chen Lv, Jianru Xue, and Tat-Seng Chua
This paper introduces MM-AU, a large-scale multi-modal dataset for ego-view accident video understanding, containing 11,727 in-the-wild ego-view accident videos with temporally aligned text descriptions. The dataset includes over 2.23 million object boxes and 58,650 pairs of video-based accident reasons, covering 58 accident categories. It supports various accident understanding tasks, particularly multimodal video diffusion for understanding accident cause-effect chains for safe driving. The authors propose AdVersa-SD, an abductive accident video understanding framework for safe driving perception, which uses an Object-Centric Video Diffusion (OAVD) method driven by an abductive CLIP model. This model involves a contrastive interaction loss to learn the pair co-occurrence of normal, near-accident, and accident frames with corresponding text descriptions. OAVD enforces causal region learning while fixing the content of the original frame background in video generation, in order to find the dominant cause-effect chain for certain accidents. Extensive experiments verify the abductive ability of AdVersa-SD and the superiority of OAVD over state-of-the-art diffusion models. The dataset and code are released at www.lotvsmmau.net.
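The contrastive interaction loss mentioned above pairs video frames with their text descriptions. As a rough illustration of the general idea (not the paper's exact loss), the following is a minimal NumPy sketch of a CLIP-style symmetric contrastive objective over a batch of frame and text embeddings; all names and the temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(frame_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss (illustrative sketch).

    frame_emb, text_emb: (N, D) arrays where row i of each is a
    matched frame/text pair; all other rows act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    v = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature  # (N, N) similarity matrix
    labels = np.arange(len(v))        # matched pairs lie on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax per row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the frame->text and text->frame directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each frame embedding toward its own description while pushing it away from the other descriptions in the batch, which is the co-occurrence learning the abstract describes.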