This paper introduces MM-AU, a large-scale multi-modal dataset for ego-view accident video understanding, containing 11,727 in-the-wild ego-view accident videos with temporally aligned text descriptions. The dataset includes over 2.23 million object boxes and 58,650 pairs of video-based accident reasons, covering 58 accident categories. It supports various accident understanding tasks, in particular multimodal video diffusion for understanding accident cause-effect chains for safe driving. The authors propose AdVersa-SD, an abductive accident video understanding framework for safe driving perception, which performs video diffusion via an Object-Centric Video Diffusion (OAVD) method driven by an abductive CLIP model. The CLIP model uses a contrastive interaction loss to learn the pairwise co-occurrence of normal, near-accident, and accident frames with their corresponding text descriptions. OAVD enforces causal region learning while keeping the content of the original frame background fixed during video generation, so as to find the dominant cause-effect chain for a given accident. Extensive experiments verify the abductive ability of AdVersa-SD and the superiority of OAVD over state-of-the-art diffusion models. The dataset and code are released at www.lotvsmmau.net.
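The abstract does not spell out the contrastive interaction loss. As an illustration only, a minimal CLIP-style symmetric contrastive loss over matched frame-text pairs might look like the sketch below; the function name, temperature value, and batch layout are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_interaction_loss(frame_emb, text_emb, temperature=0.07):
    """Hypothetical CLIP-style symmetric contrastive loss.

    Row i of `frame_emb` is assumed to correspond to row i of `text_emb`,
    e.g. a normal / near-accident / accident frame paired with its
    accident-reason description.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by a temperature.
    logits = frame_emb @ text_emb.t() / temperature

    # Matching pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: frames -> texts and texts -> frames.
    loss_f2t = F.cross_entropy(logits, targets)
    loss_t2f = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_f2t + loss_t2f)

# Usage example with a batch of 8 frame/text embedding pairs of dimension 512.
frames = torch.randn(8, 512)
texts = torch.randn(8, 512)
print(contrastive_interaction_loss(frames, texts))
```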