Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis


2024 | Maged Shoman, Dongdong Wang, Armstrong Aboah, Mohamed Abdel-Aty
This paper presents a solution for Track 2 of the AI City Challenge 2024, focusing on traffic safety description and analysis using the Woven Traffic Safety (WTS) dataset. The solution integrates Parallel Decoding for Video Captioning (PDVC) with CLIP visual features to enhance dense video captioning for traffic safety scenarios. The approach addresses challenges such as domain shift, efficient tokenization, and the need for accurate, coherent captions. Key components include domain-specific model adaptation, knowledge transfer from BDD-5K to WTS, and post-processing techniques to improve text fluency. The solution achieves 6th place in the competition on the test set.

The method builds on dense video captioning: PDVC jointly localizes events in a video and generates a caption for each, while CLIP is used to extract visual features, enabling cross-modal training between visual and textual representations. Domain-specific training and knowledge transfer from BDD-5K to WTS are used to mitigate domain shift. The pipeline also includes video synchronization through trimming, together with post-processing to improve the coherence and accuracy of the generated captions. Experiments on the WTS and BDD-5K datasets demonstrate the effectiveness of the approach in producing accurate and meaningful video captions. The results show that the proposed solution outperforms competing methods in caption accuracy and fluency, contributing an end-to-end solution for traffic safety analysis to the video captioning field.
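As an illustration of the feature-extraction stage described above, the following is a minimal sketch of pulling frame-level CLIP visual features from a traffic video so they could feed a PDVC-style dense captioning model. The checkpoint name, frame-sampling stride, and pooling choices are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch: frame-level CLIP visual features for dense video captioning.
# Assumes the Hugging Face "openai/clip-vit-base-patch32" checkpoint and a fixed
# frame-sampling stride; the paper's actual preprocessing may differ.
import cv2
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def extract_clip_features(video_path: str, stride: int = 8) -> torch.Tensor:
    """Sample every `stride`-th frame and return a (num_frames, dim) feature tensor."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            # OpenCV decodes frames as BGR; CLIP expects RGB input.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()

    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)  # (num_frames, 512) for ViT-B/32
    # L2-normalize so the features lie on the unit sphere, as in CLIP training.
    return feats / feats.norm(dim=-1, keepdim=True)

# features = extract_clip_features("wts_example_clip.mp4")  # hypothetical file name
# The stacked per-frame features would then be passed to the PDVC encoder-decoder,
# which localizes events and generates a caption for each in parallel.
```

In a setup like this, the per-frame feature sequence stands in for the raw video, so event localization and caption generation operate on a compact representation rather than on pixels.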