This paper presents a solution for Track 2 of the AI City Challenge 2024, focusing on traffic safety description and analysis using the Woven Traffic Safety (WTS) dataset. The solution integrates Parallel Decoding for Video Captioning (PDVC) with CLIP visual features to improve dense captioning of traffic safety videos, particularly in scenarios involving pedestrian and vehicle interactions. Key contributions include:
1. **PDVC with CLIP**: Utilizes PDVC to model visual-language sequences and generate dense captions chapter by chapter, leveraging CLIP visual features for efficient cross-modality training (a feature-extraction sketch follows this list).
2. **Domain-Specific Model Adaptation**: Conducts domain-specific training and knowledge transfer to mitigate domain shift issues in video understanding.
3. **BDD-5K Knowledge Transfer**: Uses BDD-5K captioned videos for knowledge transfer to enhance understanding of WTS videos and improve captioning accuracy (a toy fine-tuning sketch appears at the end of this summary).
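The paper does not include code here, but the feature pipeline behind contribution 1 can be illustrated. The sketch below assumes the Hugging Face `transformers` CLIP implementation with the `openai/clip-vit-base-patch32` checkpoint; `frame_paths`, the sampling rate, and the exact backbone are placeholder assumptions, not details taken from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical paths to frames sampled uniformly from one WTS clip.
frame_paths = ["frame_000.jpg", "frame_010.jpg", "frame_020.jpg"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.open(p).convert("RGB") for p in frame_paths]
inputs = processor(images=frames, return_tensors="pt")

with torch.no_grad():
    # One 512-d embedding per frame; a (T, D) sequence like this serves as
    # the visual input stream for a PDVC-style captioning model.
    feats = model.get_image_features(**inputs)
feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize, per CLIP convention
print(feats.shape)  # e.g. torch.Size([3, 512])
```

Precomputing frame embeddings this way decouples the (frozen) visual backbone from caption training, which is what makes cross-modality training with CLIP features comparatively cheap.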
The solution achieved 6th place in the competition, demonstrating its effectiveness in generating precise and meaningful video captions. The paper also covers related work, methodology, and experiments, and concludes with a discussion of the significance of model configuration and the overall performance of the proposed framework.
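Contributions 2 and 3 describe a two-stage recipe: train on BDD-5K captioned videos first, then fine-tune on WTS to bridge the domain gap. The toy sketch below shows the shape of such a transfer step; `ToyCaptioner`, the checkpoint name, and every hyperparameter are hypothetical placeholders, not the paper's actual PDVC implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for the captioning model: architecture, shapes, and loss
# are illustrative assumptions, not the paper's PDVC code.
class ToyCaptioner(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, vocab_size)

    def forward(self, clip_feats):  # (B, T, 512) CLIP frame features
        return self.head(self.encoder(clip_feats))  # (B, T, vocab) token logits

model = ToyCaptioner()
# Stage 1 (assumed): weights pretrained on BDD-5K captions would be loaded here,
# e.g. model.load_state_dict(torch.load("pdvc_bdd5k.pth")) -- hypothetical file.

# Stage 2: fine-tune on WTS with a small learning rate, so driving-scene
# knowledge from BDD-5K transfers while the model adapts to the WTS domain.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()

clip_feats = torch.randn(2, 16, 512)       # dummy batch: 2 clips x 16 frames
targets = torch.randint(0, 1000, (2, 16))  # dummy caption-token targets
loss = criterion(model(clip_feats).reshape(-1, 1000), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The key design choice the sketch encodes is reusing pretrained weights rather than training on WTS from scratch, which is how the authors mitigate the domain shift between generic driving footage and WTS's pedestrian-centric scenarios.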