22 Apr 2024 | Tengda Han¹, Max Bain¹, Arsha Nagrani¹†, Gül Varol¹,², Weidi Xie¹,³, Andrew Zisserman¹
This paper addresses the challenging task of generating Audio Description (AD) for movies, which requires fine-grained visual understanding and awareness of characters and their names. The authors make three main contributions: (i) they develop two new datasets, CMD-AD and HowTo-AD, that align AD descriptions with pixel-level video data using pseudo-ground-truth annotations; (ii) they introduce a new Q-former-based architecture that ingests raw video and character-bank proposals to generate character-aware AD on top of frozen pre-trained visual encoders and large language models; (iii) they propose two new evaluation metrics tailored to AD, CRITIC and LLM-AD-eval, which assess character naming and holistic semantics, respectively. The proposed methods improve the state of the art on AD generation, demonstrating their effectiveness in producing accurate and contextually relevant AD.
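To make the architecture concrete, below is a minimal sketch (not the authors' released code) of the character-aware generation pipeline described above: a Q-Former-style module whose learnable queries cross-attend to features from a frozen visual encoder together with embeddings of character-bank name proposals, producing soft-prompt tokens for a frozen LLM. All module names, dimensions, and the use of `nn.TransformerDecoder` as a stand-in for the Q-Former are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CharacterAwareADGenerator(nn.Module):
    """Sketch: bridge frozen visual features + character bank to a frozen LLM."""

    def __init__(self, d_vis=1024, d_llm=4096, n_queries=32, d_model=768):
        super().__init__()
        # Learnable query tokens, as in BLIP-2-style Q-Formers.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        # Stand-in for the Q-Former: queries cross-attend to the
        # concatenated video and character-bank tokens.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=4)
        self.vis_proj = nn.Linear(d_vis, d_model)   # frozen-encoder features -> Q-Former width
        self.char_proj = nn.Linear(d_llm, d_model)  # character-name embeddings -> Q-Former width
        self.llm_proj = nn.Linear(d_model, d_llm)   # Q-Former outputs -> frozen-LLM embedding width

    def forward(self, vis_feats, char_feats):
        # vis_feats:  (B, T, d_vis) patch/frame features from a frozen visual encoder
        # char_feats: (B, C, d_llm) embeddings of character-bank name proposals
        memory = torch.cat([self.vis_proj(vis_feats),
                            self.char_proj(char_feats)], dim=1)
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        out = self.qformer(q, memory)   # (B, n_queries, d_model)
        return self.llm_proj(out)       # soft prompts for the frozen LLM
```

In use, the returned soft-prompt tokens would be prepended to the text embeddings of the frozen LLM, which then decodes the AD sentence autoregressively; only the Q-Former and projections are trained.

Similarly, the CRITIC metric targets character naming. The paper's actual formulation may differ; the following simplified proxy, an assumption of ours, scores set overlap between character-bank names mentioned in the generated versus the reference AD.

```python
def critic_f1(pred_ad: str, gt_ad: str, character_bank: list[str]) -> float:
    """F1 over character-bank names mentioned in predicted vs. reference AD."""
    def names_in(text: str) -> set[str]:
        low = text.lower()
        return {name for name in character_bank if name.lower() in low}

    pred, ref = names_in(pred_ad), names_in(gt_ad)
    if not pred and not ref:
        return 1.0  # no characters to name: trivially correct
    if not pred or not ref:
        return 0.0  # one side names characters, the other does not
    precision = len(pred & ref) / len(pred)
    recall = len(pred & ref) / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```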