Making Short-Form Videos Accessible with Hierarchical Video Summaries


May 11–16, 2024, Honolulu, HI, USA | Tess Van Daele, Akhil Iyer, Yuning Zhang, Jalyn C Derry, Mina Huh, and Amy Pavel
**ShortScribe: Making Short-Form Videos Accessible with Hierarchical Video Summaries**

**Authors:** Tess Van Daele, Akhil Iyer, Yuning Zhang, Jalyn C Derry, Mina Huh, and Amy Pavel

**Abstract:** Short videos on platforms such as TikTok, Instagram Reels, and YouTube Shorts have become a primary source of information and entertainment. However, many of these videos are inaccessible to blind and low vision (BLV) viewers because of rapid visual changes, on-screen text, and music or meme-audio overlays. In a formative study, 7 BLV viewers reported frequently skipping inaccessible content. ShortScribe is a system that provides hierarchical visual summaries of short-form videos at three levels of detail to help BLV viewers select and understand videos. The system segments each video into shots, extracts visual information with a vision language model (BLIP-2) and optical character recognition (OCR), and uses a large language model (GPT-4) to generate descriptions. A user study with 10 BLV participants showed that, compared to a baseline interface, ShortScribe improved participants' comprehension of the videos and the accuracy of their video summaries.

**Key Contributions:**

- A formative study revealing the current practices and challenges of BLV viewers watching short-form videos.
- The design and development of ShortScribe, a system that provides hierarchical visual descriptions.
- A user study demonstrating that ShortScribe improved participants' viewing experience and video selection.

**System Overview:** ShortScribe includes a mobile interface with a video pane and a description pane. The video pane mimics existing short-form video platforms, while the description pane provides access to short, long, and shot-by-shot descriptions. Behind the interface, a pipeline transcribes the audio, segments the video into shots, and generates descriptions using GPT-4.

**Evaluation:** The pipeline was evaluated for accuracy and coverage on a dataset of 58 short-form videos. Results showed that ShortScribe descriptions were generally accurate and covered important details. A user study with 10 BLV participants found that ShortScribe improved participants' video comprehension and the accuracy of their video summaries, and participants reported a higher willingness to use the system in the future.
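The System Overview says the pipeline segments each video into shots before extracting visual information, but this summary does not say how shot boundaries are detected. The following is a minimal sketch under stated assumptions: frames are already decoded into flattened grayscale pixel lists, and a new shot starts whenever the mean absolute difference between consecutive frames exceeds a threshold. This frame-difference rule is a common baseline, not necessarily ShortScribe's method, and the names `segment_shots`, `mean_abs_diff`, and the default threshold of 30 are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: a frame is a flattened list of grayscale pixel values (0-255).
Frame = Sequence[int]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def segment_shots(
    frames: List[Frame], fps: float, threshold: float = 30.0
) -> List[Tuple[float, float]]:
    """Split a frame sequence into shots and return (start_s, end_s) per shot.

    A cut is declared between frames i-1 and i when their mean absolute
    difference exceeds `threshold` (a hypothetical default that would need
    tuning on real videos). End times are exclusive.
    """
    if not frames:
        return []
    boundaries = [0]  # index of the first frame of each shot
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            boundaries.append(i)
    boundaries.append(len(frames))  # sentinel marking the end of the last shot
    return [(start / fps, end / fps) for start, end in zip(boundaries, boundaries[1:])]
```

For example, three dark frames followed by two bright frames at 1 frame per second yield two shots:

```python
frames = [[10] * 4] * 3 + [[200] * 4] * 2
print(segment_shots(frames, fps=1.0))  # [(0.0, 3.0), (3.0, 5.0)]
```

A global threshold like this is easy to reason about but can miss gradual transitions and over-split fast motion, which is one reason dedicated shot detectors exist.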
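After segmentation, the pipeline combines per-shot visual information (BLIP-2 captions and OCR text) with the audio transcript and asks GPT-4 for descriptions at different levels of detail. The sketch below illustrates only the prompt-assembly step and is not the authors' implementation: `ShotInfo`, `format_shots`, `build_prompt`, and the wording in `LEVEL_INSTRUCTIONS` are hypothetical, and the actual BLIP-2, OCR, and GPT-4 calls are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShotInfo:
    start: float   # shot start time in seconds
    end: float     # shot end time in seconds
    caption: str   # image caption, e.g. from a model such as BLIP-2
    ocr_text: str  # on-screen text recognized by OCR ("" if none)


# Hypothetical per-level instructions; the real ShortScribe prompts may differ.
LEVEL_INSTRUCTIONS = {
    "short": "Summarize the video in one sentence so a blind viewer can decide whether to watch it.",
    "long": "Describe the video in one paragraph, covering the main people, actions, settings, and on-screen text.",
}


def format_shots(shots: List[ShotInfo]) -> str:
    """Render per-shot visual evidence as numbered lines for an LLM prompt."""
    lines = []
    for i, shot in enumerate(shots, start=1):
        line = f"Shot {i} ({shot.start:.1f}-{shot.end:.1f}s): {shot.caption}"
        if shot.ocr_text:
            line += f' | On-screen text: "{shot.ocr_text}"'
        lines.append(line)
    return "\n".join(lines)


def build_prompt(shots: List[ShotInfo], transcript: str, level: str) -> str:
    """Assemble the prompt text for one description level ("short" or "long")."""
    if level not in LEVEL_INSTRUCTIONS:
        raise ValueError(f"unknown level: {level!r}")
    return (
        f"{LEVEL_INSTRUCTIONS[level]}\n\n"
        f"Audio transcript:\n{transcript or '(no speech)'}\n\n"
        f"Visual information by shot:\n{format_shots(shots)}"
    )
```

Usage with two illustrative shots:

```python
shots = [
    ShotInfo(0.0, 3.0, "a person holding a bowl of noodles", "EASY RAMEN HACK"),
    ShotInfo(3.0, 5.0, "a close-up of chopsticks lifting noodles", ""),
]
print(build_prompt(shots, "Three ingredients, five minutes.", "short"))
```

Keeping the per-shot evidence as structured records means the same data can feed the short and long summaries as well as a shot-by-shot view, which mirrors the three description levels exposed in the description pane.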