May 11–16, 2024, Honolulu, HI, USA | Tess Van Daele, Akhil Iyer, Yuning Zhang, Jalyn C Derry, Mina Huh, and Amy Pavel
**ShortScribe: Making Short-Form Videos Accessible with Hierarchical Video Summaries**
**Authors:** Tess Van Daele, Akhil Iyer, Yuning Zhang, Jalyn C Derry, Mina Huh, and Amy Pavel
**Abstract:**
Short videos on platforms like TikTok, Instagram Reels, and YouTube Shorts have become a primary source of information and entertainment. However, many of these videos are inaccessible to blind and low vision (BLV) viewers due to rapid visual changes, on-screen text, and music or meme-audio overlays. In a formative study, 7 BLV viewers reported frequently skipping inaccessible content. ShortScribe is a system that provides hierarchical visual summaries of short-form videos at three levels of detail to support BLV viewers in selecting and understanding videos. The system segments videos into shots, extracts visual information using a vision-language model (BLIP-2) and optical character recognition (OCR), and uses a large language model (GPT-4) to generate descriptions. In a user study with 10 BLV participants, participants showed improved comprehension and gave more accurate summaries of video content when using ShortScribe compared to a baseline interface.
**Key Contributions:**
- Formative study revealing current practices and challenges of watching short-form videos for BLV users.
- Design and development of ShortScribe, a system providing hierarchical visual descriptions.
- User study demonstrating improved experience and video selection with ShortScribe.
**System Overview:**
ShortScribe includes a mobile interface with a video pane and a description pane. The video pane mimics existing short-form video platforms, while the description pane provides access to short, long, and shot-by-shot descriptions. The system uses a pipeline that transcribes audio, segments videos, and generates descriptions using GPT-4.
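The three description levels can be pictured as a small data structure filled in by one language-model call per level. The sketch below is only an illustration under stated assumptions: the names `Shot`, `HierarchicalDescription`, `build_prompt`, and `describe`, the prompt wording, and the `llm` callable are all hypothetical and are not taken from the ShortScribe implementation. Shot segmentation, captioning, OCR, and transcription are assumed to have already run and are represented here as plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Shot:
    """One video shot plus the per-shot signals assumed to be extracted upstream."""
    start: float          # shot start time in seconds
    end: float            # shot end time in seconds
    caption: str          # visual caption (e.g. from an image captioner)
    ocr_text: str = ""    # on-screen text, if any
    transcript: str = ""  # speech overlapping this shot, if any


@dataclass
class HierarchicalDescription:
    """Short, long, and shot-by-shot descriptions of one video."""
    short: str
    long: str
    shot_by_shot: List[str] = field(default_factory=list)


# Hypothetical instructions; the wording is an assumption, not the paper's prompts.
INSTRUCTIONS = {
    "short": "Summarize the video in one sentence.",
    "long": "Describe the video in a short paragraph.",
    "shot": "Describe this single shot in one sentence.",
}


def format_shot(index: int, shot: Shot) -> str:
    """Render one shot's signals as a single prompt line, skipping empty fields."""
    parts = [
        f"Shot {index} [{shot.start:.1f}-{shot.end:.1f}s]",
        f"visual: {shot.caption}",
    ]
    if shot.ocr_text:
        parts.append(f"on-screen text: {shot.ocr_text}")
    if shot.transcript:
        parts.append(f"speech: {shot.transcript}")
    return "; ".join(parts)


def build_prompt(shots: List[Shot], level: str) -> str:
    """Combine the instruction for a level with one line per shot."""
    if level not in INSTRUCTIONS:
        raise ValueError(f"unknown level: {level!r}")
    lines = [format_shot(i, s) for i, s in enumerate(shots, start=1)]
    return INSTRUCTIONS[level] + "\n" + "\n".join(lines)


def describe(shots: List[Shot], llm: Callable[[str], str]) -> HierarchicalDescription:
    """Call the supplied model once per level and once per shot.

    `llm` is any function mapping a prompt string to a reply string, so a
    stub can stand in for a real model during testing.
    """
    return HierarchicalDescription(
        short=llm(build_prompt(shots, "short")),
        long=llm(build_prompt(shots, "long")),
        # Each shot is described on its own, so its prompt line is always "Shot 1".
        shot_by_shot=[llm(build_prompt([s], "shot")) for s in shots],
    )
```

Passing the model in as a plain callable keeps the sketch independent of any particular API client; an interface like the description pane would then only need to read the `short`, `long`, or `shot_by_shot` field that the viewer requests.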
**Evaluation:**
The pipeline was evaluated for accuracy and coverage using a dataset of 58 short-form videos. Results showed that ShortScribe descriptions were generally accurate and covered important details. A user study with 10 BLV participants found that ShortScribe improved video comprehension and provided more accurate summaries, with participants reporting higher willingness to use the system in the future.