Making Short-Form Videos Accessible with Hierarchical Video Summaries

May 11–16, 2024 | Tess Van Daele, Akhil Iyer, Yuning Zhang, Jaly C Derry, Mina Huh, Amy Pavel
ShortScribe is a system that provides hierarchical video summaries to help blind and low vision (BLV) viewers understand and select short-form videos. Short-form videos on platforms like TikTok, Instagram Reels, and YouTube Shorts are often inaccessible because of rapid visual changes, on-screen text, and audio overlays. In a formative study, 7 BLV participants reported frequently skipping inaccessible content and struggling to determine what was happening on screen.

ShortScribe extracts data from a video by identifying key frames and applying automatic speech recognition (ASR), automated image description (BLIP-2), and optical character recognition (OCR). A large language model (GPT-4) then generates descriptions at several levels of detail, including short, long, and shot-by-shot descriptions. BLV users can move between these descriptions according to their level of interest in a video.

The descriptions were evaluated for accuracy and coverage. Long and shot-by-shot descriptions captured all important details, while short and 50-word descriptions captured 75% and 90% of the important details, respectively. ShortScribe's descriptions are comparable to human video descriptions in coverage but are generally more verbose. In a user study with 10 BLV participants, participants reported higher comprehension and provided more accurate summaries of video content with ShortScribe than with a baseline interface. The system was also found to improve video selection and preference relative to that baseline.

ShortScribe's interface consists of a video pane and a description pane, which gives users access to the short, long, and shot-by-shot descriptions. The system was implemented using React.js and a real-time Firebase database, and tested for compatibility with popular screen readers.
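The description hierarchy can be pictured as a small data structure. The following is a minimal sketch, not ShortScribe's actual implementation (which the summary says uses React.js and Firebase). The class name `VideoDescriptions`, its fields, the `at_level` method, and the `next_level` helper are hypothetical names chosen for illustration, and the ordering of levels from least to most detail is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Levels ordered from least to most detail (this ordering is an assumption).
LEVELS = ("short", "long", "shots")


@dataclass
class VideoDescriptions:
    """Hypothetical container for one video's hierarchical descriptions."""
    short: str
    long: str
    shots: List[str] = field(default_factory=list)  # one entry per shot

    def at_level(self, level: str) -> List[str]:
        """Return the text segments a viewer would read at the given level."""
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        if level == "shots":
            # Number the shots so a viewer can tell where they are in the video.
            return [f"Shot {i}: {text}" for i, text in enumerate(self.shots, start=1)]
        return [getattr(self, level)]


def next_level(level: str) -> str:
    """Move one step toward more detail, staying put at the deepest level."""
    index = LEVELS.index(level)
    return LEVELS[min(index + 1, len(LEVELS) - 1)]
```

A viewer who is only mildly interested would stop at `at_level("short")`, while a more interested viewer could call `next_level` repeatedly to reach the shot-by-shot list.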
The pipeline for ShortScribe involves extracting audio and visual information, generating descriptions, and evaluating the accuracy and coverage of the descriptions. The system was found to be effective in making short-form videos accessible to BLV viewers.
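The extract-then-generate steps of the pipeline can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it assumes the ASR, image description, and OCR outputs have already been collected per shot, and it only assembles the text that would be sent to a language model. The names `ShotData`, `format_shot`, `build_prompt`, and the instruction strings in `INSTRUCTIONS` are all hypothetical; the actual prompts are not given in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShotData:
    """Signals extracted for one shot (field names are hypothetical)."""
    index: int
    transcript: str  # ASR output for the shot's audio
    caption: str     # description of the shot's key frame (e.g. from BLIP-2)
    ocr_text: str    # on-screen text found by OCR


# Illustrative instructions only; not the prompts used by ShortScribe.
INSTRUCTIONS = {
    "short": "Summarize the video in one sentence.",
    "long": "Describe the video in a detailed paragraph.",
    "shots": "Describe each shot of the video in turn.",
}


def format_shot(shot: ShotData) -> str:
    """Render one shot's signals as labelled lines, skipping empty ones."""
    parts = [f"Shot {shot.index}:"]
    for label, value in (
        ("Speech", shot.transcript),
        ("Visual", shot.caption),
        ("On-screen text", shot.ocr_text),
    ):
        if value.strip():
            parts.append(f"  {label}: {value.strip()}")
    return "\n".join(parts)


def build_prompt(shots: List[ShotData], level: str) -> str:
    """Combine the per-shot signals with the instruction for one level."""
    if level not in INSTRUCTIONS:
        raise ValueError(f"unknown level: {level!r}")
    body = "\n".join(format_shot(shot) for shot in shots)
    return f"{INSTRUCTIONS[level]}\n\n{body}"
```

Calling `build_prompt` once per level with the same shot list would yield one prompt each for the short, long, and shot-by-shot descriptions; sending those prompts to a model is left out of the sketch on purpose.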