**WorldScribe: Towards Context-Aware Live Visual Descriptions**
Ruei-Che Chang, Yuxuan Liu, Anhong Guo | UIST 2024, October 13–16, Pittsburgh, PA, USA
WorldScribe is a system designed to provide automated, real-time visual descriptions for blind and visually impaired (BVI) individuals, enhancing their understanding of their surroundings. The system dynamically combines vision, language, and sound recognition models to generate context-aware and adaptive descriptions. Key features include:
1. **Context-Aware Descriptions**: WorldScribe tailors descriptions to users' intents and prioritizes them based on semantic relevance.
2. **Adaptive to Visual Contexts**: It provides succinct descriptions for dynamic scenes and detailed descriptions for stable settings.
3. **Adaptive to Sound Contexts**: It adjusts speech volume and pauses narration based on environmental sounds (a sketch of both adaptive behaviors follows this list).
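To make the adaptive behavior concrete, here is a minimal Python sketch of how a system like this might pick a description style from scene dynamism and a speech volume from ambient loudness. The embedding source, thresholds, and return fields are illustrative assumptions, not WorldScribe's actual implementation.

```python
import numpy as np

def plan_description(frame_embs: list[np.ndarray], ambient_db: float,
                     motion_thresh: float = 0.15, loud_db: float = 70.0) -> dict:
    """Pick a description style and speech volume from visual/sound context.

    frame_embs: visual embeddings (e.g., from CLIP) of recent camera frames.
    ambient_db: current environmental sound level in dB.
    Thresholds and field names are illustrative, not WorldScribe's values.
    """
    # Scene dynamism: mean cosine distance between consecutive frame embeddings.
    dists = [1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
             for a, b in zip(frame_embs, frame_embs[1:])]
    dynamic = bool(dists) and float(np.mean(dists)) > motion_thresh
    return {
        # Dynamic scenes get terse labels; stable scenes get rich captions.
        "style": "succinct" if dynamic else "detailed",
        # Speak louder over noisy environments so descriptions stay audible.
        "tts_volume": 1.0 if ambient_db > loud_db else 0.7,
    }
```

The dynamism signal could come from any frame-embedding model; the point is only that motion shortens descriptions while stability licenses detail.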
**System Architecture**:
- **Intent Specification Layer**: Users specify their intent, which is decomposed into specific visual attributes and relevant objects.
- **Keyframe Extraction Layer**: Identifies keyframes based on camera orientation and visual similarity (sketched after this list).
- **Description Generation Layer**: Uses a suite of vision and language models to generate descriptions, balancing richness against latency (see the tiered-pipeline sketch below).
- **Description Prioritization Layer**: Selects the most relevant and up-to-date descriptions based on user intent, proximity, and context (a scoring sketch follows).
- **Presentation Layer**: Adjusts how descriptions are delivered based on sound context, for example pausing narration or increasing volume.
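One plausible reading of the keyframe extraction layer: emit a keyframe when the camera has settled (small recent orientation change) on content that looks different from the last keyframe. The orientation window, embedding inputs, and thresholds below are assumptions for illustration.

```python
import numpy as np

def is_keyframe(last_key_emb: np.ndarray, cur_emb: np.ndarray,
                recent_yaws_deg: list[float],
                stable_deg: float = 5.0, sim_thresh: float = 0.9) -> bool:
    """Emit a keyframe when the camera is steady but the view is novel."""
    # Camera stability: orientation barely moved over the recent window.
    steady = max(recent_yaws_deg) - min(recent_yaws_deg) < stable_deg
    # Visual novelty: low cosine similarity to the previous keyframe.
    cos_sim = float(np.dot(last_key_emb, cur_emb)
                    / (np.linalg.norm(last_key_emb) * np.linalg.norm(cur_emb)))
    return steady and cos_sim < sim_thresh
```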
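The richness/latency balance in the description generation layer suggests a tiered pipeline: a fast model answers immediately, and richer models fill in when the scene holds still long enough to justify their latency. A hypothetical sketch; the stub calls below stand in for whatever detector, captioner, and vision-language model a real system would use.

```python
import asyncio

# Stub model calls; a real system would wrap an object detector, an image
# captioner, and a large vision-language model behind these signatures.
async def detect_objects(frame) -> str:
    return "door, chair"

async def caption_image(frame) -> str:
    return "a hallway with a wooden door"

async def query_vlm(frame, prompt: str) -> str:
    return "A long, well-lit hallway with a wooden door on the right..."

async def generate_descriptions(frame, scene_is_stable: bool) -> list[str]:
    """Fast, terse output first; richer text only when the scene holds still."""
    out = [await detect_objects(frame)]          # tier 1: near-instant labels
    if scene_is_stable:
        out.append(await caption_image(frame))   # tier 2: one-sentence caption
        out.append(await query_vlm(frame, "Describe this scene in detail."))
    return out

# e.g., asyncio.run(generate_descriptions(frame, scene_is_stable=True))
```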
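The prioritization layer can likewise be read as a weighted score over intent relevance, proximity, and freshness. The weights and decay forms below are invented for illustration, not taken from the paper.

```python
def priority(intent_relevance: float, distance_m: float, age_s: float,
             w_rel: float = 0.6, w_prox: float = 0.3, w_fresh: float = 0.1) -> float:
    """Score a candidate description; higher scores are spoken first.

    intent_relevance: 0..1 semantic match to the user's stated intent.
    distance_m: estimated distance to the described object, in meters.
    age_s: seconds since the description was generated.
    """
    return (w_rel * intent_relevance
            + w_prox / (1.0 + distance_m)    # nearer objects matter more
            + w_fresh / (1.0 + age_s))       # stale descriptions decay
```

Candidates would then be spoken in descending score order, with stale or superseded entries dropped.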
**User Study**:
- **Participants**: Six BVI participants.
- **Scenarios**: Specific intent, general intent, and user-defined intent.
- **Results**: Participants found WorldScribe's descriptions accurate and useful, but some expressed skepticism due to occasional errors and the need for further customization.
**Conclusion**:
WorldScribe represents a significant step toward making live visual descriptions more context-aware and user-centric, enhancing accessibility for BVI individuals. Future work will focus on further humanizing AI-generated descriptions and improving their adaptability.