**WorldScribe: Towards Context-Aware Live Visual Descriptions**
Ruei-Che Chang, Yuxuan Liu, Anhong Guo | UIST 2024, October 13–16, Pittsburgh, PA, USA
WorldScribe is a system designed to provide automated, real-time visual descriptions for blind and visually impaired (BVI) individuals, enhancing their understanding of their surroundings. The system dynamically combines vision, language, and sound recognition models to generate context-aware and adaptive descriptions. Key features include:
1. **Context-Aware Descriptions**: WorldScribe tailors descriptions to users' intents and prioritizes them based on semantic relevance.
2. **Adaptive to Visual Contexts**: It provides succinct descriptions for dynamic scenes and detailed descriptions for stable settings.
3. **Adaptive to Sound Contexts**: It adjusts speech volume and pauses narration based on environmental sounds (a sketch of both adaptive behaviors follows this list).
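To make the adaptive behavior concrete, here is a minimal Python sketch of how a system like this might pick a description style from scene dynamism and a speech volume from ambient loudness. The embedding source, thresholds, and return fields are illustrative assumptions, not WorldScribe's actual implementation.

```python
import numpy as np

def plan_description(frame_embs: list[np.ndarray], ambient_db: float,
                     motion_thresh: float = 0.15, loud_db: float = 70.0) -> dict:
    """Pick a description style and speech volume from visual/sound context.

    frame_embs: visual embeddings (e.g., from CLIP) of recent camera frames.
    ambient_db: current environmental sound level in dB.
    Thresholds and field names are illustrative, not WorldScribe's values.
    """
    # Scene dynamism: mean cosine distance between consecutive frame embeddings.
    dists = [1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
             for a, b in zip(frame_embs, frame_embs[1:])]
    dynamic = bool(dists) and float(np.mean(dists)) > motion_thresh
    return {
        # Dynamic scenes get terse labels; stable scenes get rich captions.
        "style": "succinct" if dynamic else "detailed",
        # Speak louder over noisy environments so descriptions stay audible.
        "tts_volume": 1.0 if ambient_db > loud_db else 0.7,
    }
```

The dynamism signal could come from any frame-embedding model; the point is only that motion shortens descriptions while stability licenses detail.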
**System Architecture**:
- **Intent Specification Layer**: Users specify their intent, which is decomposed into specific visual attributes and relevant objects.
- **Keyframe Extraction Layer**: Identifies keyframes based on camera orientation and visual similarity (sketched after this list).
- **Description Generation Layer**: Uses a suite of vision and language models to generate descriptions, balancing richness against latency (see the tiered-pipeline sketch below).
- **Description Prioritization Layer**: Selects the most relevant and up-to-date descriptions based on user intent, proximity, and context (a scoring sketch follows).
- **Presentation Layer**: Adjusts how descriptions are delivered based on sound context, for example pausing narration or increasing volume.
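One plausible reading of the keyframe extraction layer: emit a keyframe when the camera has settled (small recent orientation change) on content that looks different from the last keyframe. The orientation window, embedding inputs, and thresholds below are assumptions for illustration.

```python
import numpy as np

def is_keyframe(last_key_emb: np.ndarray, cur_emb: np.ndarray,
                recent_yaws_deg: list[float],
                stable_deg: float = 5.0, sim_thresh: float = 0.9) -> bool:
    """Emit a keyframe when the camera is steady but the view is novel."""
    # Camera stability: orientation barely moved over the recent window.
    steady = max(recent_yaws_deg) - min(recent_yaws_deg) < stable_deg
    # Visual novelty: low cosine similarity to the previous keyframe.
    cos_sim = float(np.dot(last_key_emb, cur_emb)
                    / (np.linalg.norm(last_key_emb) * np.linalg.norm(cur_emb)))
    return steady and cos_sim < sim_thresh
```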
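The richness/latency balance in the description generation layer suggests a tiered pipeline: a fast model answers immediately, and richer models fill in when the scene holds still long enough to justify their latency. A hypothetical sketch; the stub calls below stand in for whatever detector, captioner, and vision-language model a real system would use.

```python
import asyncio

# Stub model calls; a real system would wrap an object detector, an image
# captioner, and a large vision-language model behind these signatures.
async def detect_objects(frame) -> str:
    return "door, chair"

async def caption_image(frame) -> str:
    return "a hallway with a wooden door"

async def query_vlm(frame, prompt: str) -> str:
    return "A long, well-lit hallway with a wooden door on the right..."

async def generate_descriptions(frame, scene_is_stable: bool) -> list[str]:
    """Fast, terse output first; richer text only when the scene holds still."""
    out = [await detect_objects(frame)]          # tier 1: near-instant labels
    if scene_is_stable:
        out.append(await caption_image(frame))   # tier 2: one-sentence caption
        out.append(await query_vlm(frame, "Describe this scene in detail."))
    return out

# e.g., asyncio.run(generate_descriptions(frame, scene_is_stable=True))
```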
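The prioritization layer can likewise be read as a weighted score over intent relevance, proximity, and freshness. The weights and decay forms below are invented for illustration, not taken from the paper.

```python
def priority(intent_relevance: float, distance_m: float, age_s: float,
             w_rel: float = 0.6, w_prox: float = 0.3, w_fresh: float = 0.1) -> float:
    """Score a candidate description; higher scores are spoken first.

    intent_relevance: 0..1 semantic match to the user's stated intent.
    distance_m: estimated distance to the described object, in meters.
    age_s: seconds since the description was generated.
    """
    return (w_rel * intent_relevance
            + w_prox / (1.0 + distance_m)    # nearer objects matter more
            + w_fresh / (1.0 + age_s))       # stale descriptions decay
```

Candidates would then be spoken in descending score order, with stale or superseded entries dropped.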
**User Study**:
- **Participants**: Six BVI participants.
- **Scenarios**: Specific intent, general intent, and user-defined intent.
- **Results**: Participants found WorldScribe's descriptions accurate and useful, but some expressed skepticism due to occasional errors and the need for further customization.
**Conclusion**:
WorldScribe represents a significant step toward making live visual descriptions more context-aware and user-centric, enhancing accessibility for BVI individuals. Future work will focus on further humanizing AI-generated descriptions and improving their adaptability.