WorldScribe: Towards Context-Aware Live Visual Descriptions

October 13-16, 2024 | Ruei-Che Chang, Yuxuan Liu, Anhong Guo
WorldScribe is a system that provides automated live visual descriptions tailored to users' contexts, enabling blind or visually impaired (BVI) individuals to understand their surroundings autonomously. The system dynamically combines vision-language models (VLMs) and large language models (LLMs) to generate descriptions that are customizable and adaptive to users' intents and to their visual and sound contexts. For instance, when users move quickly, WorldScribe provides succinct descriptions, while static scenes receive longer, detailed ones. It also adjusts descriptions based on sound contexts, such as increasing volume in noisy environments or pausing during conversations, as sketched in the first example below.

WorldScribe is powered by a suite of vision, language, and sound recognition models, enabling a description generation pipeline that balances the trade-off between richness and latency to support real-time use. The system prioritizes descriptions based on semantic relevance, user intent, and proximity to the user (see the second sketch below), and it keeps spoken descriptions up to date by examining object compositions and changes in user orientation.

WorldScribe's design is informed by prior work on visual descriptions and a formative study with five BVI participants, which identified key design considerations: providing an overview first followed by adaptive details, prioritizing descriptions based on semantic relevance, and enabling customizability for varied user needs. These insights informed the development of WorldScribe, which was then evaluated with six BVI participants. The evaluation showed that WorldScribe can provide real-time and fairly accurate visual descriptions, facilitating adaptive and customized environment understanding. The system's technical approach may find broader application in enhancing real-time visual assistance and promoting real-world and digital media accessibility. The study also highlighted the need for future work to make AI-generated descriptions more humanized, user-centric, and context-aware.
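As an illustration of the adaptive behavior described above, the following is a minimal, hypothetical sketch of how description detail and speech output might be tied to the user's motion and the ambient sound level. The thresholds, names, and settings here are assumptions made for illustration; the paper does not specify this implementation.

# Hypothetical sketch (not WorldScribe's actual implementation): mapping the
# user's estimated motion speed and the ambient sound level to description
# detail and speech settings. All thresholds and names are illustrative.
from dataclasses import dataclass
from enum import Enum


class Detail(Enum):
    WORD = "single word or short phrase"   # fast movement: succinct labels
    SENTENCE = "one sentence"              # moderate movement
    PARAGRAPH = "rich, detailed text"      # static scene: longer descriptions


@dataclass
class SpeechSettings:
    volume_gain_db: float   # raised in noisy environments
    paused: bool            # paused while a conversation is detected


def choose_detail(user_speed_mps: float) -> Detail:
    """Map estimated walking speed (m/s) to description length.
    Thresholds are assumptions, not values reported in the paper."""
    if user_speed_mps > 1.0:
        return Detail.WORD
    if user_speed_mps > 0.3:
        return Detail.SENTENCE
    return Detail.PARAGRAPH


def choose_speech(ambient_db: float, conversation_detected: bool) -> SpeechSettings:
    """Adjust the speech output to the sound context."""
    if conversation_detected:
        # Hold descriptions while people nearby are talking.
        return SpeechSettings(volume_gain_db=0.0, paused=True)
    # Raise the volume roughly in proportion to noise above a quiet baseline.
    gain = max(0.0, (ambient_db - 50.0) * 0.3)
    return SpeechSettings(volume_gain_db=min(gain, 12.0), paused=False)


if __name__ == "__main__":
    print(choose_detail(1.4))                                        # Detail.WORD
    print(choose_speech(ambient_db=72.0, conversation_detected=False))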
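The prioritization step (semantic relevance, user intent, and proximity to the user) can likewise be pictured as a scoring function over candidate descriptions. The sketch below is hypothetical: the weights and the crude lexical-overlap relevance measure are assumptions, and a real system would more likely use embedding similarity from a language model.

# Hypothetical sketch of prioritizing candidate descriptions by relevance to
# the user's stated intent and by proximity. Weights and the overlap-based
# relevance measure are illustrative assumptions, not the paper's method.
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str            # generated description of one object or region
    distance_m: float    # estimated distance from the user


def relevance(intent: str, text: str) -> float:
    """Crude lexical-overlap relevance in [0, 1]."""
    intent_words = set(intent.lower().split())
    text_words = set(text.lower().split())
    if not intent_words:
        return 0.0
    return len(intent_words & text_words) / len(intent_words)


def priority(intent: str, c: Candidate,
             w_relevance: float = 0.7, w_proximity: float = 0.3) -> float:
    """Higher scores are spoken first: nearby, intent-relevant content wins."""
    proximity = 1.0 / (1.0 + c.distance_m)   # decays with distance
    return w_relevance * relevance(intent, c.text) + w_proximity * proximity


if __name__ == "__main__":
    intent = "find an empty seat"
    candidates = [
        Candidate("an empty seat by the window", 4.0),
        Candidate("a potted plant next to the door", 1.5),
        Candidate("a person walking toward you", 2.0),
    ]
    for c in sorted(candidates, key=lambda c: priority(intent, c), reverse=True):
        print(f"{priority(intent, c):.2f}  {c.text}")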