October 13–16, 2024 | Mustafa Doga Dogan, Eric J. Gonzalez, Karan Ahuja, Ruofei Du, Andrea Colaço, Johnny Lee, Mar Gonzalez-Franco, David Kim
Augmented Object Intelligence (AOI) is a novel interaction paradigm in which real-world objects act as digital entities that users can interact with directly. The approach combines real-time object segmentation and classification with Multimodal Large Language Models (MLLMs), so that interactions require no object pre-registration. XR-Objects, an open-source prototype of this paradigm, lets users engage with their physical environment in contextually relevant ways through object-anchored context menus: analog objects not only convey information but can also initiate digital actions, such as querying for details or executing tasks. The contributions of this work are threefold: (1) defining the AOI concept and its advantages over traditional AI assistants, (2) detailing the open-source design and implementation of XR-Objects, and (3) demonstrating its versatility through various use cases and a user study.
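The detect-then-enrich flow described above can be illustrated with a minimal Python sketch. All names here (`DetectedObject`, `build_menus`, the stub detector and MLLM callables) are hypothetical and not from the XR-Objects codebase, which is implemented in Unity; the sketch only shows how a per-object context menu could be assembled without any pre-registration, with the MLLM queried lazily per detected object.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DetectedObject:
    label: str    # class name from the segmentation/classification model
    bbox: tuple   # (x_min, y_min, x_max, y_max) in image pixels

@dataclass
class ContextMenu:
    target: DetectedObject
    actions: dict = field(default_factory=dict)  # action name -> handler

def build_menus(frame, detect: Callable, query_mllm: Callable) -> list:
    """Detect objects in a camera frame, enrich each with MLLM metadata,
    and attach an object-anchored context menu (no pre-registration)."""
    menus = []
    for obj in detect(frame):
        # Ask the MLLM for object-specific semantic information.
        info = query_mllm(f"Describe the {obj.label} in this scene.")
        menus.append(ContextMenu(target=obj, actions={
            "info": lambda info=info: info,  # show MLLM-retrieved details
            "compare": lambda o=obj: f"Comparing options for {o.label}",
        }))
    return menus
```

In the real system the detector runs continuously on the camera feed and the menu is rendered in world space; here, `detect` and `query_mllm` are injected so the pipeline shape is testable with stubs.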
XR-Objects integrates an MLLM to automate the recognition of objects and the retrieval of their semantic information within XR spaces. Its design prioritizes object-centric interaction: users engage directly with objects in their environment, without navigating to an app or supplying additional input. This minimizes the operational steps needed to reach digital functionality, yields a more natural interaction flow, and supports multi-tasking. Digital elements are rendered in a world-space UI, anchored to the physical objects they describe, so interactions remain contextually grounded in the user's real-world environment.
The implementation proceeds in several steps: object detection and classification, 3D localization and anchoring, coupling each object with an MLLM for metadata retrieval, and executing actions triggered by user input. The architecture is built in Unity with AR Foundation and comprises components for object detection, 3D localization, and menu interaction. The design fixes the number of top-level menu categories and actions, a choice motivated by usability and cognitive-efficiency considerations.
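The 3D localization and anchoring step can be sketched with the standard pinhole camera model: given a 2D bounding box from the detector and a depth estimate at its center, the object's anchor position follows by back-projection. This is an illustrative sketch, not the AR Foundation implementation; `CameraIntrinsics`, `unproject`, and `anchor_from_bbox` are hypothetical names, and a real system would use the platform's raycast/depth APIs.

```python
from dataclasses import dataclass

@dataclass
class CameraIntrinsics:
    fx: float  # focal length in pixels, x axis
    fy: float  # focal length in pixels, y axis
    cx: float  # principal point, x
    cy: float  # principal point, y

def unproject(u: float, v: float, depth: float, K: CameraIntrinsics) -> tuple:
    """Back-project pixel (u, v) with known depth to a camera-space 3D point."""
    x = (u - K.cx) * depth / K.fx
    y = (v - K.cy) * depth / K.fy
    return (x, y, depth)

def anchor_from_bbox(bbox: tuple, depth: float, K: CameraIntrinsics) -> tuple:
    """Place an anchor at the 3D point behind the bounding-box center."""
    u = (bbox[0] + bbox[2]) / 2  # bbox = (x_min, y_min, x_max, y_max)
    v = (bbox[1] + bbox[3]) / 2
    return unproject(u, v, depth, K)
```

Once the 3D point is known, the context menu is parented to a world-space anchor at that position, so it stays attached to the object as the user moves.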
The system was evaluated in a user study comparing XR-Objects to a state-of-the-art MLLM assistant interface (the Gemini app). Participants using XR-Objects completed tasks in significantly less time than those using the Gemini app. Both approaches to MLLM-enabled real-world search were rated positively, with XR-Objects making the stronger case for MLLM-enabled information retrieval. Participants clearly preferred XR-Objects in the HMD form factor, while preferences between the two tools were split when using a phone.
The system has a wide range of applications, including cooking, shopping, discovery, productivity, learning, and IoT connectivity. In cooking, the system can provide real-time information about ingredients and guide users through the cooking process. In shopping, it can help users compare products and find the most suitable options. In discovery, the