OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies

8 May 2024 | Lingdong Kong, Youquan Liu, Lai Xing Ng, Benoit R. Cottereau, Wei Tsang Ooi
**Event-based Semantic Segmentation (ESS)** is a challenging task in event-camera sensing because event data are difficult to interpret and expensive to annotate. This paper introduces OpenESS, a novel framework that synergizes information from the image, text, and event domains to enable scalable ESS in an open-world, annotation-efficient manner. By transferring semantically rich CLIP knowledge from image-text pairs to event streams, OpenESS addresses the limitations of traditional ESS methods, which typically rely on costly annotations and closed-set learning.

**Key Contributions:**
1. **OpenESS Framework:** A versatile event-based semantic segmentation framework capable of generating open-world dense event predictions from arbitrary text queries.
2. **First Attempt:** The first attempt at distilling large vision-language models to assist event-based semantic scene understanding tasks.
3. **Cross-Modality Regularization:** A frame-to-event (F2E) contrastive distillation and a text-to-event (T2E) consistency regularization that encourage effective cross-modality knowledge transfer.

**Methodology:**
- **CLIP Model:** Uses CLIP to associate images with textual descriptions through a contrastive learning framework.
- **Open-Vocabulary ESS:** Segments events into semantic classes using only raw events and text prompts as inputs.
- **F2E Contrastive Distillation:** Leverages calibrated frames to generate coarse, instance-level superpixels and distills knowledge from a pre-trained image backbone into the event segmentation network.
- **T2E Consistency Regularization:** Uses CLIP's text encoder to generate semantically consistent text-frame pairs, aligning event features with text embeddings. (A code sketch of both objectives follows this list.)
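To make the two regularizers concrete, the following is a minimal PyTorch sketch of how a superpixel-driven F2E contrastive loss and a T2E consistency loss could be implemented. The tensor shapes, helper names, and the source of the pseudo-labels are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of OpenESS-style cross-modality objectives (assumed shapes/names).
import torch
import torch.nn.functional as F

def f2e_contrastive_loss(event_feats, frame_feats, superpixels, temperature=0.07):
    """Frame-to-event (F2E) contrastive distillation.

    event_feats:  (N, C, H, W) features from the event segmentation network.
    frame_feats:  (N, C, H, W) features from the frozen image backbone,
                  projected to the same channel dimension C.
    superpixels:  (N, H, W) integer superpixel ids; events and frames are
                  spatially calibrated, so the ids are shared by both modalities.
    """
    losses = []
    for b in range(event_feats.shape[0]):
        sp = superpixels[b].reshape(-1)                                # (H*W,)
        ev = event_feats[b].reshape(event_feats.shape[1], -1).t()      # (H*W, C)
        fr = frame_feats[b].reshape(frame_feats.shape[1], -1).t()      # (H*W, C)

        # Average-pool features inside each superpixel.
        ids = sp.unique()
        ev_pool = torch.stack([ev[sp == i].mean(0) for i in ids])      # (S, C)
        fr_pool = torch.stack([fr[sp == i].mean(0) for i in ids])      # (S, C)

        ev_pool = F.normalize(ev_pool, dim=-1)
        fr_pool = F.normalize(fr_pool, dim=-1)

        # InfoNCE: the matching event/frame superpixel is the positive,
        # all other superpixels in the sample act as negatives.
        logits = ev_pool @ fr_pool.t() / temperature                   # (S, S)
        target = torch.arange(len(ids), device=logits.device)
        losses.append(F.cross_entropy(logits, target))
    return torch.stack(losses).mean()

def t2e_consistency_loss(event_feats, text_embeds, pseudo_labels):
    """Text-to-event (T2E) consistency regularization.

    event_feats:   (N, C, H, W) event features projected into CLIP's embedding space.
    text_embeds:   (K, C) CLIP text-encoder embeddings of the class prompts.
    pseudo_labels: (N, H, W) per-pixel class indices derived from the frames
                   (e.g. via a CLIP-based frame segmenter), with no human labels.
    """
    n, c, h, w = event_feats.shape
    ev = F.normalize(event_feats.permute(0, 2, 3, 1).reshape(-1, c), dim=-1)  # (N*H*W, C)
    tx = F.normalize(text_embeds, dim=-1)                                     # (K, C)
    logits = ev @ tx.t()                                                      # (N*H*W, K)
    return F.cross_entropy(logits, pseudo_labels.reshape(-1))
```

In this sketch the two losses would simply be summed (optionally with weighting factors) to train the event network without any event labels.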
**Experiments:**
- **Datasets:** DDD17-Seg and DSEC-Semantic.
- **Performance:** Achieves state-of-the-art results in both annotation-free and annotation-efficient ESS settings, outperforming existing methods by clear margins.
- **Qualitative Assessment:** Visual comparisons show more consistent semantic predictions and sharper instance boundaries.

**Conclusion:** OpenESS addresses the scalability and annotation-efficiency challenges in ESS by leveraging cross-modality representation learning. Its ability to make open-vocabulary predictions without event labels (illustrated by the inference sketch below) makes it a promising solution for practical applications.
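As a usage-level illustration of the open-vocabulary prediction capability described above, the sketch below shows how trained event features aligned to CLIP's embedding space could be queried with arbitrary text prompts at inference time. The `event_encoder`, the prompt template, and the assumption that its output channels match CLIP's 512-d text embeddings are hypothetical stand-ins for the paper's actual pipeline.

```python
# Minimal sketch of open-vocabulary event segmentation at inference time.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

@torch.no_grad()
def open_vocab_segment(event_encoder, event_tensor, class_names, device="cuda"):
    """Segment an event representation given arbitrary text queries.

    event_encoder: trained event network whose per-pixel features live in CLIP space.
    event_tensor:  (1, C_in, H, W) event frame / voxel-grid input.
    class_names:   list of free-form class strings, e.g. ["road", "car", "rider"].
    """
    clip_model, _ = clip.load("ViT-B/32", device=device)

    # Encode the text queries with CLIP's text encoder.
    prompts = [f"a photo of a {name}" for name in class_names]
    tokens = clip.tokenize(prompts).to(device)
    text_embeds = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)  # (K, C)

    # Per-pixel event features, assumed to be (1, C, H, W) in CLIP's embedding space.
    feats = F.normalize(event_encoder(event_tensor.to(device)), dim=1)

    # Cosine similarity against every text query; argmax gives the class map.
    logits = torch.einsum("nchw,kc->nkhw", feats, text_embeds)                 # (1, K, H, W)
    return logits.argmax(dim=1)                                                # (1, H, W)
```

Because the class list is built from free-form strings at query time, the same trained network can be probed with categories that were never seen during training.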