8 May 2024 | Lingdong Kong, Youquan Liu, Lai Xing Ng, Benoit R. Cottereau, Wei Tsang Ooi
**Event-based Semantic Segmentation (ESS)** is a challenging task in event camera sensing: sparse, asynchronous event streams are difficult both to interpret and to annotate. This paper introduces OpenESS, a novel framework that synergizes information from the image, text, and event data domains to enable scalable ESS in an open-world, annotation-efficient manner. By transferring semantically rich CLIP knowledge from image-text pairs to event streams, OpenESS addresses the limitations of traditional ESS methods, which rely on expensive dense annotations and closed-set learning.
**Key Contributions:**
1. **OpenESS Framework:** A versatile event-based semantic segmentation framework capable of generating open-world dense event predictions given arbitrary text queries (a minimal inference sketch follows this list).
2. **First Attempt:** The first attempt at distilling large vision-language models to assist event-based semantic scene understanding tasks.
3. **Cross-Modality Regularization:** Proposes a frame-to-event (F2E) contrastive distillation and a text-to-event (T2E) consistency regularization to encourage effective cross-modality knowledge transfer.
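To make the open-vocabulary idea concrete, here is a minimal inference sketch: per-pixel event features, assumed to have been aligned with CLIP's embedding space during training, are compared against CLIP text embeddings of arbitrary class prompts, and each pixel takes the most similar class. All identifiers (`open_vocab_segment`, `event_encoder`, `clip_text_encoder`) are hypothetical and not from the paper's code.

```python
# Hedged sketch: open-vocabulary event segmentation via CLIP text embeddings.
# Assumes the event encoder already produces features in CLIP's embedding space.
import torch
import torch.nn.functional as F

@torch.no_grad()
def open_vocab_segment(event_encoder, clip_text_encoder, events, class_prompts):
    """Assign each pixel the class whose CLIP text embedding is most similar.

    events:        (B, C, H, W) event representation (e.g., a voxel grid)
    class_prompts: list of strings such as "a photo of a car"
    returns:       (B, H, W) integer class map
    """
    feats = F.normalize(event_encoder(events), dim=1)            # (B, D, H, W)
    text = F.normalize(clip_text_encoder(class_prompts), dim=1)  # (K, D)
    # Cosine similarity between every pixel embedding and every class prompt.
    logits = torch.einsum("bdhw,kd->bkhw", feats, text)          # (B, K, H, W)
    return logits.argmax(dim=1)
```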
**Methodology:**
- **CLIP Model:** Utilizes the CLIP model to associate images with textual descriptions through a contrastive learning framework.
- **Open-Vocabulary ESS:** Seeks to segment events into semantic classes using raw events and text prompts as inputs.
- **F2E Contrastive Distillation:** Leverages calibrated frames to generate coarse, instance-level superpixels and distills knowledge from a pre-trained image backbone to the event segmentation network.
- **T2E Consistency Regularization:** Uses CLIP's text encoder to generate semantically consistent text-frame pairs, ensuring alignment between events and texts (both training objectives are sketched after this list).
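The summary names the two training objectives but not their exact form. The sketch below shows one plausible reading: superpixel-pooled InfoNCE for F2E, and a KL-based consistency between event-to-text and frame-to-text similarity distributions for T2E. The pooling scheme, loss forms, and all function names are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the F2E and T2E objectives; shapes are per-sample for clarity.
import torch
import torch.nn.functional as F

def pool_by_superpixel(feats, superpixels, num_sp):
    """Average dense (D, H, W) features inside each superpixel -> (num_sp, D)."""
    d, h, w = feats.shape
    flat = feats.reshape(d, h * w).t()            # (HW, D) one row per pixel
    ids = superpixels.reshape(h * w)              # (HW,) superpixel id per pixel
    pooled = torch.zeros(num_sp, d, device=feats.device)
    pooled.index_add_(0, ids, flat)               # sum features per superpixel
    counts = torch.bincount(ids, minlength=num_sp).clamp(min=1)
    return pooled / counts.unsqueeze(1).float()

def f2e_contrastive_loss(event_feats, frame_feats, superpixels, num_sp, tau=0.07):
    """F2E: pull matching event/frame superpixel embeddings together (InfoNCE)."""
    e = F.normalize(pool_by_superpixel(event_feats, superpixels, num_sp), dim=1)
    f = F.normalize(pool_by_superpixel(frame_feats, superpixels, num_sp), dim=1)
    logits = e @ f.t() / tau                      # (num_sp, num_sp) similarities
    targets = torch.arange(num_sp, device=logits.device)
    return F.cross_entropy(logits, targets)       # diagonal pairs are positives

def t2e_consistency_loss(event_feats, frame_feats, text_emb, tau=0.07):
    """T2E: keep event-to-text similarities consistent with frame-to-text ones."""
    t = F.normalize(text_emb, dim=1)                    # (K, D) prompt embeddings
    e = F.normalize(event_feats, dim=1) @ t.t() / tau   # (N, K) event-text logits
    f = F.normalize(frame_feats, dim=1) @ t.t() / tau   # (N, K) frame-text logits
    # KL between distributions over class prompts; the frame branch is frozen.
    return F.kl_div(F.log_softmax(e, dim=1),
                    F.softmax(f.detach(), dim=1),
                    reduction="batchmean")
```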
**Experiments:**
- **Datasets:** Conducted experiments on the DDD17-Seg and DSEC-Semantic datasets.
- **Performance:** Achieves state-of-the-art results in annotation-free and annotation-efficient ESS settings, outperforming existing methods by significant margins.
- **Qualitative Assessment:** Visual comparisons show more consistent semantic information and better instance boundary predictions.
**Conclusion:**
OpenESS addresses the scalability and annotation efficiency challenges in ESS by leveraging cross-modality representation learning. The framework's ability to perform open-vocabulary predictions without using event labels makes it a promising solution for practical applications.