OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies

8 May 2024 | Lingdong Kong, Youquan Liu, Lai Xing Ng, Benoit R. Cottereau, Wei Tsang Ooi
OpenESS is an open-vocabulary event-based semantic segmentation (ESS) framework that enables zero-shot semantic segmentation of event data streams with open vocabularies. Given raw events and text prompts as inputs, OpenESS outputs semantically coherent open-world predictions across adjective, fine-grained, and coarse-grained categories. The framework transfers CLIP knowledge learned from image-text pairs to event streams, introducing a frame-to-event contrastive distillation and a text-to-event semantic consistency regularization to encourage effective cross-modality knowledge transfer. On popular ESS benchmarks, OpenESS outperforms existing methods, achieving 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic without using either event or frame labels. OpenESS supports both annotation-free and annotation-efficient learning, and it generates open-vocabulary predictions beyond the closed label sets of existing methods. It handles sparse, asynchronous, high-temporal-resolution event streams and extends to more open-ended text queries, such as adjective, fine-grained, and coarse-grained descriptions.
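The two training signals described above can be illustrated with a minimal PyTorch-style sketch. This is not the authors' released code: the function names, the region-level pairing of event and frame features, and the pseudo-label source are assumptions made here purely for illustration.

```python
# Minimal sketch of the two OpenESS training signals described above:
# (1) frame-to-event contrastive distillation against frozen CLIP image
# features, and (2) text-to-event semantic consistency against CLIP text
# embeddings. All names and shapes are illustrative placeholders.
import torch
import torch.nn.functional as F

def frame_to_event_contrastive(event_feats, frame_feats, temperature=0.07):
    """InfoNCE-style loss pulling each event embedding toward the frozen
    CLIP image embedding of its paired frame region.

    event_feats: (N, D) embeddings from the trainable event encoder
    frame_feats: (N, D) embeddings from the frozen CLIP image encoder
    """
    event_feats = F.normalize(event_feats, dim=-1)
    frame_feats = F.normalize(frame_feats, dim=-1)
    logits = event_feats @ frame_feats.t() / temperature  # (N, N) similarities
    targets = torch.arange(len(event_feats), device=logits.device)
    return F.cross_entropy(logits, targets)

def text_to_event_consistency(event_feats, text_embeds, pseudo_labels):
    """Cross-entropy between event-vs-text similarities and pseudo labels
    (assumed here to come from a CLIP-based 2D segmenter on the paired frame).

    text_embeds: (C, D) frozen CLIP text embeddings of class prompts
    pseudo_labels: (N,) pseudo class indices, one per event embedding
    """
    event_feats = F.normalize(event_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = 100.0 * event_feats @ text_embeds.t()  # CLIP-style logit scale
    return F.cross_entropy(logits, pseudo_labels)

# Example usage with random tensors standing in for real features:
ev = torch.randn(8, 512)          # event-stream region embeddings
fr = torch.randn(8, 512)          # paired frozen CLIP image features
txt = torch.randn(19, 512)        # prompt embeddings for 19 classes
pl = torch.randint(0, 19, (8,))   # pseudo labels from the 2D branch
loss = frame_to_event_contrastive(ev, fr) + text_to_event_consistency(ev, txt, pl)
```

In this reading, the contrastive term distills dense visual knowledge from the frozen CLIP image branch into the event encoder, while the text term anchors event features directly in CLIP's language embedding space, which is what makes open-vocabulary prediction possible afterwards.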
OpenESS is evaluated on two popular ESS datasets, DDD17-Seg and DSEC-Semantic, under annotation-free, annotation-efficient, and fully-supervised settings, and shows consistent improvements over existing methods, with competitive results against the state of the art. Under linear probing and few-shot fine-tuning, the learned representations yield significant gains; cross-dataset knowledge transfer likewise brings appealing improvements over the random-initialization baseline. Single-modality OpenESS representation learning, targeting the scenario where the frame camera becomes unavailable, shows promising results, albeit below the performance of frame-based knowledge transfer. Overall, the framework is efficient, scalable, and applicable to a wide range of event-based semantic segmentation tasks.
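To make the annotation-free, open-vocabulary inference concrete, below is a hypothetical sketch that scores dense event features against CLIP text embeddings of arbitrary prompts and takes a per-pixel argmax. Only the CLIP calls correspond to a real library; the dense event-encoder output is a random placeholder standing in for a trained model.

```python
# Hypothetical sketch of open-vocabulary inference: per-pixel event features
# are compared against CLIP text embeddings of arbitrary prompts, and each
# pixel is assigned the best-matching prompt. The event features below are
# random placeholders; a trained event encoder would supply them.
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Prompts can be coarse, fine-grained, or adjective-style descriptions.
prompts = ["a photo of a road", "a photo of a car", "a photo of vegetation"]
tokens = clip.tokenize(prompts).to(device)
with torch.no_grad():
    text_embeds = F.normalize(model.encode_text(tokens).float(), dim=-1)  # (C, 512)

# Placeholder for dense features from a trained event encoder: (H, W, 512).
event_feats = F.normalize(torch.randn(60, 80, 512, device=device), dim=-1)

# Cosine similarity against every prompt embedding, then a per-pixel argmax.
sims = torch.einsum("hwd,cd->hwc", event_feats, text_embeds)
pred = sims.argmax(dim=-1)  # (H, W) open-vocabulary segmentation map
print(pred.shape, [prompts[i] for i in pred.flatten()[:3].tolist()])
```

Because the class set is defined only by the tokenized prompts, swapping in a new vocabulary requires no retraining, which is the practical payoff of anchoring event features in CLIP's embedding space.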