6 May 2024 | Niccolò Cavagnero*, Gabriele Rosi*, Claudia Cuttano, Francesca Pistilli, Marco Ciccone, Giuseppe Averta, Fabio Cermelli
PEM: Prototype-based Efficient MaskFormer for Image Segmentation
PEM is a transformer-based architecture for efficient image segmentation. Its efficiency comes from two components. First, a prototype-based cross-attention mechanism uses prototype selection to reduce the number of input tokens in the attention layers, lowering computational complexity while maintaining performance. Second, an efficient multi-scale pixel decoder, built as a feature pyramid network, strengthens feature extraction with deformable convolutions and context-based self-modulation. PEM is evaluated on semantic and panoptic segmentation on the Cityscapes and ADE20K datasets, where it outperforms task-specific architectures and remains competitive with computationally expensive baselines. It is also significantly faster than existing methods, which makes it suitable for deployment on edge devices. Overall, the experiments show that PEM offers a favorable trade-off between performance and speed.
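The prototype selection idea described above, where each attention layer sees fewer input tokens, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each query keeps only the single pixel token it is most similar to (by dot product), and it uses a plain residual addition as a placeholder for the attention update. The function names `select_prototypes` and `prototype_attention` are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_prototypes(queries, pixel_feats):
    """Pick one prototype per query (assumed rule: highest dot-product similarity).

    queries:     list of N feature vectors (the object queries)
    pixel_feats: list of HW feature vectors (flattened pixel tokens)
    returns:     (prototypes, indices), each of length N
    """
    prototypes, indices = [], []
    for q in queries:
        # Index of the pixel token most similar to this query.
        best = max(range(len(pixel_feats)), key=lambda i: dot(q, pixel_feats[i]))
        indices.append(best)
        prototypes.append(pixel_feats[best])
    return prototypes, indices


def prototype_attention(queries, pixel_feats):
    """Update each query using only its own prototype.

    The residual addition below is a placeholder chosen for illustration,
    not the update rule used by PEM.
    """
    prototypes, indices = select_prototypes(queries, pixel_feats)
    updated = [[qi + pi for qi, pi in zip(q, p)] for q, p in zip(queries, prototypes)]
    return updated, indices


if __name__ == "__main__":
    # Four pixel tokens and two queries, with toy 4-dimensional features.
    pixel_feats = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    queries = [[0, 0, 1, 0], [1, 0, 0, 0]]
    updated, indices = prototype_attention(queries, pixel_feats)
    print(indices, updated)
```

In this sketch the update step touches N prototypes rather than all HW pixel tokens, which is the kind of token reduction the summary attributes to PEM. Computing the similarities still visits every pixel once.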