OMG-Seg: Is One Model Good Enough For All Segmentation?


18 Jan 2024 | Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Yining Li, Kai Chen, Chen Change Loy
OMG-Seg is a unified segmentation model that handles over ten different segmentation tasks spanning four directions: image segmentation, video segmentation, interactive segmentation, and open-vocabulary segmentation. It is the first model to unify these four directions while achieving satisfactory performance on each. Built on the Mask2Former framework, OMG-Seg uses a transformer-based encoder-decoder architecture in which the outputs of every task are unified into a single query representation, so one shared decoder processes all types of queries. This design supports the full range of tasks while significantly reducing computational and parameter overhead.
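To make the shared-decoder idea concrete, here is a minimal PyTorch sketch (all names are hypothetical, not the authors' code) of a single decoder that turns every task's queries, whether they represent image entities, video entities, or interactive prompts, into the same (mask, class-embedding) output pair:

```python
# Hypothetical sketch of OMG-Seg's unified decoding: one decoder, one output
# format for all tasks. Not the official implementation.
import torch
import torch.nn as nn

class SharedMaskDecoder(nn.Module):
    def __init__(self, dim=256, num_layers=9, num_heads=8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_head = nn.Linear(dim, dim)  # dotted with pixel features -> masks
        self.cls_head = nn.Linear(dim, dim)   # embedding matched to text features

    def forward(self, queries, pixel_feats):
        # queries:     (B, N, dim)  task-specific queries, one per entity/prompt
        # pixel_feats: (B, HW, dim) per-pixel features from the (frozen) backbone
        q = self.decoder(queries, pixel_feats)
        # Mask logits: similarity of each refined query to each pixel feature.
        mask_logits = torch.einsum("bnd,bpd->bnp", self.mask_head(q), pixel_feats)
        cls_embed = self.cls_head(q)  # later classified against text embeddings
        return mask_logits, cls_embed
```

Because semantic, instance, video, and interactive queries all pass through this one module, adding a task means adding queries, not adding a decoder.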
For open-vocabulary segmentation, OMG-Seg adopts a frozen CLIP visual encoder as its backbone, so each predicted mask can be classified against text embeddings of arbitrary class names; this makes open-vocabulary inference possible at no additional cost. The model is trained jointly on combined image and video datasets and evaluated across eight benchmarks, including COCO, ADE-20k, VIPSeg, Youtube-VIS-2019, Youtube-VIS-2021, and DAVIS-17. It achieves results comparable to specialized models on image, video, open-vocabulary, and interactive segmentation settings, and outperforms several existing models on open-vocabulary segmentation. Extensive experiments validate its effectiveness and versatility across these diverse segmentation challenges.
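The open-vocabulary step amounts to cosine similarity between the decoder's class embeddings and CLIP text embeddings of candidate class names. A hedged sketch, assuming the text embeddings have already been computed with a CLIP text encoder (names and shapes here are illustrative, not the paper's API):

```python
# Hypothetical sketch: open-vocabulary mask classification via cosine
# similarity against frozen CLIP text embeddings of class names.
import torch
import torch.nn.functional as F

def classify_open_vocab(cls_embed, text_embed, temperature=0.01):
    # cls_embed:  (B, N, D) mask embeddings from the shared decoder
    # text_embed: (C, D)    frozen CLIP text embeddings, one row per class name
    q = F.normalize(cls_embed, dim=-1)
    t = F.normalize(text_embed, dim=-1)
    logits = torch.einsum("bnd,cd->bnc", q, t) / temperature
    return logits  # (B, N, C) class scores per mask
```

Swapping in a new vocabulary only changes `text_embed`; the visual side is untouched, which is why open-vocabulary inference adds no extra cost.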