5 Apr 2020 | Changqian Yu, Changxin Gao, Jingbo Wang, Gang Yu, Chunhua Shen, Nong Sang
BiSeNet V2 is a bilateral network for real-time semantic segmentation that handles spatial details and categorical semantics separately to achieve both high accuracy and high efficiency. The architecture consists of a Detail Branch, with wide channels and shallow layers, that captures low-level details, and a Semantic Branch, with narrow channels and deep layers, that captures high-level semantics. The Semantic Branch is lightweight because of its reduced channel capacity and fast downsampling. A Guided Aggregation Layer enhances the mutual connections between the two branches, and a booster training strategy improves segmentation performance at no extra inference cost.
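To make the "wide and shallow" versus "narrow and deep" contrast concrete, the following is a minimal sketch that compares the rough compute cost of two branches. The stage widths, layer counts, and strides are hypothetical values chosen for illustration, and every layer is assumed to be a plain 3x3 convolution; none of this is specified in the summary, so the sketch should not be read as the authors' exact configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    out_channels: int  # channel width produced by the stage
    num_layers: int    # number of 3x3 convolutions in the stage (assumed)
    stride: int        # downsampling factor applied on entry to the stage


# Hypothetical configurations, for illustration only.
# Detail Branch: wide channels, few stages, modest downsampling.
DETAIL_BRANCH = [Stage(64, 2, 2), Stage(64, 3, 2), Stage(128, 3, 2)]
# Semantic Branch: narrow channels, more stages, fast downsampling.
SEMANTIC_BRANCH = [Stage(16, 1, 4), Stage(32, 2, 2), Stage(64, 2, 2), Stage(128, 4, 2)]


def output_stride(stages):
    """Total downsampling factor of a branch relative to the input."""
    total = 1
    for stage in stages:
        total *= stage.stride
    return total


def conv_macs(stages, height, width, in_channels=3):
    """Multiply-accumulate count, treating every layer as a 3x3 convolution."""
    total = 0
    c_in, h, w = in_channels, height, width
    for stage in stages:
        # The stage downsamples once, then all of its layers run at that size.
        h, w = h // stage.stride, w // stage.stride
        for _ in range(stage.num_layers):
            total += 3 * 3 * c_in * stage.out_channels * h * w
            c_in = stage.out_channels
    return total


if __name__ == "__main__":
    for name, branch in [("detail", DETAIL_BRANCH), ("semantic", SEMANTIC_BRANCH)]:
        macs = conv_macs(branch, 512, 1024)
        print(f"{name}: output stride {output_stride(branch)}, {macs / 1e9:.2f} GMACs")
```

Under these assumptions the narrow, quickly downsampled branch costs a small fraction of the wide one even though it has more stages, which is the intuition behind calling the Semantic Branch lightweight.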
The proposed architecture outperforms state-of-the-art real-time methods on the Cityscapes dataset, achieving 72.6% mean IoU at 156 FPS on an NVIDIA GeForce GTX 1080 Ti card. It is also effective on the CamVid and COCO-Stuff datasets. The method balances accuracy and speed by treating spatial details and semantics separately in a two-pathway structure built from efficient components. The architecture is compatible with various lightweight models and can be generalized to larger models. Experimental results show that BiSeNet V2 achieves state-of-the-art performance among real-time methods on multiple benchmarks.
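The two headline numbers can be read more easily with a small worked example. The sketch below shows the usual definition of mean IoU computed from a class confusion matrix and converts a frame rate into per-frame latency. The function names and the toy confusion matrix are assumptions made for illustration; they are not the evaluation code used for the reported results.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i and
    whose predicted class is j.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes that never appear in labels or predictions
            ious.append(tp / denom)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)


def latency_ms(fps):
    """Average time per frame, in milliseconds, for a given frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    toy = [[8, 2],
           [1, 9]]  # hypothetical two-class confusion matrix
    print(f"mean IoU: {mean_iou(toy):.4f}")
    print(f"156 FPS is about {latency_ms(156):.2f} ms per frame")
```

At 156 FPS the network has roughly 6.4 ms to process each frame, which gives a sense of how tight the real-time budget is.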