28 May 2024 | Shiyu Qin1, Jinpeng Wang1, Yimin Zhou2, Bin Chen2,5,6, Tianci Luo2, Baoyi An3, Tao Dai4, Shutao Xia1, Yaowei Wang5
MambaVC: Learned Visual Compression with Selective State Spaces
**Authors:** Shiyu Qin, Jinpeng Wang, Yimin Zhou, Bin Chen, Tianci Luo, Baoyi An, Tao Dai, Shutao Xia, Yaowei Wang
**Institution:** Tsinghua Shenzhen International Graduate School, Tsinghua University; Harbin Institute of Technology, Shenzhen; Huawei Technologies Company Ltd.; Shenzhen University; Peng Cheng Laboratory; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
**Abstract:**
Learned visual compression is a significant and active area in multimedia processing. Existing methods, primarily based on CNNs and Transformers, have explored various designs to model content distribution and eliminate redundancy. However, balancing efficacy (rate-distortion trade-off) and efficiency remains a challenge. State-space models (SSMs) have shown promise due to their long-range modeling capacity and efficiency. Inspired by this, MambaVC is introduced, a simple, strong, and efficient compression network based on SSMs. MambaVC introduces a visual state space (VSS) block with a 2D selective scanning (2DSS) module as the nonlinear activation function after each downsampling, enhancing global context modeling and compression efficiency. On benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads. Specifically, it outperforms CNN and Transformer variants by 9.3% and 15.6% on the Kodak dataset, respectively, while reducing computation by 42% and 24%, and saving 12% and 71% of memory. MambaVC also demonstrates even greater improvements with high-resolution images, highlighting its potential and scalability in real-world applications.
**Contributions:**
- Develop MambaVC, the first visual compression network with selective state spaces.
- Extensive experiments show superior performance and competitive efficiency on image and video compression.
- Highlight MambaVC's effectiveness and scalability in high-resolution compression.
- Provide a comprehensive comparison of different network designs, emphasizing MambaVC's advantages.
**Methods:**
- **Preliminaries:** State-space models (SSMs) map an input sequence to an output through a hidden state governed by linear ordinary differential equations (ODEs), which are discretized for computation.
- **MambaVC Architecture:** MambaVC uses a VSS block with 2DSS for spatial modeling, improving global context modeling and compression efficiency.
- **2D Selective Scan (2DSS):** Unfolds the 2D feature map along four scan paths for selective scanning, enriching spatial context modeling.
- **Extension to Video Compression:** MambaVC is extended to video compression as MambaVC-SSF, demonstrating the approach's potential in this domain.
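The preliminaries above follow the standard linear SSM formulation from the S4/Mamba line of work; the exact parameterization in MambaVC may differ, so the equations below are a generic sketch:

```latex
% Continuous-time linear SSM: hidden state h(t), input x(t), output y(t)
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)

% Zero-order-hold discretization with step size \Delta
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B

% Resulting recurrence over a discrete token sequence x_1, x_2, \dots
h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k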
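As a rough illustration of the 2DSS unfolding, the sketch below flattens a feature map into four scan orders (row-major, column-major, and their reverses) and merges the folded views by summation. Function names and the summation merge are assumptions for illustration; in the actual module, a selective-scan SSM processes each sequence before merging.

```python
import numpy as np

def unfold_4ways(x):
    """Unfold an (H, W, C) feature map into four 1D scan orders:
    row-major, column-major, and their reverses (illustrative sketch)."""
    H, W, C = x.shape
    rowwise = x.reshape(H * W, C)                     # left-to-right, top-to-bottom
    colwise = x.transpose(1, 0, 2).reshape(H * W, C)  # top-to-bottom, left-to-right
    return [rowwise, colwise, rowwise[::-1], colwise[::-1]]

def fold_and_merge(seqs, H, W, C):
    """Invert each scan order back to (H, W, C) and sum the four views,
    so every position aggregates context from all four directions."""
    rowwise = seqs[0].reshape(H, W, C)
    colwise = seqs[1].reshape(W, H, C).transpose(1, 0, 2)
    rev_row = seqs[2][::-1].reshape(H, W, C)
    rev_col = seqs[3][::-1].reshape(W, H, C).transpose(1, 0, 2)
    return rowwise + colwise + rev_row + rev_col

# With identity scans (no SSM applied), merging the four views
# reconstructs the input four times over.
x = np.arange(2 * 3 * 1, dtype=float).reshape(2, 3, 1)
merged = fold_and_merge(unfold_4ways(x), 2, 3, 1)
```

The unfold/fold pair matters because a 1D selective scan is causal: scanning in four directions lets each spatial position receive context from all sides once the views are merged.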
**Experiments:**
- **Image Compression:** MambaVC outperforms state-of-the-art methods and variants in rate-distortion performance, with lower computational and memory overheads.
- **High-Resolution Image Compression:** MambaVC shows even greater advantages on high-resolution images, underscoring its scalability.