28 May 2024 | Shiyu Qin1, Jinpeng Wang1, Yimin Zhou2, Bin Chen2,5,6, Tianci Luo2, Baoyi An3, Tao Dai4, Shutao Xia1, Yaowei Wang5
MambaVC: Learned Visual Compression with Selective State Spaces
**Authors:** Shiyu Qin, Jinpeng Wang, Yimin Zhou, Bin Chen, Tianci Luo, Baoyi An, Tao Dai, Shutao Xia, Yaowei Wang
**Institution:** Tsinghua Shenzhen International Graduate School, Tsinghua University; Harbin Institute of Technology, Shenzhen; Huawei Technologies Company Ltd.; Shenzhen University; Peng Cheng Laboratory; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
**Abstract:**
Learned visual compression is a significant and active area in multimedia processing. Existing methods, primarily based on CNNs and Transformers, have explored various designs to model content distribution and eliminate redundancy. However, balancing efficacy (rate-distortion trade-off) and efficiency remains a challenge. State-space models (SSMs) have shown promise due to their long-range modeling capacity and efficiency. Inspired by this, MambaVC is introduced, a simple, strong, and efficient compression network based on SSMs. MambaVC introduces a visual state space (VSS) block with a 2D selective scanning (2DSS) module as the nonlinear activation function after each downsampling, enhancing global context modeling and compression efficiency. On benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads. Specifically, it outperforms CNN and Transformer variants by 9.3% and 15.6% on the Kodak dataset, respectively, while reducing computation by 42% and 24%, and saving 12% and 71% of memory. MambaVC also demonstrates even greater improvements with high-resolution images, highlighting its potential and scalability in real-world applications.
**Contributions:**
- Develop MambaVC, the first visual compression network with selective state spaces.
- Extensive experiments show superior performance and competitive efficiency on image and video compression.
- Highlight MambaVC's effectiveness and scalability in high-resolution compression.
- Provide a comprehensive comparison of different network designs, emphasizing MambaVC's advantages.
**Methods:**
- **Preliminaries:** State-space models (SSMs) map an input sequence to an output through a hidden state governed by linear ordinary differential equations (ODEs), which are discretized for computation.
- **MambaVC Architecture:** MambaVC uses a VSS block with 2DSS for spatial modeling, improving global context modeling and compression efficiency.
- **2D Selective Scan (2DSS):** Unfolds the 2D feature map along four scan paths for selective scanning, enriching spatial context modeling.
- **Extension to Video Compression:** MambaVC is extended to video compression as MambaVC-SSF, demonstrating the approach's potential in this domain.
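The preliminaries above follow the standard linear SSM formulation from the S4/Mamba line of work; the exact parameterization in MambaVC may differ, so the equations below are a generic sketch:

```latex
% Continuous-time linear SSM: hidden state h(t), input x(t), output y(t)
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)

% Zero-order-hold discretization with step size \Delta
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B

% Resulting recurrence over a discrete token sequence x_1, x_2, \dots
h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k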
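As a rough illustration of the 2DSS unfolding, the sketch below flattens a feature map into four scan orders (row-major, column-major, and their reverses) and merges the folded views by summation. Function names and the summation merge are assumptions for illustration; in the actual module, a selective-scan SSM processes each sequence before merging.

```python
import numpy as np

def unfold_4ways(x):
    """Unfold an (H, W, C) feature map into four 1D scan orders:
    row-major, column-major, and their reverses (illustrative sketch)."""
    H, W, C = x.shape
    rowwise = x.reshape(H * W, C)                     # left-to-right, top-to-bottom
    colwise = x.transpose(1, 0, 2).reshape(H * W, C)  # top-to-bottom, left-to-right
    return [rowwise, colwise, rowwise[::-1], colwise[::-1]]

def fold_and_merge(seqs, H, W, C):
    """Invert each scan order back to (H, W, C) and sum the four views,
    so every position aggregates context from all four directions."""
    rowwise = seqs[0].reshape(H, W, C)
    colwise = seqs[1].reshape(W, H, C).transpose(1, 0, 2)
    rev_row = seqs[2][::-1].reshape(H, W, C)
    rev_col = seqs[3][::-1].reshape(W, H, C).transpose(1, 0, 2)
    return rowwise + colwise + rev_row + rev_col

# With identity scans (no SSM applied), merging the four views
# reconstructs the input four times over.
x = np.arange(2 * 3 * 1, dtype=float).reshape(2, 3, 1)
merged = fold_and_merge(unfold_4ways(x), 2, 3, 1)
```

The unfold/fold pair matters because a 1D selective scan is causal: scanning in four directions lets each spatial position receive context from all sides once the views are merged.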
**Experiments:**
- **Image Compression:** MambaVC outperforms state-of-the-art methods and variants in rate-distortion performance, with lower computational and memory overheads.
- **High-Resolution Image Compression:** MambaVC shows even greater advantages on high-resolution images, underscoring its scalability.