July 14–18, 2024 | Junchen Fu, Xuri Ge†, Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Jie Wang, Joemon M. Jose
The paper "IISAN: Efficiently Adapting Multimodal Representation for Sequential Recommendation with Decoupled PEFT" introduces IISAN (Intra- and Inter-modal Side Adapted Network), a novel architecture designed to efficiently adapt multimodal foundation models for sequential recommendation tasks. IISAN leverages decoupled parameter-efficient fine-tuning (DPEFT) to reduce the computational graph and GPU memory usage while maintaining performance comparable to full fine-tuning (FFT). The key contributions of the paper include:
1. **IISAN Architecture**: IISAN decouples the trainable parameters into intra-modal side-adapted networks (one per modality) and an inter-modal side-adapted network that fuses them, allowing for efficient adaptation of multimodal representations. Because the backbone stays frozen, its hidden states can additionally be cached to further enhance efficiency (see the sketch after this list).
2. **TPME Metric**: A new composite efficiency metric, TPME (Training-time, Parameter, and GPU Memory Efficiency), is proposed to evaluate practical efficiency by jointly scoring training time, trainable parameter count, and GPU memory usage, so that different methods can be compared on a single scale (an illustrative computation follows the list).
3. **Performance and Efficiency Analysis**: Extensive experiments on three multimodal recommendation datasets show that IISAN matches or exceeds the performance of both FFT and state-of-the-art PEFT methods while being far more efficient, cutting training time and GPU memory usage by up to 94% and 93%, respectively, when combined with the caching strategy.
4. **Robustness and Ablation Studies**: IISAN remains robust across different multimodal encoders, and ablation studies confirm that both the intra- and inter-modal components contribute to its performance.
5. **Multimodality vs. Unimodality**: Experiments comparing multimodal and unimodal settings show that the multimodal setting is consistently stronger, with IISAN achieving superior performance through its ability to integrate the text and image modalities effectively.
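To make the decoupling in contribution 1 concrete, here is a minimal PyTorch sketch of the idea: the per-layer hidden states of a frozen backbone feed a stack of small trainable side blocks, so gradients never flow through the backbone itself. All module names (`SideBlock`, `IntraModalSAN`, `InterModalSAN`, `cache_hidden_states`) and shapes are hypothetical illustrations, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SideBlock(nn.Module):
    """One trainable side block: a bottleneck MLP plus a learnable gate
    that blends the running side state with the backbone's hidden state
    at the same depth. (Hypothetical design, for illustration only.)"""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5

    def forward(self, side_state, backbone_hidden):
        g = torch.sigmoid(self.gate)
        return self.adapter(g * side_state + (1 - g) * backbone_hidden)

class IntraModalSAN(nn.Module):
    """Intra-modal side-adapted network for one modality. It consumes the
    layer-wise hidden states of a *frozen* encoder; since no trainable
    parameter sits inside that encoder, backprop skips it entirely,
    which is what shrinks the computation graph and GPU memory."""
    def __init__(self, num_layers, dim):
        super().__init__()
        self.blocks = nn.ModuleList(SideBlock(dim) for _ in range(num_layers))

    def forward(self, hidden_states):  # list of [batch, dim] tensors
        side = hidden_states[0]
        for block, h in zip(self.blocks, hidden_states):
            side = block(side, h)
        return side

class InterModalSAN(nn.Module):
    """Inter-modal side network: fuses the two intra-modal outputs."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_repr, image_repr):
        return self.fuse(torch.cat([text_repr, image_repr], dim=-1))

def cache_hidden_states(frozen_encoder, batch):
    """Caching strategy: the frozen backbone's outputs depend only on the
    item, so they can be computed once (under no_grad) and reused every
    epoch. Assumes a HuggingFace-style encoder with output_hidden_states."""
    with torch.no_grad():
        out = frozen_encoder(**batch, output_hidden_states=True)
    return [h.mean(dim=1) for h in out.hidden_states]  # pooled to [batch, dim]
```

In this sketch only the side blocks and the fusion layer receive gradients; with cached hidden states, each training step reduces to a pass through a few small MLPs rather than two large transformer towers.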
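For contribution 2, the snippet below shows one plausible way such a composite score can be computed: each cost is normalized against a reference method (e.g., FFT) and the three terms are combined with weights. The equal weights and all numbers here are illustrative assumptions, not the paper's exact formulation or measurements.

```python
def tpme(time_s, params_m, mem_gb, ref, weights=(1/3, 1/3, 1/3)):
    """Illustrative composite efficiency score (lower = more efficient).
    Each cost is taken relative to a reference method such as FFT; the
    equal weights are an assumption, not the paper's choice."""
    w_t, w_p, w_m = weights
    return (w_t * time_s / ref["time_s"]
            + w_p * params_m / ref["params_m"]
            + w_m * mem_gb / ref["mem_gb"])

# Illustrative numbers only: FFT as the reference vs. an IISAN-like method.
fft = {"time_s": 443.0, "params_m": 300.0, "mem_gb": 47.0}
print(f"FFT   TPME: {tpme(443.0, 300.0, 47.0, ref=fft):.2f}")  # 1.00
print(f"IISAN TPME: {tpme(22.0, 5.0, 3.0, ref=fft):.2f}")      # ~0.04
```

A lower TPME means a method is cheaper across all three axes at once, so a method can win on overall practical efficiency even without being the single best on any one axis.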
The paper addresses the limitations of existing PEFT methods, particularly in terms of GPU memory and training speed, and provides a comprehensive solution for the efficient and effective adaptation of multimodal models in sequential recommendation tasks.