7 Jun 2024 | Hongyu Li, Liang Ding*, Meng Fang, Dacheng Tao
This paper investigates catastrophic forgetting (CF) in large language models (LLMs) during fine-tuning and reveals a direct link between CF and the flatness of the model's loss landscape (LLS). Building on this observation, the study introduces Sharpness-Aware Minimization (SAM), an optimization technique that flattens the LLS, as a way to mitigate CF. Experiments on three widely used fine-tuning datasets demonstrate that SAM effectively reduces CF, improving performance across model sizes and tasks and outperforming other anti-forgetting methods such as Wise-FT and Rehearsal. The study also establishes a positive correlation between the sharpness of the LLS and the severity of CF: flatter landscapes lead to less forgetting. Because SAM is orthogonal to existing anti-forgetting strategies, it can be combined with them for incremental gains, suggesting it could become a standard component of LLM fine-tuning pipelines. The paper also discusses limitations, including its focus on a specific aspect of CF and the possibility that the approach applies only to certain phases of the LLM lifecycle. Ethical considerations are addressed, and the research is reproducible, with detailed experimental setups and code provided.
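To make the mechanism concrete, below is a minimal PyTorch sketch of the generic two-pass SAM update (Foret et al., 2021), the technique the paper applies; this is an illustration, not the authors' implementation. The `loss_fn` closure, the choice of base optimizer, and the neighbourhood radius `rho=0.05` are assumptions made for the example.

```python
import torch

def sam_step(model, loss_fn, base_optimizer, rho=0.05):
    """One generic SAM update. `loss_fn` is an assumed closure that runs a
    forward pass on the current batch and returns the scalar loss."""
    # First pass: gradient at the current weights.
    loss_fn().backward()

    # Perturb weights toward the (approximate) worst case within an
    # L2 ball of radius rho: e = rho * g / ||g||.
    grads = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in grads]), p=2)
    scale = rho / (grad_norm + 1e-12)
    perturbations = []
    with torch.no_grad():
        for p in grads:
            e = p.grad * scale
            p.add_(e)                     # move into the sharp neighbourhood
            perturbations.append((p, e))

    base_optimizer.zero_grad()

    # Second pass: gradient at the perturbed weights (the sharpness-aware gradient).
    loss_fn().backward()

    # Restore the original weights, then step with the base optimizer
    # using the gradient computed at the perturbed point.
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
```

In a training loop, `sam_step` would be called once per batch, e.g. with AdamW as the base optimizer; the main cost of SAM is that each update requires two forward-backward passes.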