Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

20 Jun 2024 | Hasan Abed Al Kader Hammoud, Umberto Michieli, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem, Mete Ozay
This paper investigates the impact of model merging on safety alignment in large language models (LLMs). While merging multiple expert LLMs can combine their domain expertise into a single model, existing merging techniques often fail to preserve safety alignment, producing misaligned merged models. Since alignment is critical for the safe deployment of LLMs, the authors propose a safety-aware merging approach that incorporates safety alignment data directly into the merging process.

The approach has two steps: (1) generate synthetic safety data and domain-specific data, and (2) incorporate these generated data into the optimization objective of existing data-aware model merging techniques. In this way, safety alignment is treated as a skill in its own right, on par with domain expertise, and is explicitly maximized in the resulting merged LLM.

Evaluated on several benchmarks and across a range of merging conditions, the safety-aware method significantly improves alignment without compromising domain accuracy, and it outperforms existing merging methods on both alignment and domain performance.

The authors also discuss the limitations of the approach: it assumes that at least one model in the merging pool is sufficiently aligned, and it imposes restrictions on model architectures and prompt templates. Despite these limitations, they argue that the work opens a new research direction at the intersection of model merging and safety alignment.
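To make the second step concrete, the sketch below shows one way a data-aware merge could fold safety data into its objective: task-arithmetic merging coefficients are searched so that the merged model minimizes language-modeling loss on both synthetic domain batches and synthetic safety batches (e.g., refusals paired with unsafe prompts). This is a minimal illustration under stated assumptions, not the authors' released implementation; the model IDs, the pre-tokenized batches, and the random search (standing in for the optimizer of a real data-aware merging technique) are all hypothetical.

```python
# Minimal sketch of safety-aware, data-driven merging (illustrative only).
# Assumes all models share one architecture, matching the paper's stated
# restriction; checkpoint names and data batches are hypothetical.
import torch
from transformers import AutoModelForCausalLM

BASE_ID = "org/base-llm"                          # hypothetical model IDs
EXPERT_IDS = ["org/expert-med", "org/expert-code"]

def task_vector(base_sd, expert_sd):
    # Task vector: expert weights minus base weights, per parameter.
    return {k: expert_sd[k] - base_sd[k] for k in base_sd}

def merged_state_dict(base_sd, task_vectors, alphas):
    # Weighted task arithmetic: theta = theta_base + sum_i alpha_i * tau_i.
    merged = {k: v.clone() for k, v in base_sd.items()}
    for alpha, tv in zip(alphas, task_vectors):
        for k in merged:
            if merged[k].is_floating_point():  # skip integer buffers
                merged[k] += alpha * tv[k]
    return merged

@torch.no_grad()
def avg_lm_loss(model, batches):
    # Mean causal-LM loss over pre-tokenized batches (labels = input_ids).
    return sum(model(**b, labels=b["input_ids"]).loss.item()
               for b in batches) / len(batches)

def objective(alphas, model, base_sd, tvs, domain_batches, safety_batches, lam=1.0):
    # The crux of the method: the merging objective is evaluated on BOTH
    # synthetic domain data and synthetic safety data, so alignment is
    # optimized like any other skill.
    model.load_state_dict(merged_state_dict(base_sd, tvs, alphas))
    return avg_lm_loss(model, domain_batches) + lam * avg_lm_loss(model, safety_batches)

def search_alphas(model, base_sd, tvs, domain_batches, safety_batches, trials=50):
    # Black-box random search over merging coefficients; a real data-aware
    # merger would use a stronger optimizer, but the objective is the point.
    best, best_loss = None, float("inf")
    for _ in range(trials):
        alphas = torch.rand(len(tvs)).tolist()
        loss = objective(alphas, model, base_sd, tvs, domain_batches, safety_batches)
        if loss < best_loss:
            best, best_loss = alphas, loss
    return best

# Usage: clone the base state dict so merging never mutates it in place.
base = AutoModelForCausalLM.from_pretrained(BASE_ID)
base_sd = {k: v.clone() for k, v in base.state_dict().items()}
tvs = [task_vector(base_sd, AutoModelForCausalLM.from_pretrained(e).state_dict())
       for e in EXPERT_IDS]
# domain_batches / safety_batches: lists of tokenizer(...) output dicts
# built from the synthetic data of step (1).
```

The `lam` knob trades alignment against domain loss; the essential design choice, per the paper, is simply that safety data enters the merging objective at all, rather than alignment being left to survive the merge by accident.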