2024 | Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, Shuming Shi
This paper introduces FUSELLM, an approach to knowledge fusion of large language models (LLMs) that aims to combine the capabilities of multiple LLMs into a single, more powerful model. Unlike methods that rely on direct parameter merging or ensembling, FUSELLM uses the probabilistic distributions generated by the source LLMs to externalize their knowledge and transfer it to a target LLM through lightweight continual training. By aligning the tokenizations of the different LLMs and fusing their probabilistic distributions, FUSELLM integrates the strengths of multiple models, improving performance on reasoning, commonsense, and code-generation tasks.
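Because the source LLMs use different tokenizers, their per-token distributions do not line up one-to-one and must first be mapped onto common positions. The sketch below shows one way such an alignment could be computed with a standard minimum-edit-distance dynamic program over token strings; the function name and the pairing rule are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch (assumption: a MinED-style pairing) of aligning two token
# sequences produced by different tokenizers over the same text.
from typing import List, Tuple

def align_tokens(src: List[str], tgt: List[str]) -> List[Tuple[int, int]]:
    """Return (src_idx, tgt_idx) pairs matched by an edit-distance alignment."""
    n, m = len(src), len(tgt)
    # dp[i][j] = edit distance between src[:i] and tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # drop a src token
                           dp[i][j - 1] + 1,        # drop a tgt token
                           dp[i - 1][j - 1] + sub)  # match / substitute
    # Backtrack to recover which positions were paired.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        sub = 0 if src[i - 1] == tgt[j - 1] else 1
        if dp[i][j] == dp[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

# Example: the same string segmented differently by two tokenizers.
print(align_tokens(["Fuse", "LL", "M"], ["Fu", "se", "LLM"]))  # [(0, 0), (1, 1), (2, 2)]
```

Each aligned pair identifies positions whose distribution rows can be mapped across vocabularies; positions left unmatched would need a fallback, for example keeping the target model's own distribution.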
The key idea of FUSELLM is to use the generative distributions of source LLMs to create a unified probabilistic representation that captures the collective knowledge and unique strengths of the source models. This approach minimizes the divergence between the target LLM's probabilistic distributions and those of the source LLMs, enabling the target model to benefit from the combined knowledge of multiple models. The method is validated using three popular LLMs—Llama-2, MPT, and OpenLLaMA—across various benchmarks and tasks, demonstrating that the fused model outperforms individual source models and baseline methods in most tasks.
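Concretely, this amounts to a continual-training objective that mixes the usual causal-LM loss with a divergence between the target's distribution and the fused one. The PyTorch sketch below is one way to write that objective under stated assumptions; the KL formulation, the weight `lam`, and the function name are illustrative, not the authors' released implementation.

```python
# Minimal sketch (assumptions: KL divergence, lam = 0.9) of a combined
# causal-LM + fusion loss for continually training the target LLM.
import torch
import torch.nn.functional as F

def fusion_training_loss(target_logits: torch.Tensor,
                         fused_probs: torch.Tensor,
                         labels: torch.Tensor,
                         lam: float = 0.9) -> torch.Tensor:
    """
    target_logits: (batch, seq_len, vocab) logits from the target LLM.
    fused_probs:   (batch, seq_len, vocab) fused source distribution,
                   already aligned to the target vocabulary.
    labels:        (batch, seq_len) gold next-token ids for the CLM term.
    lam:           weight on the causal-LM term (a hyperparameter).
    """
    vocab = target_logits.size(-1)

    # Standard causal language-modeling loss on the gold tokens.
    clm = F.cross_entropy(target_logits.reshape(-1, vocab), labels.reshape(-1))

    # Divergence between the target's distribution and the fused distribution.
    log_q = F.log_softmax(target_logits, dim=-1)
    fusion = F.kl_div(log_q, fused_probs, reduction="batchmean")

    return lam * clm + (1.0 - lam) * fusion
```

How the divergence is instantiated and how `lam` is set are modeling choices that trade off fidelity to the original language-modeling objective against matching the fused source knowledge.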
The implementation of FUSELLM involves token alignment strategies to ensure proper mapping of probabilistic distribution matrices across different LLMs, followed by fusion functions that combine these distributions. Two fusion functions are introduced: MinCE, which selects the distribution matrix with the minimum cross-entropy score, and AvgCE, which produces a weighted average of the distribution matrices based on cross-entropy scores. The results show that FUSELLM consistently outperforms traditional ensemble and weight merging methods, particularly in tasks requiring complex reasoning and code generation.
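The two fusion functions can be made concrete with a short sketch. Each source model contributes an aligned per-token distribution matrix; its cross-entropy against the gold tokens either selects a single matrix (MinCE) or weights an average of them (AvgCE). The helper names and the softmax-of-negative-CE weighting below are illustrative assumptions rather than the paper's exact code.

```python
# Minimal sketch of the MinCE and AvgCE fusion functions described above.
from typing import List
import torch
import torch.nn.functional as F

def _ce_score(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean cross-entropy of a (seq_len, vocab) probability matrix vs. gold ids."""
    return F.nll_loss(torch.log(probs + 1e-12), labels)

def fuse_min_ce(matrices: List[torch.Tensor], labels: torch.Tensor) -> torch.Tensor:
    """MinCE: keep the source distribution matrix with the lowest cross-entropy."""
    scores = torch.stack([_ce_score(p, labels) for p in matrices])
    return matrices[int(torch.argmin(scores))]

def fuse_avg_ce(matrices: List[torch.Tensor], labels: torch.Tensor) -> torch.Tensor:
    """AvgCE: cross-entropy-weighted average (lower CE gets a larger weight;
    the softmax-of-negative-CE weighting is an assumption for illustration)."""
    scores = torch.stack([_ce_score(p, labels) for p in matrices])
    weights = torch.softmax(-scores, dim=0)          # (num_sources,)
    stacked = torch.stack(matrices)                  # (num_sources, seq_len, vocab)
    return (weights.view(-1, 1, 1) * stacked).sum(dim=0)
```

In training, the matrix returned by either function would play the role of the fused distribution that the target model is pulled toward in the objective sketched earlier.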
The study also compares FUSELLM with knowledge distillation and ensemble methods, underscoring its effectiveness at leveraging the collective knowledge of the source models. Given the diverse architectures and substantial sizes of modern LLMs, the findings position FUSELLM as a promising approach to LLM fusion: it delivers consistent gains across benchmarks and yields a single unified model that combines the strengths of multiple LLMs.