Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration


30 May 2024 | Yichong Huang, Xiaocheng Feng, Baohang Li, Yang Xiang, Hui Wang, Ting Liu, Bing Qin
This paper proposes DEEPEN, a training-free ensemble framework for heterogeneous large language models (LLMs) that fuses the probability distributions of different LLMs at each decoding step. The main challenge is the vocabulary discrepancy between LLMs, which makes direct distribution averaging infeasible due to token misalignment. To address this, DEEPEN maps each model's probability distribution into a universal relative space built on relative representation theory, aggregates the distributions there, and then transforms the result back into the probability space of one of the LLMs (the main model) to determine the next token.

DEEPEN relies on relative representations, which are largely invariant across models, to align output semantics that live in different absolute embedding spaces. For each model, the framework constructs a relative transformation matrix by comparing every token embedding with the embeddings of a shared set of anchor tokens and normalizing the resulting rows with softmax.
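The following sketch illustrates how such a relative transformation matrix could be built. It is a reconstruction from the description above, not the authors' released code; the function name, tensor shapes, and the use of cosine similarity as the comparison function are assumptions made for illustration.

```python
# Sketch: relative transformation matrix for one LLM (hypothetical helper).
import torch
import torch.nn.functional as F

def relative_transformation_matrix(embeddings: torch.Tensor,
                                    anchor_ids: torch.Tensor) -> torch.Tensor:
    """Map each token's absolute embedding to a relative representation.

    embeddings: (vocab_size, hidden_dim) input embedding table of one LLM.
    anchor_ids: (num_anchors,) indices of anchor tokens shared across models.
    Returns a (vocab_size, num_anchors) matrix whose rows are softmax-normalized
    similarities between each token embedding and the anchor embeddings.
    """
    anchors = embeddings[anchor_ids]            # (num_anchors, hidden_dim)
    sims = F.cosine_similarity(
        embeddings.unsqueeze(1),                # (vocab, 1, dim)
        anchors.unsqueeze(0),                   # (1, anchors, dim)
        dim=-1,
    )                                           # (vocab, num_anchors)
    return F.softmax(sims, dim=-1)              # row-wise normalization
```

Because each row is expressed only through similarities to the shared anchors, the resulting representations of two models with different vocabularies and hidden sizes end up in the same space and can be compared directly.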
At each decoding step, the probability distribution of every model is transformed into the relative space, the transformed distributions are aggregated, and the aggregate is mapped back into the main model's probability space to pick the next token.

Experiments on six benchmarks show that DEEPEN consistently improves performance across tasks covering subject examination, reasoning, and knowledge, and that it is complementary to other ensemble methods such as voting. DEEPEN is more stable than the baselines and outperforms them in most settings. The framework is evaluated on ensembles of varying sizes, on dense and sparse models, and on combinations of general LLMs with specialist models, performing well across this wide range of models and tasks. By aligning tokens through relative representations, it also handles vocabulary discrepancies effectively.

The paper further analyzes the impact of anchor selection, the normalization of relative representations, and the number of relative ensemble learning steps. Increasing the number of anchor words and tuning the relative ensemble learning rate improve performance, whereas too many learning steps degrade it. Compared with other ensemble methods such as LLM-BLENDER, VOTING, and MBR, DEEPEN performs best in most cases. A latency evaluation shows that the fusion adds inference time, but the overhead can be reduced by skipping the fusion step at many decoding positions.

Overall, DEEPEN provides a training-free, effective method for ensembling heterogeneous LLMs, demonstrating complementary strengths and stable performance across various tasks and models.
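As a rough illustration of one fused decoding step, the sketch below projects two models' next-token distributions into the relative space, averages them, and searches for main-model logits whose relative representation matches the aggregate. The gradient-based inverse mapping, the squared-error objective, and the hyperparameters num_steps and eta are assumptions standing in for the paper's relative ensemble learning procedure; this is not the official implementation.

```python
# Sketch: one decoding step of distribution fusion for two models
# (reconstruction under stated assumptions, not DEEPEN's released code).
import torch
import torch.nn.functional as F

def fuse_step(main_logits, assist_logits, R_main, R_assist,
              num_steps: int = 5, eta: float = 0.1) -> torch.Tensor:
    """Aggregate two models' next-token distributions in the relative space
    and map the result back to the main model's vocabulary."""
    p_main = F.softmax(main_logits, dim=-1)      # (vocab_main,)
    p_assist = F.softmax(assist_logits, dim=-1)  # (vocab_assist,)

    # 1) Absolute -> relative: project each distribution onto the anchors.
    r_main = p_main @ R_main                     # (num_anchors,)
    r_assist = p_assist @ R_assist               # (num_anchors,)

    # 2) Aggregate in the relative space (simple average here).
    r_target = (r_main + r_assist) / 2

    # 3) Relative -> absolute: search for main-model logits whose relative
    #    representation matches the aggregate (illustrative MSE objective).
    logits = main_logits.clone().detach().requires_grad_(True)
    for _ in range(num_steps):
        r_hat = F.softmax(logits, dim=-1) @ R_main
        loss = F.mse_loss(r_hat, r_target)
        grad, = torch.autograd.grad(loss, logits)
        logits = (logits - eta * grad).detach().requires_grad_(True)
    return logits.detach()                       # fused logits for the next token
```

In this framing, eta plays the role of the relative ensemble learning rate and num_steps the number of learning steps analyzed in the paper; running too many steps can overfit the aggregate and, as the paper reports, hurt performance.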