This paper proposes Ensemble via Vocabulary Alignment (EVA), a method that enables fine-grained ensembling of large language models (LLMs) at each generation step. The main challenge it addresses is the vocabulary discrepancy among different LLMs, which has limited previous studies to selecting or blending fully generated outputs. EVA bridges this lexical gap by aligning the models' vocabularies, so that their predictions can be combined token by token.

The method first learns mappings between the vocabularies of different LLMs from their overlapping tokens. These mappings are then used to project each model's output distribution into a unified space, where the distributions can be fused at every decoding step. A filtering strategy further excludes models that generate unfaithful tokens at a given step. Experimental results on commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation show that EVA outperforms both the individual LLMs and previous ensemble methods, and further analyses confirm that it leverages complementary knowledge from the different models and yields consistent improvements.
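To make the alignment step concrete, the following is a minimal sketch of how a cross-vocabulary mapping could be learned from overlapping tokens and turned into a projection matrix. The function names, the reliance on token embeddings, and the orthogonal Procrustes solution are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def learn_alignment(emb_a, emb_b, vocab_a, vocab_b):
    """Fit a linear map from model A's embedding space to model B's on the
    tokens shared by both vocabularies (orthogonal Procrustes solution)."""
    shared = sorted(set(vocab_a) & set(vocab_b))
    X = emb_a[[vocab_a[t] for t in shared]]   # shared tokens, A-side embeddings
    Y = emb_b[[vocab_b[t] for t in shared]]   # shared tokens, B-side embeddings
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U @ Vt                             # d_a x d_b mapping matrix

def build_projection(emb_a, emb_b, W, top_k=1):
    """Build a |V_a| x |V_b| matrix that moves probability mass from each
    A-vocabulary token onto its nearest B-vocabulary token(s)."""
    mapped = emb_a @ W                        # all A tokens mapped into B's space
    a_norm = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    b_norm = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a_norm @ b_norm.T                   # cosine similarity, |V_a| x |V_b|
    proj = np.zeros_like(sim)
    nearest = np.argsort(-sim, axis=1)[:, :top_k]
    np.put_along_axis(proj, nearest, 1.0 / top_k, axis=1)
    return proj                               # each row sums to 1

# Usage: given model A's next-token distribution p_a (shape |V_a|),
# p_a @ build_projection(emb_a, emb_b, W) yields a distribution over V_b.
```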
EVA requires only an additional projection matrix, eliminating the need for extra fusion models or supervised training corpora. The contributions are threefold: a novel LLM ensemble method that operates at the token level during generation, an effective filtering strategy that excludes unfaithful tokens, and empirical evidence of the method's effectiveness and superiority. Evaluated on commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation, EVA significantly improves overall performance across these natural language processing tasks.
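The per-step fusion and filtering can be pictured roughly as below. This is a hypothetical sketch: it assumes each model's distribution has already been projected into a shared vocabulary, and the majority-agreement rule is only a stand-in for the paper's actual criterion for detecting unfaithful tokens.

```python
import numpy as np

def ensemble_step(projected_dists, threshold=0.5):
    """Fuse per-model next-token distributions (already projected into one
    shared vocabulary) and drop models that disagree with the majority."""
    dists = np.stack(projected_dists)            # (n_models, |V|)
    top1 = dists.argmax(axis=1)                  # each model's preferred token
    majority = np.bincount(top1).argmax()        # most common top-1 choice
    # Keep a model if it agrees with the majority token, or at least puts a
    # substantial share of its peak probability on it (stand-in filter).
    kept = [d for d, t in zip(dists, top1)
            if t == majority or d[majority] >= threshold * d[t]]
    fused = np.mean(kept if kept else dists, axis=0)
    return int(fused.argmax())                   # greedy choice of next token
```

Greedy selection is used here only for simplicity; any decoding strategy could operate on the fused distribution instead.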