Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs

20 Feb 2024 | Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo
This paper introduces the Universal Logit Distillation (ULD) loss, a novel approach to knowledge distillation for large language models (LLMs). The ULD loss enables distillation across models with different architectures and tokenizers, overcoming the limitation of traditional logit-based distillation methods, which require the student and teacher models to share the same tokenizer. Grounded in optimal transport, the ULD loss combines a cross-entropy term with a Wasserstein distance term that aligns the probability distributions of the teacher and student models, allowing the student to learn from the teacher's knowledge without a shared vocabulary. Because the Wasserstein distance admits a closed-form solution in this setting, the loss can be computed efficiently. Experimental results across extractive question answering, generative question answering, and summarization show that the ULD loss outperforms traditional logit-based distillation, particularly when the teacher and student models have different architectures and tokenizers.
The ULD loss is also effective when distilling a decoder-only teacher into an encoder-decoder student, demonstrating its versatility across LLM architectures. By reducing model size while maintaining performance, it offers a promising route to deploying LLMs in resource-constrained environments. The paper concludes that the ULD loss is a significant advance in knowledge distillation, providing a flexible and effective method for distilling LLMs across a wide range of architectures and tokenizers.
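To make the construction of the loss concrete, below is a minimal PyTorch sketch of a ULD-style objective as described above: a cross-entropy term on the ground-truth tokens plus a Wasserstein term between the sorted teacher and student probability distributions. The function name uld_loss, the lambda_uld weighting, and the zero-padding used to equalize vocabulary sizes are assumptions of this sketch, not details taken from the paper.

```python
# Minimal sketch of a ULD-style loss, based on the description above.
# Not the authors' reference implementation: the function name, the
# lambda_uld weight, and the zero-padding of the smaller vocabulary
# are assumptions made for illustration.
import torch
import torch.nn.functional as F


def uld_loss(student_logits, teacher_logits, target_ids, lambda_uld=0.1):
    """Cross-entropy on ground-truth tokens plus a closed-form Wasserstein
    term between sorted teacher and student probability distributions.

    student_logits: (batch, seq_len, student_vocab)
    teacher_logits: (batch, seq_len, teacher_vocab)
    target_ids:     (batch, seq_len) ground-truth ids in the student vocab
    """
    # Standard next-token cross-entropy against the ground truth.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        target_ids.reshape(-1),
    )

    # Probabilities over each model's own vocabulary.
    p_student = F.softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)

    # Pad the smaller vocabulary with zero-probability entries so both
    # distributions have the same support size (an assumption of this sketch).
    vocab = max(p_student.size(-1), p_teacher.size(-1))
    p_student = F.pad(p_student, (0, vocab - p_student.size(-1)))
    p_teacher = F.pad(p_teacher, (0, vocab - p_teacher.size(-1)))

    # Closed-form Wasserstein-style term: sort each distribution in
    # decreasing order and sum the element-wise absolute gaps.
    s_sorted, _ = torch.sort(p_student, dim=-1, descending=True)
    t_sorted, _ = torch.sort(p_teacher, dim=-1, descending=True)
    wasserstein = (s_sorted - t_sorted).abs().sum(dim=-1).mean()

    return ce + lambda_uld * wasserstein
```

Because each distribution is sorted before being compared, the distance term does not depend on how either tokenizer indexes its vocabulary, which is what allows the teacher and student to use different tokenizers.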