Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs

20 Feb 2024 | Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo
The paper introduces the Universal Logit Distillation (ULD) loss, a novel method for knowledge distillation (KD) that addresses the limitations of existing methods, particularly those that require shared tokenizers between teacher and student models. ULD loss is grounded in optimal transport theory and is designed to work across different architectures and tokenizers, making it more versatile and applicable to a wider range of large language models (LLMs). The authors demonstrate the effectiveness of ULD loss through extensive experiments on various tasks, including extractive question answering, generative question answering, and summarization. The results show that ULD loss consistently outperforms other methods, achieving better performance with only half the training dataset or student model size. The paper also includes ablation studies and comparisons with other distillation techniques, highlighting the benefits of ULD loss in stabilizing the training process and preventing overfitting. The authors conclude by discussing the limitations of their work, such as the need for further exploration in non-English languages and the potential for bias transfer from the teacher model to the student model.
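To make the cross-tokenizer idea concrete, below is a minimal PyTorch sketch of one way such a tokenizer-agnostic distillation term can be computed: the teacher and student next-token distributions are zero-padded to a common support size, sorted, and compared with an L1 distance, which serves as a tractable closed form for an optimal-transport-style comparison when token identities cannot be matched across vocabularies. This is an illustrative sketch, not the authors' reference implementation; the function name, the truncation used to handle mismatched sequence lengths, and the `lambda_uld` weighting are assumptions introduced here.

```python
import torch
import torch.nn.functional as F


def universal_logit_distillation_loss(student_logits: torch.Tensor,
                                      teacher_logits: torch.Tensor) -> torch.Tensor:
    """Illustrative cross-tokenizer distillation term (not the paper's exact code).

    student_logits: (batch, seq_len_s, vocab_s)
    teacher_logits: (batch, seq_len_t, vocab_t) -- vocabularies and sequence
    lengths may differ because the two models use different tokenizers.
    """
    # Turn logits into probabilities over each model's own vocabulary.
    student_probs = F.softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)

    # Simplifying assumption: truncate both sequences to the shorter one so the
    # per-position distributions can be compared position by position.
    min_len = min(student_probs.size(1), teacher_probs.size(1))
    student_probs = student_probs[:, :min_len]
    teacher_probs = teacher_probs[:, :min_len]

    # Zero-pad the smaller vocabulary so both distributions share a support size.
    vocab_gap = student_probs.size(-1) - teacher_probs.size(-1)
    if vocab_gap > 0:
        teacher_probs = F.pad(teacher_probs, (0, vocab_gap), value=0.0)
    elif vocab_gap < 0:
        student_probs = F.pad(student_probs, (0, -vocab_gap), value=0.0)

    # Sort each distribution in decreasing order: since tokens from different
    # vocabularies cannot be matched by identity, only the shape of the
    # probability mass is compared.
    student_sorted, _ = student_probs.sort(dim=-1, descending=True)
    teacher_sorted, _ = teacher_probs.sort(dim=-1, descending=True)

    # L1 distance between the sorted vectors, averaged over batch and positions.
    return (student_sorted - teacher_sorted).abs().sum(dim=-1).mean()


# Hypothetical combined training objective: standard next-token cross-entropy on
# the student's own labels plus the distillation term, weighted by lambda_uld.
def combined_loss(student_logits, teacher_logits, labels, lambda_uld=0.1):
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten(),
                         ignore_index=-100)
    return ce + lambda_uld * universal_logit_distillation_loss(student_logits,
                                                               teacher_logits)
```

Because the term depends only on the sorted probability values and never on token indices, it can be applied between any teacher-student pair regardless of vocabulary or architecture, which is the property the paper emphasizes.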