PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning


24 Jun 2024 | Gyeongman Kim, Doohyuk Jang, Eunho Yang
**Authors:** Gyeongman Kim, Doohyuk Jang, Eunho Yang
**Institution:** Korea Advanced Institute of Science and Technology (KAIST), AITRICS

**Abstract:** Recent advancements in large language models (LLMs) have raised concerns about inference costs, creating a growing need for model compression. While knowledge distillation (KD) is a prominent compression method, research on KD for generative language models remains limited. PromptKD is a novel method that leverages prompt tuning to transfer student-friendly knowledge from generative teacher models. Unlike previous KD methods that require fine-tuning the entire teacher model, PromptKD achieves a similar effect by adding a small number of prompt tokens and tuning only the prompt with student guidance. Extensive experiments on instruction-following datasets show that PromptKD achieves state-of-the-art performance while adding only 0.0007% of the teacher's parameters as prompts. Further analysis suggests that distilling student-friendly knowledge alleviates exposure bias throughout training, which leads to the performance gains.

**Key Contributions:**
1. **Investigation of Student-Friendly Knowledge:** PromptKD explores the effect of student-friendly knowledge in KD for generation tasks.
2. **First Use of Prompt Tuning in KD:** It is the first method to use prompt tuning in KD, enabling memory-efficient extraction of student-friendly knowledge.
3. **State-of-the-Art Performance:** PromptKD achieves state-of-the-art performance on instruction-following datasets.
4. **Exposure Bias Mitigation:** It shows superior performance in mitigating exposure bias during training.

**Related Work:**
- **KD for Text Classification:** Most KD research focuses on text classification tasks, with methods evolving from simple approaches to more complex ones.
- **KD for Text Generation:** Methods such as supervised KD and SeqKD minimize the distribution discrepancy between teacher and student, but they keep the teacher fixed and do not adapt its knowledge to the student.
- **Prompt Tuning:** Prompt tuning has become a prominent parameter-efficient fine-tuning technique, but it had not previously been used in KD for generative models.

**PromptKD Method:**
- **Instruction-Following Setting:** PromptKD formulates instruction following as a conditional text generation task.
- **Pseudo-Target Generation:** Responses generated by the student are used as pseudo-targets to address exposure bias.
- **Prompt Tuning for Adaptive Teaching:** A soft prompt prepended to the teacher is updated to minimize the KD loss, encouraging the teacher to produce outputs closer to the student's.
- **Student-Friendly Knowledge Distillation:** The prompted teacher then distills student-friendly knowledge to the student by minimizing the distribution discrepancy (a minimal sketch of these two updates follows this list).
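The following is a minimal PyTorch sketch of the two updates described in the method bullets, not the authors' implementation: the toy `TinyLM` models, dimensions, optimizers, and the forward-KL loss are illustrative assumptions, and the random token ids stand in for the student-generated pseudo-targets mentioned above.

```python
# Minimal sketch of PromptKD-style updates (illustrative assumptions:
# toy models, random pseudo-targets, forward KL as the KD loss).
import torch
import torch.nn.functional as F
from torch import nn

VOCAB, D_T, D_S, PROMPT_LEN = 100, 64, 32, 4

class TinyLM(nn.Module):
    """Toy stand-in for a causal LM: embedding, one attention block, LM head."""
    def __init__(self, d_model):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, input_embeds):            # (B, L, D) -> (B, L, VOCAB)
        return self.head(self.block(input_embeds))

teacher, student = TinyLM(D_T), TinyLM(D_S)
for p in teacher.parameters():                  # the teacher itself stays frozen
    p.requires_grad_(False)

# The only teacher-side trainable parameters: a handful of soft prompt tokens.
prompt = nn.Parameter(torch.randn(PROMPT_LEN, D_T) * 0.02)
prompt_opt = torch.optim.Adam([prompt], lr=1e-3)
student_opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def teacher_logits(ids):
    """Teacher distribution conditioned on the learnable soft prompt."""
    tok = teacher.embed(ids)                                    # (B, L, D_T)
    full = torch.cat([prompt.expand(ids.size(0), -1, -1), tok], dim=1)
    return teacher(full)[:, PROMPT_LEN:]                        # drop prompt positions

def kd_loss(t_logits, s_logits):
    """Token-level forward KL(teacher || student); the paper's exact KD
    objective may differ -- this is a common KD choice."""
    t_logp = F.log_softmax(t_logits, dim=-1)
    s_logp = F.log_softmax(s_logits, dim=-1)
    return (t_logp.exp() * (t_logp - s_logp)).sum(-1).mean()

# Stand-in for pseudo-targets sampled from the student given an instruction.
ids = torch.randint(0, VOCAB, (2, 8))

# Step 1 (adaptive teaching): tune only the prompt so the prompted teacher
# moves toward the student's current distribution, i.e. student-friendly knowledge.
prompt_opt.zero_grad()
kd_loss(teacher_logits(ids), student(student.embed(ids)).detach()).backward()
prompt_opt.step()

# Step 2 (distillation): update the student toward the prompted teacher.
student_opt.zero_grad()
kd_loss(teacher_logits(ids).detach(), student(student.embed(ids))).backward()
student_opt.step()
```

Only `prompt` receives gradients in step 1 and only the student in step 2, which mirrors the summary's point that the full teacher is never fine-tuned.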
**Experiments:**
- **Dataset and Models:** PromptKD is evaluated on 5 instruction-following datasets using various models, including GPT-2, OPT, and Llama.
- **Baselines:** Compared with supervised fine-tuning and other KD methods, PromptKD achieves state-of-the-art performance (an illustrative scoring sketch follows below).
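The summary does not state the evaluation metric; ROUGE-L is a common score for instruction-following benchmarks, so the snippet below is a hedged sketch of how generated responses could be compared against references using the `rouge-score` package. The example strings and the averaging scheme are invented for illustration.

```python
# Hedged sketch: scoring generated responses against references with ROUGE-L.
# Requires `pip install rouge-score`; the example strings are invented.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

references = ["Paperclips can hold papers, reset devices, and pick locks."]
predictions = ["A paperclip can hold paper together or reset a device."]

# Score each (reference, prediction) pair and average the F1 values.
scores = [
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
]
print(f"Average ROUGE-L F1: {sum(scores) / len(scores):.4f}")
```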