WHEN SCALING MEETS LLM FINETUNING: THE EFFECT OF DATA, MODEL AND FINETUNING METHOD

27 Feb 2024 | Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat
This paper investigates how different scaling factors affect the performance of large language models (LLMs) during fine-tuning. The authors conduct systematic experiments on how LLM model size, pretraining data size, new fine-tuning parameter size, and fine-tuning data size each impact fine-tuning performance. They compare two families of fine-tuning methods: full-model tuning (FMT) and parameter-efficient tuning (PET), the latter covering prompt tuning and low-rank adaptation (LoRA). Experiments are run on bilingual LLMs ranging from 1B to 16B parameters, evaluated on machine translation and multilingual summarization benchmarks. The main findings are:

1. **Power-based multiplicative joint scaling law**: LLM fine-tuning follows a power-based multiplicative joint scaling law between fine-tuning data size and each of the other scaling factors (a fitting sketch follows this list).
2. **Model scaling helps more than pretraining data scaling**: increasing LLM model size benefits fine-tuning more than increasing the amount of pretraining data.
3. **PET parameter scaling is largely ineffective**: adding more PET parameters yields little gain for either LoRA or prompt tuning, although LoRA shows better training stability.
4. **Task- and data-dependent optimum**: the best fine-tuning method depends strongly on the task and the amount of fine-tuning data, so selecting it is non-trivial.

The study provides insights for understanding, selecting, and developing LLM fine-tuning methods, highlighting that the specific task and the available data should drive the choice of fine-tuning approach.
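To make finding 1 concrete, below is a minimal sketch (not the authors' released code) of fitting the multiplicative joint scaling law the paper reports, of the form L(X, D_f) = A · X^(-alpha) · D_f^(-beta) + E, where X is one scaling factor (e.g., model size) and D_f is the fine-tuning data size. The data points here are hypothetical placeholders, not results from the paper.

```python
# Sketch: fit a power-based multiplicative joint scaling law with SciPy.
import numpy as np
from scipy.optimize import curve_fit

def joint_scaling_law(xd, log_A, alpha, beta, E):
    """L(X, Df) = A * X^-alpha * Df^-beta + E; log_A keeps A positive."""
    X, Df = xd
    return np.exp(log_A) * X**(-alpha) * Df**(-beta) + E

# Hypothetical observations: (model size, fine-tuning examples) -> eval loss.
X    = np.array([1e9, 1e9, 8e9, 8e9, 16e9, 16e9])
Df   = np.array([1e4, 1e6, 1e4, 1e6, 1e4,  1e6])
loss = np.array([2.10, 1.85, 1.90, 1.70, 1.82, 1.64])

params, _ = curve_fit(joint_scaling_law, (X, Df), loss,
                      p0=(np.log(10.0), 0.1, 0.1, 1.0), maxfev=20000)
log_A, alpha, beta, E = params
print(f"A={np.exp(log_A):.3g}, alpha={alpha:.3f}, beta={beta:.3f}, E={E:.3f}")
```

Fitted exponents like alpha and beta can then be compared across scaling factors; the paper's second finding corresponds to the model-size exponent dominating the pretraining-data exponent.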