May 22-27, 2022 | Elad Ben-Zaken, Shauli Ravfogel, Yoav Goldberg
BitFit is a novel sparse fine-tuning method that modifies only the bias terms of a pre-trained Transformer model, drastically reducing the number of parameters that are updated during fine-tuning. The method is particularly effective with small to medium-sized training datasets, where it achieves performance competitive with or better than fine-tuning the entire model. For larger datasets, BitFit performs comparably to other sparse fine-tuning methods. The key benefits of BitFit include:
1. **Parameter Efficiency**: Only a small subset of parameters (the bias terms) is modified, making it suitable for memory-constrained environments (see the sketch after this list).
2. **Task-Invariance**: The same set of parameters is fine-tuned across different tasks, simplifying deployment and hardware implementation.
3. **Localized Changes**: The modified parameters are isolated and localized, reducing the impact on the overall model.
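In practice, the recipe amounts to freezing every weight matrix and unfreezing only the bias vectors, plus the randomly initialized task head. Below is a minimal sketch using PyTorch and HuggingFace Transformers; the checkpoint name, learning rate, and the reliance on `bias` / `classifier` appearing in parameter names are illustrative assumptions, not details taken from the paper.

```python
# Minimal BitFit-style setup (sketch): freeze everything except the bias terms
# and the task-specific classification head.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2  # placeholder checkpoint and label count
)

for name, param in model.named_parameters():
    # HuggingFace BERT names bias vectors "...bias" and puts the task head
    # under "classifier" -- both are assumptions about this implementation.
    param.requires_grad = ("bias" in name) or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```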
The authors demonstrate that freezing most of the network and fine-tuning only the bias terms is surprisingly effective. Specifically, fine-tuning only the "query" and "middle-of-MLP" bias terms, which together account for only 0.04% of all model parameters, can achieve performance comparable to or better than full fine-tuning. This approach not only improves efficiency but also raises questions about the role of bias terms in pre-trained networks and about the dynamics of the fine-tuning process.
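That narrower variant can be expressed as a parameter-name filter. The sketch below again assumes HuggingFace's BERT module naming (`attention.self.query.bias` and `intermediate.dense.bias` are how those biases happen to be named in that implementation, not terminology from the paper).

```python
# Sketch of the ~0.04% variant: train only the attention "query" biases and
# the intermediate ("middle-of-MLP") biases, plus the task head.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2  # placeholder checkpoint
)

# Suffixes assume HuggingFace's BertLayer module naming.
KEEP = ("attention.self.query.bias", "intermediate.dense.bias")

for name, param in model.named_parameters():
    param.requires_grad = name.endswith(KEEP) or name.startswith("classifier")
```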
Experiments on the GLUE benchmark show that BitFit matches or outperforms existing methods such as Diff-Pruning and Adapters while training significantly fewer parameters. The method also exhibits smaller generalization gaps and performs well on token-level tasks. Overall, BitFit is a promising approach for localized, fast fine-tuning of pre-trained Transformers, offering practical utility and insight into the fine-tuning dynamics of pre-trained models.