15 Feb 2024 | Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, Paolo Torroni
This paper presents a method for language model (LM) compression called Vocabulary Transfer (VT), which reduces model size and inference time while maintaining performance. The method trains a tokenizer on the target domain to obtain a smaller, in-domain vocabulary; the embeddings for this new vocabulary are then initialized from those of the pre-trained LM. The approach is combined with knowledge distillation (KD) for further compression.
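A minimal sketch of the first step, retraining a general-purpose tokenizer on in-domain text with Hugging Face transformers. The checkpoint name, corpus file, and target vocabulary size here are illustrative assumptions, not the paper's exact setup:

```python
from transformers import AutoTokenizer

# General-purpose tokenizer (e.g. BERT's ~30k WordPiece vocabulary).
base_tok = AutoTokenizer.from_pretrained("bert-base-cased")

# Hypothetical iterator over raw in-domain documents.
def domain_texts():
    with open("domain_corpus.txt") as f:
        for line in f:
            yield line.strip()

# Retrain on the in-domain corpus with a smaller target vocabulary.
domain_tok = base_tok.train_new_from_iterator(domain_texts(), vocab_size=8000)
domain_tok.save_pretrained("domain-tokenizer")
```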
The study evaluates VT on three domains: medical (ADE), legal (LEDGAR), and news (CoNLL03). Results show that the in-domain vocabulary shortens the average tokenized sequence, which lowers computational cost and can even improve downstream performance. When combined with KD, model size is reduced by up to 2.76 times, with only a minor performance drop.
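To make the sequence-length effect concrete, one hedged way to compare average token counts under the two tokenizers; the tokenizer path and sample text are hypothetical:

```python
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("bert-base-cased")
domain_tok = AutoTokenizer.from_pretrained("domain-tokenizer")  # from the previous sketch

def avg_tokens(tok, texts):
    """Average number of tokens the tokenizer produces per document."""
    return sum(len(tok.encode(t, add_special_tokens=False)) for t in texts) / len(texts)

texts = ["Patient developed severe hepatotoxicity after methotrexate."]  # sample domain text
print(f"general-purpose: {avg_tokens(base_tok, texts):.1f} tokens/doc")
print(f"in-domain:       {avg_tokens(domain_tok, texts):.1f} tokens/doc")
```

Since transformer self-attention scales quadratically with sequence length, shorter tokenized inputs translate directly into faster inference.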
VT is implemented through a method called Fast Vocabulary Transfer (FVT), which initializes the embedding of each new in-domain token as the average of the embeddings of the sub-tokens it decomposes into under the general-purpose tokenizer. FVT outperforms the baseline method, Partial Vocabulary Transfer (PVT), which only transfers embeddings for tokens shared between the two vocabularies. The results also show that shrinking the vocabulary does not always hurt performance, and in some cases even improves it.
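A minimal sketch of the FVT initialization, assuming Hugging Face-style tokenizers and an old embedding matrix as a torch tensor; the handling of WordPiece's "##" continuation marker is simplified relative to the paper's implementation:

```python
import torch

@torch.no_grad()
def fvt_init(old_tok, new_tok, old_emb):
    """Fast Vocabulary Transfer (sketch): each in-domain token is embedded
    as the mean of the general-purpose embeddings of its sub-tokens."""
    new_emb = torch.empty(len(new_tok), old_emb.size(1))
    old_vocab = old_tok.get_vocab()
    for token, new_id in new_tok.get_vocab().items():
        if token in old_vocab:
            # Tokens shared by both vocabularies keep their pretrained embedding.
            new_emb[new_id] = old_emb[old_vocab[token]]
        else:
            # Split the new token with the old tokenizer and average the pieces.
            # Stripping the "##" continuation marker is a simplification.
            ids = old_tok.encode(token.replace("##", ""), add_special_tokens=False)
            if ids:
                new_emb[new_id] = old_emb[ids].mean(dim=0)
            else:
                # Fall back to the [UNK] embedding for untokenizable symbols.
                new_emb[new_id] = old_emb[old_tok.unk_token_id]
    return new_emb
```

The returned matrix would replace the model's input (and tied output) embeddings before fine-tuning on the in-domain task.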
VT is shown to be complementary to KD: combining the two yields significant model compression and speedup. Because VT operates only on the vocabulary and embedding layer, it is also orthogonal to other compression techniques, making it a versatile addition to the LM compression toolbox.
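For context, a minimal sketch of the standard distillation objective (Hinton-style soft targets mixed with cross-entropy on gold labels); the temperature and mixing weight are illustrative, not the paper's exact hyperparameters:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soften teacher and student distributions with temperature T,
    match them via KL divergence, and blend with the supervised loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to be independent of T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```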
In conclusion, VT enables a strategic trade-off between model compression, inference speed, and accuracy, particularly in specialized domains. The method is effective in reducing model size and inference time while maintaining performance, making it a valuable tool for industrial NLP applications.