26 Mar 2024 | Henry Kenlay*¹, Frédéric A. Dreyer*¹, Aleksandr Kovaltsuk¹, Dom Miketa¹, Douglas Pires¹, Charlotte M. Deane¹,²
This paper presents the development and evaluation of two large-scale antibody-specific language models, IgBert and IgT5, designed to handle both paired and unpaired variable region sequences. The models are trained on over two billion unpaired sequences and two million paired sequences from the Observed Antibody Space (OAS) dataset. They are trained with a masked language modeling (MLM) objective and then fine-tuned on paired sequences to learn cross-chain features. The study reports that these models outperform existing antibody and protein language models on tasks including sequence recovery, binding affinity prediction, and expression prediction. They also achieve significantly lower perplexity and pseudo-perplexity, indicating that they predict antibody sequences more accurately. The research highlights the potential of specialized language models for antibody engineering and therapeutic development, and the trained models are available for public use.
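To make the pseudo-perplexity metric concrete: for masked language models it is commonly defined as the exponential of the negative mean log-probability the model assigns to each true residue when that position alone is masked. The sketch below assumes that definition. The function name `pseudo_perplexity` and its input list are illustrative and not taken from the paper's code; in practice the per-position log-probabilities would come from a model such as IgBert.

```python
import math


def pseudo_perplexity(masked_log_probs):
    """Pseudo-perplexity from per-position masked log-probabilities.

    masked_log_probs: natural-log probabilities of the true residue at each
    position, each obtained by masking only that position (assumed input).
    """
    if not masked_log_probs:
        raise ValueError("need at least one position")
    mean_log_prob = sum(masked_log_probs) / len(masked_log_probs)
    return math.exp(-mean_log_prob)


# Toy check: a model that guesses uniformly over the 20 standard amino acids
# assigns probability 1/20 to every true residue.
uniform = [math.log(1 / 20)] * 10
print(pseudo_perplexity(uniform))
```

Under this definition a uniform guesser over 20 amino acids scores about 20, and a perfect predictor scores 1, so lower values indicate better sequence prediction.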