26 Mar 2024 | Henry Kenlay, Frédéric A. Dreyer, Aleksandr Kovaltsuk, Dom Miketa, Douglas Pires, Charlotte M. Deane
This paper introduces IgBert and IgT5, which the authors present as the best-performing antibody-specific language models developed to date. Both models can consistently handle paired and unpaired variable region sequences as input. They are trained on the Observed Antibody Space dataset, which contains over two billion unpaired sequences and two million paired sequences of light and heavy chains. The models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. The paper also presents the training strategy, the data preparation, and the evaluation of the models on downstream tasks, including sequence recovery, binding affinity prediction, and perplexity measurement.
The results show that the paired models significantly outperform the unpaired models on binding affinity prediction tasks. The models are publicly available and can be applied to a range of antibody engineering tasks. The study highlights the potential of large-scale paired antibody language models to advance antibody research and development.
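The summary names sequence recovery and perplexity as evaluation measures without defining them. The sketch below is a minimal illustration that assumes the conventional definitions: recovery as the fraction of positions where the predicted residue matches the true one, and perplexity as the exponential of the mean negative log-probability assigned to the true residues. The function names and toy inputs are hypothetical, and the paper's exact evaluation protocol (for example, which positions are masked) may differ.

```python
import math


def sequence_recovery(true_seq, predicted_seq):
    """Fraction of positions where the predicted residue equals the true residue."""
    if len(true_seq) != len(predicted_seq):
        raise ValueError("sequences must have equal length")
    if not true_seq:
        raise ValueError("sequences must be non-empty")
    matches = sum(t == p for t, p in zip(true_seq, predicted_seq))
    return matches / len(true_seq)


def perplexity(true_token_probs):
    """Exponential of the mean negative log-probability of the true tokens.

    `true_token_probs` holds, for each scored position, the probability
    the model assigned to the correct residue (each value in (0, 1]).
    """
    if not true_token_probs:
        raise ValueError("need at least one probability")
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


# Toy inputs (made up for illustration, not taken from the paper).
recovery = sequence_recovery("EVQLV", "EVKLV")   # one mismatch out of five
ppl = perplexity([0.5, 0.5])                      # model is 50/50 at each position
```

Under these definitions, higher recovery and lower perplexity both indicate that the model assigns more probability to the observed residues; a perplexity of 1 would correspond to perfect confidence in every true residue.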