January 4, 2024 | Justin Barton, Jacob D. Galson, Jinwoo Leem
The paper "Enhancing Antibody Language Models with Structural Information" by Justin Barton, Jacob D. Galson, and Jinwoo Leem from Alchemab Therapeutics Ltd explores the integration of structural information into antibody language models to improve their performance in binding prediction tasks. The authors propose a multimodal contrastive learning approach called Contrastive Sequence-Structure Pre-training (CSSP), which combines the representations of antibody sequences and structures in a mutual latent space. This approach enhances the models' ability to capture structural similarity and improves accuracy and data efficiency.
The study uses a large dataset of human antibody sequences and structures, including experimental and predicted structures. The CSSP method is applied to three different protein language models: AntiBERTa2, ESM2-650M, and AntiBERTy. The results show that CSSP increases the structural information content in the models' embeddings, leading to better correlations with antibody structural similarity. Specifically, AntiBERTa2-CSSP achieves the highest Pearson correlation coefficients for Fv and CDRH3 similarity.
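The structural-information evaluation can be pictured as correlating embedding-space similarity with structural similarity across antibody pairs. The sketch below assumes a precomputed n x n structural similarity matrix and illustrative input names; it is not the paper's evaluation code.

```python
# Hedged sketch: Pearson correlation between embedding cosine similarity
# and a precomputed structural similarity matrix. Inputs are illustrative.
import numpy as np
from scipy.stats import pearsonr

def embedding_structure_correlation(embeddings: np.ndarray,
                                    structural_sim: np.ndarray) -> float:
    """Pearson r between cosine similarity of embeddings (n, d) and a
    structural similarity matrix (n, n) over unique antibody pairs."""
    # Cosine similarity between all pairs of embeddings.
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    emb_sim = norms @ norms.T

    # Compare only the upper triangle (each pair counted once).
    iu = np.triu_indices_from(emb_sim, k=1)
    r, _ = pearsonr(emb_sim[iu], structural_sim[iu])
    return r
```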
In benchmarking, the models are evaluated on a dataset of trastuzumab variants binding to HER2. The CSSP-trained models show improved performance, with a mean area under the receiver operating characteristic curve (AUROC) that increases with each epoch of CSSP training up to 5 epochs. The models also exhibit better data efficiency, especially in low-data settings, showing a significant improvement in AUROC even when only a small subset of experimental structures is used.
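As a rough picture of this kind of benchmark, the sketch below fits a simple classifier on frozen language-model embeddings of antibody variants and scores it with AUROC. The data split, classifier choice, and function name are assumptions for illustration; only the metric matches the paper.

```python
# Illustrative sketch of a binder/non-binder benchmark scored with AUROC.
# The setup (logistic regression on frozen embeddings) is an assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def binding_auroc(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """AUROC for binder (1) vs non-binder (0) prediction from embeddings."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, stratify=labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]  # predicted binding probability
    return roc_auc_score(y_te, scores)
```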
The authors also caution that predicted structures do not contribute novel information to CSSP, emphasizing the importance of using high-quality experimental structures. Overall, the study demonstrates the effectiveness of CSSP in enhancing antibody language models and its potential for practical applications in antibody engineering, particularly in resource-limited environments.