Diffusion Language Models Are Versatile Protein Learners

28 Feb 2024 | Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu
This paper introduces the Diffusion Protein Language Model (DPLM), a versatile model for generating and predicting protein sequences. DPLM is pre-trained on evolutionary-scale protein sequences under a discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled manner. After pre-training, DPLM can generate structurally plausible, novel, and diverse protein sequences. It also learns strong sequence representations, outperforming models such as ESM2 on a range of predictive tasks. DPLM can further be fine-tuned for conditional generation, including generating scaffolds for functional motifs, incorporating other modalities (e.g., structure-conditioned generation), and steering sequence generation toward desired properties using classifier guidance.

The paper evaluates DPLM on unconditional generation, predictive tasks, and conditional generation, showing its effectiveness in generating high-quality protein sequences and its ability to capture complex protein structures and properties.
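To make the generation process described above concrete, here is a minimal, illustrative sketch of absorbing-state (masked) discrete diffusion sampling, the family of approaches DPLM builds on: start from a fully masked sequence and iteratively fill in tokens, re-masking the least confident positions according to a schedule. The `toy_denoiser`, the linear re-masking schedule, and the `#` mask token are placeholder assumptions for illustration, not DPLM's actual architecture or training objective.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # placeholder absorbing-state token

def toy_denoiser(seq):
    """Stand-in for DPLM's network: returns a (token, confidence) guess
    for every masked position. A real model would predict from context;
    here we sample uniformly at random purely for illustration."""
    return {i: (random.choice(AMINO_ACIDS), random.random())
            for i, tok in enumerate(seq) if tok == MASK}

def generate(length, steps=8, seed=0):
    """Iterative unmasking: begin fully masked, fill every masked
    position each step, then re-mask the lowest-confidence predictions
    following a linear schedule until the sequence is complete."""
    random.seed(seed)
    seq = [MASK] * length
    for t in range(steps):
        preds = toy_denoiser(seq)
        for i, (tok, _) in preds.items():
            seq[i] = tok
        # Number of positions to keep masked after this step.
        n_mask = int(length * (1 - (t + 1) / steps))
        if n_mask > 0:
            # Re-mask the least-confident predictions.
            ranked = sorted(preds, key=lambda i: preds[i][1])
            for i in ranked[:n_mask]:
                seq[i] = MASK
    return "".join(seq)

print(generate(24))
```

In a real discrete diffusion model the denoiser's confidences come from the learned token distributions, and conditioning (e.g., on a structure or a classifier-guidance signal) would bias those distributions at each refinement step.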