30 May 2024 | Guillaume Huguet, James Vuckovic, Kilian Fatras, Eric Thibodeau-Laufer, Pablo Lemos, Riashat Islam, Cheng-Hao Liu, Jarrid Rector-Brooks, Tara Akhound-Sadegh, Michael Bronstein, Alexander Tong, Avishek Joey Bose
FOLDFLOW-2 is a sequence-conditioned, SE(3)-equivariant flow matching model for protein structure generation. Its architecture adds three components: a protein large language model that encodes the sequence, a multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer-based decoder. The model is SE(3)-invariant and handles multi-modal data by design, and its ability to mask sequences enables a wide range of conditional generation tasks. FOLDFLOW-2 is trained on a new dataset of high-quality synthetic structures, including structures filtered from SwissProt, that is significantly larger than prior datasets, to enhance the diversity and novelty of generated samples. It also incorporates Reinforced Fine-Tuning (ReFT) to align the model with arbitrary rewards, which improves secondary structure diversity. The model is evaluated on unconditional generation, folding, motif scaffolding, and equilibrium conformation sampling. In unconditional generation it outperforms previous state-of-the-art models on designability, diversity, and novelty across all protein lengths, and it demonstrates generalization in equilibrium conformation sampling. It also handles conditional design tasks such as motif scaffolding and designing scaffolds for VHH nanobodies. FOLDFLOW-2 requires fewer GPU hours and fewer parameters than other models, which makes it a competitive model for protein structure generation and a potential practical base model for future work on capturing protein dynamics.
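To make the phrase "SE(3)-equivariant flow matching" more concrete, here is a minimal sketch of the idea on plain 3D coordinates. It is not the FOLDFLOW-2 implementation: it covers only point positions in R^3 with a straight-line path between noise and data, and every name in it (`interpolate`, `target_velocity`, `flow_matching_loss`, the 8-point toy "backbone") is hypothetical. It shows the regression target used in flow matching and checks numerically that rotating the inputs rotates the interpolant and the target velocity in the same way.

```python
import numpy as np


def random_rotation(rng):
    """Draw a random 3x3 rotation matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs of the orthogonal factor
    if np.linalg.det(q) < 0:     # flip one column so that det(q) = +1
        q[:, 0] = -q[:, 0]
    return q


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1 at time t."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight-line path; the regression target."""
    return x1 - x0


def flow_matching_loss(predict, x0, x1, t):
    """Mean squared error between a predicted and the target velocity."""
    x_t = interpolate(x0, x1, t)
    return float(np.mean((predict(x_t, t) - target_velocity(x0, x1)) ** 2))


# Toy data (assumption): 8 points standing in for residue positions.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 3))   # noise sample
x1 = rng.normal(size=(8, 3))   # "data" sample
R = random_rotation(rng)
t = 0.3

# Equivariance check: rotating both endpoints rotates x_t and the velocity.
x_t = interpolate(x0, x1, t)
x_t_rotated_inputs = interpolate(x0 @ R.T, x1 @ R.T, t)
v_rotated_inputs = target_velocity(x0 @ R.T, x1 @ R.T)
rotation_ok = np.allclose(x_t_rotated_inputs, x_t @ R.T) and np.allclose(
    v_rotated_inputs, target_velocity(x0, x1) @ R.T
)

# An oracle predictor that returns the exact target gives zero loss.
oracle_loss = flow_matching_loss(lambda x, s: x1 - x0, x0, x1, t)
```

In a trained model the oracle would be replaced by a neural network, and equivariance would have to hold for the network itself, not only for the targets as checked here.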