Protein function prediction as approximate semantic entailment

14 February 2024 | Maxat Kulmanov, Francisco J. Guzmán-Vega, Paula Duek Roggli, Lydie Lane, Stefan T. Arold & Robert Hoehndorf
DeepGO-SE is a novel method for predicting protein functions that combines a pretrained large language model with a neuro-symbolic model that performs function prediction as approximate semantic entailment. The method generates multiple approximate models of the Gene Ontology (GO), and a neural network predicts the truth values of statements about protein functions in these models. By aggregating truth values across multiple models, DeepGO-SE approximates semantic entailment when predicting protein functions. The approach exploits background knowledge in the GO and improves protein function prediction compared to state-of-the-art methods.
Protein function prediction is a key challenge in modern biology and bioinformatics, as it enables a better understanding of protein roles and interactions. Accurate functional descriptions are essential for drug target identification, understanding disease mechanisms, and biotechnological applications. While protein structure prediction has improved, predicting protein functions remains challenging due to the limited number of known functions and their complexity.
GO is one of the most successful ontologies in biology, describing molecular functions, biological processes, and cellular components. Researchers identify protein functions through experiments and report them in the scientific literature, and the resulting annotations are added to knowledge bases. These annotations are generally propagated to homologous proteins, resulting in extensive GO annotations in databases like UniProtKB/Swiss-Prot.
Recent methods for predicting protein functions use various sources of information, including sequence, interactions, structure, literature, coexpression, phylogenetic analysis, and GO. These methods may use sequence domain annotations, deep convolutional neural networks, language models, or pretrained protein language models.
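The aggregation step described in the opening summary can be illustrated with a small sketch. The summary only says that truth values are aggregated across multiple models. Taking the minimum is an assumption made here, chosen because a statement is semantically entailed only if it holds in every model, and the mean is shown as an alternative. The function names, the threshold, and the scores are hypothetical and are not taken from the DeepGO-SE code.

```python
from statistics import mean


def aggregate_truth_values(scores_per_model, how="min"):
    """Combine the truth values that several approximate models assign
    to one statement (for example "protein P has function F").

    scores_per_model: list of floats in [0, 1], one per model.
    how: "min" mimics entailment (true in all models); "mean" is a softer option.
    """
    if not scores_per_model:
        raise ValueError("need at least one model score")
    if how == "min":
        return min(scores_per_model)
    if how == "mean":
        return mean(scores_per_model)
    raise ValueError(f"unknown aggregation: {how}")


def predict_functions(scores, threshold=0.5, how="min"):
    """Keep the GO terms whose aggregated truth value reaches the threshold.

    scores: dict mapping a GO term ID to its list of per-model truth values.
    """
    predictions = {}
    for term, values in scores.items():
        aggregated = aggregate_truth_values(values, how)
        if aggregated >= threshold:
            predictions[term] = aggregated
    return predictions


# Made-up truth values from three hypothetical models for two GO terms.
example_scores = {
    "GO:0003824": [0.9, 0.8, 0.95],  # high in every model
    "GO:0005515": [0.9, 0.2, 0.85],  # one model disagrees
}
print(predict_functions(example_scores))  # {'GO:0003824': 0.8}
```

With the minimum, a single dissenting model is enough to reject a statement, which matches the "true in all models" reading of entailment. With `how="mean"`, both terms in this toy example would pass the 0.5 threshold.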
Models may also incorporate protein-protein interactions through knowledge graph embeddings, k-nearest neighbours, or graph convolutional neural networks. Natural language models applied to scientific literature have also been successful in automated function prediction.
However, many function prediction methods rely on sequence similarity, which is less reliable for proteins with little or no sequence similarity to known functional domains. Molecular functions arise from structure, and proteins with similar structures may have different sequences. Conversely, proteins with similar sequences can have different functions depending on their active sites and the organisms in which they are found. Therefore, methods that use the same sources of information for all three subontologies of GO are limited.
DeepGO-SE uses a pretrained protein language model (ESM2) to generate protein representations and combines them with a neuro-symbolic model that performs function prediction as approximate semantic entailment. The model uses ELEmbeddings, generated from GO axioms, to encode ontology axioms using geometric shapes and relations. The model performs semantic entailment by testing the truth of statements in multiple generated world models.
DeepGO-SE was evaluated on the UniProtKB/Swiss-Prot dataset and outperformed baseline methods in terms of Fmax, AUPR, and AUC. The model also improved predictions for complex biological processes and cellular components by incorporating information about an organism's proteome and interactome in the form of protein-protein interaction networks. The model was further evaluated on the neXtProt
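To make the geometric encoding of ontology axioms more concrete, here is a minimal sketch. It assumes, following the published ELEmbeddings formulation as best recalled here, that each GO class is represented as a ball with a centre and a radius, and that a subclass axiom "C is a subclass of D" is satisfied when the ball for C lies inside the ball for D. The function name, the margin parameter, and the coordinates are hypothetical, and the sketch leaves out the relation vectors and normalisation terms that a full implementation would need.

```python
import math


def subsumption_violation(center_c, radius_c, center_d, radius_d, margin=0.0):
    """Measure how far ball C sticks out of ball D.

    Returns 0.0 when C lies entirely inside D, which is how the axiom
    "C is a subclass of D" is read geometrically in this sketch.
    A positive value could be used as a loss term during training.
    """
    distance = math.dist(center_c, center_d)
    return max(0.0, distance + radius_c - radius_d - margin)


# Hypothetical 2D embeddings: a small ball nested inside a large one.
specific_class = ((0.1, 0.0), 0.3)  # e.g. a specific molecular function
general_class = ((0.0, 0.0), 1.0)   # e.g. its more general parent class

# The specific class is inside the general one, so the axiom is satisfied.
print(subsumption_violation(*specific_class, *general_class))  # 0.0

# The reverse axiom is violated because the large ball cannot fit in the small one.
print(subsumption_violation(*general_class, *specific_class) > 0.0)  # True
```

A set of ball positions and radii that drives every such violation towards zero acts as one approximate model of the ontology. Different random initialisations would yield different models, over which truth values can then be aggregated.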