14 February 2024 | Maxat Kulmanov, Francisco J. Guzmán-Vega, Paula Duek Roggli, Lydie Lane, Stefan T. Arold, Robert Hoehndorf
The Gene Ontology (GO) is a comprehensive theory with over 100,000 axioms that describes protein functions, biological processes, and cellular locations across three subontologies. Predicting protein functions using GO requires both learning and reasoning capabilities to maintain consistency and leverage background knowledge. DeepGO-SE, a method developed by the authors, predicts GO functions from protein sequences using a pretrained large language model. It generates multiple approximate models of GO and uses a neural network to predict the truth values of statements about protein functions in these models. By aggregating these truth values, DeepGO-SE approximates semantic entailment, improving protein function prediction compared to state-of-the-art methods. The approach effectively exploits background knowledge in GO and enhances predictions for molecular functions, biological processes, and cellular components. The method is evaluated on the UniProtKB/Swiss-Prot dataset and the neXtProt dataset, showing significant improvements over baseline methods. The results highlight the effectiveness of incorporating background knowledge and protein interactions in function prediction models.
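The aggregation step described above can be illustrated with a small sketch. It is not the authors' implementation: the function name `approximate_entailment`, the score layout, and the example values are all hypothetical, and the summary does not say which aggregation DeepGO-SE uses. The sketch assumes each approximate model yields a truth value in [0, 1] for the statement "this protein has function T", and shows two possible aggregations: the minimum, which mirrors the idea that an entailed statement holds in every model, and the mean, a softer alternative.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Hypothetical layout: one dict per approximate model, mapping a GO term ID
# to the predicted truth value (0..1) of "the protein has this function".
ModelScores = Dict[str, float]


def approximate_entailment(
    per_model: List[ModelScores],
    aggregate: Callable[[Sequence[float]], float] = min,
) -> ModelScores:
    """Combine per-model truth values into one score per GO term.

    With aggregate=min, a term scores highly only if it is predicted
    true in every model, which approximates "true in all models".
    """
    if not per_model:
        raise ValueError("need at least one model")
    # Keep only terms that every model assigns a truth value to.
    terms = set(per_model[0])
    for scores in per_model[1:]:
        terms &= set(scores)
    return {t: aggregate([m[t] for m in per_model]) for t in sorted(terms)}


# Illustrative values only; these are not results from the paper.
scores = [
    {"GO:0003674": 0.99, "GO:0005515": 0.80},
    {"GO:0003674": 0.97, "GO:0005515": 0.60},
    {"GO:0003674": 0.98, "GO:0005515": 0.70},
]

strict = approximate_entailment(scores)        # minimum over models
soft = approximate_entailment(scores, mean)    # average over models
print(strict)
print(soft)
```

Taking the minimum is conservative, since a single model that assigns a low truth value vetoes the prediction, whereas the mean tolerates disagreement between models. Which trade-off is appropriate depends on how the approximate models are constructed.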