16 April 2024 | Xia Yu¹,², Jia Ren¹, Haixia Long², Rao Zeng², Guoqiang Zhang², Anas Bilal² and Yani Cui¹*
iDNA-OpenPrompt is a prompt-learning model, built on the OpenPrompt framework, for identifying DNA methylation sites. It combines a prompt template, a prompt verbalizer, and a pre-trained language model (PLM), together with a DNA vocabulary library, a BERT tokenizer, and label words constructed specifically for DNA methylation sequences. The model was evaluated on 17 benchmark datasets covering various species and three types of DNA methylation modification (4mC, 5hmC, and 6mA), using metrics such as accuracy (ACC), sensitivity (SN), specificity (SP), Matthews correlation coefficient (MCC), and area under the curve (AUC). On all 17 datasets it outperforms existing methods in both performance and robustness. This effectiveness is attributed to the OpenPrompt learning framework combined with the prompt template and verbalizer designed for DNA methylation sequences. Cross-species validation performance is also strong, indicating that the model can identify methylation sites across different species. The study further examines how the DNA vocabulary and label words affect accuracy: the highest accuracy is achieved when the nucleotide length of both is set to 6. The main contributions are the creation of the DNA vocabulary library, the integration of the BERT tokenizer for DNA methylation sequences, and the construction of label words specific to those sequences. One limitation is that the DNA vocabulary in the prompt template is generated manually, so it must be regenerated by hand for other biological sequences.
Future research directions include automating vocabulary generation and adapting the model to other biological information sequences.
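To make the role of the nucleotide length concrete, the following is a minimal sketch of how a vocabulary of length-6 nucleotide words could be enumerated and applied to a sequence. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_kmer_vocab` and `kmer_tokenize`, the overlapping sliding window, and the stride of 1 are all hypothetical choices, and the summary does not specify how the actual vocabulary library was constructed.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def build_kmer_vocab(k: int = 6) -> dict[str, int]:
    """Map every possible k-mer over A/C/G/T to an integer id.

    Hypothetical helper: enumerating all 4**k words is one simple way
    to build a fixed-length DNA vocabulary programmatically.
    """
    kmers = ("".join(p) for p in product(NUCLEOTIDES, repeat=k))
    return {kmer: idx for idx, kmer in enumerate(kmers)}


def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers with a sliding window.

    Assumption: overlapping windows with stride 1; the real tokenizer
    may segment sequences differently.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


vocab = build_kmer_vocab(6)                  # 4**6 = 4096 entries
tokens = kmer_tokenize("ACGTACGTAC", k=6)    # 5 overlapping 6-mers
ids = [vocab[t] for t in tokens]             # integer ids for the tokens
```

Sequences containing ambiguous bases such as `N` would raise a `KeyError` at the lookup step in this sketch, so a real pipeline would need an unknown-token entry or a filtering step.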
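The evaluation metrics named above have standard definitions for binary classification, which the sketch below computes from scratch. It uses only the usual textbook formulas; the function names, the label convention (1 for a methylation site, 0 otherwise), and the toy data are assumptions for illustration and do not come from the study.

```python
import math


def classification_metrics(y_true, y_pred):
    """Return ACC, SN, SP, and MCC for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


def auc_score(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half. This pairwise form is quadratic in the number
    of samples, which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up labels and scores, not results from the study).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]

results = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

MCC is often the most informative of these on imbalanced datasets because it uses all four cells of the confusion matrix, whereas ACC can look high when one class dominates.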