Interpretable feature extraction and dimensionality reduction in ESM2 for protein localization prediction

2024 | Zeyu Luo, Rui Wang, Yawen Sun, Junhao Liu, Zongqing Chen, and Yu-Juan Zhang
This study explores interpretable feature extraction and dimensionality reduction in ESM2 for protein subcellular localization prediction. The authors propose several ESM2 representation extraction strategies that account for both the character type and its position within the ESM2 input sequence. Using dimensionality reduction, predictive analysis, and interpretability techniques, they illuminate potential associations between feature types and specific subcellular localizations. They find that predictions for the Mitochondrion and Golgi apparatus favor segment features closer to the N-terminus, and that phosphorylation-site-based features can mirror phosphorylation properties. They also evaluate the prediction performance and interpretability robustness of Random Forest (RF) and Deep Neural Network (DNN) models with varied feature inputs. This work offers novel insights into maximizing the utility of large language models (LLMs), understanding their mechanisms, and extracting biological domain knowledge. The code, feature extraction API, and all relevant materials are available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.

The study focuses on proteins with amino acid sequence lengths between 15 and 4,000 that carry a single classification label. The dataset was randomly shuffled and divided into a training subset (60%) and a test subset (40%). Protein localization data downloaded from TrEMBL within the UniProt database served as independent test datasets. The authors used the 'cls' and 'eos' character representation vectors and built the mean vector (average pooling) of amino acid residues from the last hidden layer of the ESM2 model. They also designed segmental mean vectors of amino acid sequences and a vector representing phosphorylation sites. In addition, they manually built amino acid feature vectors from domain knowledge and employed Doc2vec and UDSMProt to represent amino acid sequences as non-LLM baselines against ESM2.
They compared how well the different representation vectors distinguish proteins with different subcellular localizations through UMAP dimensionality reduction. Next, features from ESM2 were used separately or in combination to construct downstream subcellular localization classification models (using RF and DNN). The performance of models built with different features was measured by indicators such as the F1 score and the Matthews correlation coefficient (MCC); the authors also removed homologs from the independent datasets and used 5-fold cross-validation MCC scores for a fairer and more systematic evaluation.

The authors further reduced the dimensionality of the ESM2 'cls' feature vectors by constructing a residual variational autoencoder (Res-VAE), exploring how compressing ESM2 feature vectors can yield a better representation. The Res-VAE performs unsupervised dimensionality reduction on the 'cls' representation vector, mapping it into a lower-dimensional latent space represented as a Gaussian distribution parameterized by a mean vector (mu) and a log-variance vector (log_var). They utilized the reparameterization trick to sample from this latent distribution.
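The reparameterization trick used in the Res-VAE's latent sampling can be illustrated in a few lines. This is a generic VAE sketch, not the authors' implementation: instead of sampling z ~ N(mu, sigma^2) directly, one samples eps ~ N(0, I) and computes z = mu + sigma * eps, which keeps the sampling step differentiable with respect to mu and log_var during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu: np.ndarray, log_var: np.ndarray) -> np.ndarray:
    """Sample z = mu + sigma * eps with eps ~ N(0, I), where sigma is
    recovered from the log-variance as exp(0.5 * log_var)."""
    sigma = np.exp(0.5 * log_var)          # log-variance -> std deviation
    eps = rng.standard_normal(mu.shape)    # noise independent of mu, log_var
    return mu + sigma * eps
```

In a trained VAE the gradient flows through mu and sigma while the randomness is isolated in eps, which is what makes end-to-end optimization of the encoder possible.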
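Among the evaluation indicators used for model comparison, the Matthews correlation coefficient is the least standard to compute by hand. A minimal pure-NumPy sketch of binary MCC from confusion-matrix counts (the multi-class generalization used in practice follows the same idea) is:

```python
import numpy as np

def mcc(y_true, y_pred) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix
    counts; returns 0.0 when any marginal is empty (degenerate case)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return float(tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy or F1, MCC stays informative under class imbalance, which is why it is a common choice for subcellular localization benchmarks with unevenly sized compartments.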