Interpretable feature extraction and dimensionality reduction in ESM2 for protein localization prediction

2024 | Zeyu Luo, Rui Wang, Yawen Sun, Junhao Liu, Zongqing Chen, and Yu-Juan Zhang
This study explores interpretable feature extraction and dimensionality reduction in ESM2 for protein subcellular localization prediction. The authors propose several ESM2 representation extraction strategies that account for both the character type and its position within the ESM2 input sequence. Using dimensionality reduction, predictive analysis, and interpretability techniques, they illuminate potential associations between feature types and specific subcellular localizations. They find that predictions for the Mitochondrion and Golgi apparatus favor segment features closer to the N-terminus, and that phosphorylation-site-based features can mirror phosphorylation properties. They also evaluate the prediction performance and interpretability robustness of Random Forest (RF) and Deep Neural Network (DNN) models with varied feature inputs. This work offers novel insights into maximizing the utility of large language models (LLMs), understanding their mechanisms, and extracting biological domain knowledge. The code, feature extraction API, and all relevant materials are available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.

The study focuses on proteins with amino acid sequence lengths between 15 and 4,000 that carry a single classification label. The dataset was randomly shuffled and divided into a training subset (60%) and a test subset (40%). Protein localization data downloaded from TrEMBL within the UniProt database served as independent test datasets. The authors used the 'cls' and 'eos' character representation vectors and built the mean vector (average pooling) of amino acid residues from the last hidden layer of the ESM2 model. They also designed segmental mean vectors of amino acid sequences and a vector representing phosphorylation sites. In addition, they manually built amino acid feature vectors from domain knowledge and employed Doc2vec and UDSMProt to represent amino acid sequences as non-LLM baselines against ESM2.
They compared how well the different representation vectors distinguish proteins with different subcellular localizations through UMAP dimensionality reduction. Next, features from ESM2 were used separately or in combination to construct downstream subcellular localization classification models (using RF and DNN). The performance of models built with different features was measured by indicators such as the F1 score and the Matthews correlation coefficient (MCC); the authors also removed homologs from the independent datasets and used 5-fold cross-validation MCC scores for a fairer and more systematic evaluation.

The authors further reduced the dimensionality of the ESM2 'cls' feature vectors by constructing a residual variational autoencoder (Res-VAE), exploring how compressing ESM2 feature vectors can yield a better representation. The Res-VAE performs unsupervised dimensionality reduction on the 'cls' representation vector, mapping it into a lower-dimensional latent space represented as a Gaussian distribution parameterized by a mean vector (mu) and a log-variance vector (log_var). They utilized the reparameterization trick to sample from this latent distribution.
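The reparameterization trick used in the Res-VAE's latent sampling can be illustrated in a few lines. This is a generic VAE sketch, not the authors' implementation: instead of sampling z ~ N(mu, sigma^2) directly, one samples eps ~ N(0, I) and computes z = mu + sigma * eps, which keeps the sampling step differentiable with respect to mu and log_var during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu: np.ndarray, log_var: np.ndarray) -> np.ndarray:
    """Sample z = mu + sigma * eps with eps ~ N(0, I), where sigma is
    recovered from the log-variance as exp(0.5 * log_var)."""
    sigma = np.exp(0.5 * log_var)          # log-variance -> std deviation
    eps = rng.standard_normal(mu.shape)    # noise independent of mu, log_var
    return mu + sigma * eps
```

In a trained VAE the gradient flows through mu and sigma while the randomness is isolated in eps, which is what makes end-to-end optimization of the encoder possible.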
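Among the evaluation indicators used for model comparison, the Matthews correlation coefficient is the least standard to compute by hand. A minimal pure-NumPy sketch of binary MCC from confusion-matrix counts (the multi-class generalization used in practice follows the same idea) is:

```python
import numpy as np

def mcc(y_true, y_pred) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix
    counts; returns 0.0 when any marginal is empty (degenerate case)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return float(tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy or F1, MCC stays informative under class imbalance, which is why it is a common choice for subcellular localization benchmarks with unevenly sized compartments.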