Seeing data as t-SNE and UMAP do

June 2024 | Vivien Marx
Dimension reduction techniques such as t-SNE and UMAP are widely used to visualize high-dimensional datasets, but they require careful use and parameter tuning. Both methods are nonlinear, and their output can change with the parameter settings. They can create misleading patterns, such as spurious clusters, and may not preserve global structure. Researchers including Rafael Irizarry and Jingyi Jessica Li caution against over-reliance on these tools and emphasize that parameters should be chosen with a justified method and an understanding of the data. PCA is often applied first to reduce the number of dimensions before t-SNE or UMAP, which are more effective at preserving local structure. Because these methods can distort the data, their plots should not be treated as definitive conclusions. Tools such as scDEED help assess the reliability of such visualizations, and researchers should apply several methods and compare the results to gain a comprehensive understanding. When using clustering algorithms, researchers must consider the scientific question and be aware of the algorithms' assumptions.

The National Academies of Sciences, Engineering, and Medicine have issued recommendations on the use of population descriptors in genetics and genomics research. Using self-identified race and ethnicity as genetic descriptors is problematic because they are social constructs, not genetic categories. The All of Us Research Program's revision of one of its figures highlights the need for careful data visualization to avoid misrepresentation. The field of human genetics is evolving, and awareness of the importance of responsible data visualization and analysis is growing. Statistics and thoughtful use of these methods are crucial for accurate scientific conclusions.
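
The workflow described above (PCA first, then a nonlinear embedding, repeated under different parameter settings and checked for reliability) can be sketched roughly as follows. This is a minimal illustration, not the procedure from the article: it assumes scikit-learn is available, uses t-SNE only, and substitutes scikit-learn's generic `trustworthiness` score for a dedicated tool such as scDEED. The function name `embed_with_perplexities`, the perplexity values, and the number of principal components are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness


def embed_with_perplexities(X, perplexities=(5, 30), n_pcs=10, seed=0):
    """Run PCA, then t-SNE once per perplexity.

    Returns a dict mapping each perplexity to (embedding, trustworthiness score).
    All defaults here are illustrative assumptions, not recommended settings.
    """
    # Linear pre-reduction with PCA before the nonlinear step.
    X_pcs = PCA(n_components=n_pcs, random_state=seed).fit_transform(X)

    results = {}
    for perplexity in perplexities:
        tsne = TSNE(
            n_components=2,
            perplexity=perplexity,  # must be smaller than the number of samples
            init="pca",
            random_state=seed,
        )
        embedding = tsne.fit_transform(X_pcs)
        # Score in [0, 1]; it is lowered when points that look like neighbors
        # in the 2-D plot are not neighbors in the original input space.
        score = trustworthiness(X, embedding, n_neighbors=10)
        results[perplexity] = (embedding, score)
    return results


if __name__ == "__main__":
    # Pure Gaussian noise: any apparent clusters in the plots are spurious.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 20))
    for perplexity, (embedding, score) in embed_with_perplexities(X).items():
        print(f"perplexity={perplexity}: shape={embedding.shape}, "
              f"trustworthiness={score:.2f}")
```

Plotting the embeddings side by side for the same input is one simple way to see how much of a picture depends on the parameter setting; a score like this is only a rough local-structure check and does not replace a justified choice of parameters.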