Understanding Synthetic data generation methods in healthcare: A review on open-source tools and methods

Synthetic data generation has emerged as a promising solution to overcome challenges posed by data scarcity and privacy concerns in healthcare, enabling the training of AI algorithms on unbiased, large-scale data. This review explores the application and efficacy of synthetic data methods in healthcare, focusing on tabular, imaging, radiomics, time-series, and omics data. A systematic search of PubMed and Scopus databases identified studies utilizing various methods, including statistical, probabilistic, machine learning, and deep learning approaches. Deep learning-based synthetic data generators were used in 72.6% of the studies, with 75.3% implemented in Python. The review highlights the use of synthetic data to reduce clinical trial costs, enhance AI model predictive power, ensure fair treatment recommendations, and provide access to high-quality, representative multimodal datasets without exposing sensitive patient information. The review also discusses the importance of synthetic data in addressing data privacy concerns, as they can ensure personal identifiers are absent, safeguarding patient confidentiality. Synthetic data can mitigate harmful biases in real data, such as gender, race, and health insurance status, by generating balanced, diverse data. The review emphasizes the need for high fidelity in synthetic data to ensure they accurately mimic real data without compromising privacy. It also highlights the challenges and limitations of synthetic data, including the balance between realism and privacy, and the need for robust evaluation metrics to assess data quality. The review presents a comprehensive analysis of synthetic data generation methods, open-source repositories, and their applications in healthcare. It discusses the use of various methods for generating synthetic data across different data types, including tabular, imaging, radiomics, time-series, and omics data. 
The review also highlights the importance of programming languages and open-source tools in implementing synthetic data generation methods. The findings indicate that deep learning-based methods are the most prevalent, followed by statistical and machine learning methods. The review concludes that synthetic data can significantly advance personalized medicine, improve treatment efficacy, and ensure data privacy in healthcare.

Synthetic data generation methods in healthcare: A review on open-source tools and methods

2024 | Vasileios C. Pezoulas, Dimitrios I. Zaridis, Eugenia Mylona, Christos Androutos, Kosmas Apostolidis, Nikolaos S. Tachos, Dimitrios I. Fotiadis