Theory-guided Data Science: A New Paradigm for Scientific Discovery from Data

13 Nov 2017 | Anuj Karpatne, Gowtham Atluri, James H. Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar
This paper introduces theory-guided data science (TGDS), a paradigm for scientific discovery that integrates scientific knowledge with data science models. Unlike traditional data science models, which rely solely on data, TGDS uses scientific theory to make models physically consistent, interpretable, and generalizable. The paper argues that scientific consistency matters most in scientific domains where data are limited and the underlying physical phenomena are complex. It illustrates the limitations of black-box data science models in such applications with the failure of Google Flu Trends, which overestimated flu cases because it lacked scientific consistency.
The paper presents several approaches for integrating domain knowledge into data science models, including theory-guided model design, learning, and regularization, and illustrates them with examples from hydrology, computational chemistry, and climate science. Its overarching vision is a synergy between theory and data: scientific knowledge guides data science models so that they stay consistent with physical principles, which in turn yields models that are accurate, interpretable, and generalizable and leads to deeper scientific insights. The paper concludes by outlining five research themes in TGDS, which include the approaches named above, and highlights the potential of TGDS for advancing scientific discovery.
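As a rough illustration of the theory-guided regularization mentioned above, the sketch below adds a physical-consistency penalty to an ordinary squared-error loss. It is not taken from the paper: the non-negativity constraint, the toy data with a sensor offset, and the names `physics_penalty`, `tgds_loss`, `fit`, and `lam` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical setup: the target is a physical quantity that theory says
# cannot be negative, but the first three observations carry a sensor offset.
x = np.linspace(0.0, 1.0, 11)
X = np.column_stack([np.ones_like(x), x])  # columns: intercept, x
y_obs = x.copy()
y_obs[:3] -= 0.6


def physics_penalty(pred):
    """Mean squared violation of the assumed constraint pred >= 0."""
    return float(np.mean(np.minimum(pred, 0.0) ** 2))


def tgds_loss(w, X, y, lam):
    """Data misfit plus lam times the physical-inconsistency penalty."""
    pred = X @ w
    return float(np.mean((pred - y) ** 2)) + lam * physics_penalty(pred)


def fit(X, y, lam, lr=0.02, steps=5000):
    """Minimize tgds_loss for a linear model by plain gradient descent."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        pred = X @ w
        grad_data = (2.0 / n) * (X.T @ (pred - y))
        # Gradient of the penalty: only predictions below zero contribute.
        grad_phys = (2.0 / n) * (X.T @ np.minimum(pred, 0.0))
        w -= lr * (grad_data + lam * grad_phys)
    return w


w_plain = fit(X, y_obs, lam=0.0)   # purely data-driven fit
w_tg = fit(X, y_obs, lam=10.0)     # fit with the theory-guided penalty

print("penalty, data only:    ", physics_penalty(X @ w_plain))
print("penalty, theory-guided:", physics_penalty(X @ w_tg))
```

The weight `lam` trades data fit against physical consistency. With `lam=0.0` the model reduces to ordinary least squares, while larger values push predictions toward the assumed constraint at some cost in training error.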