Automated Statistical Model Discovery with Language Models

2024 | Michael Y. Li, Emily B. Fox, Noah D. Goodman
This paper introduces a method for automated statistical model discovery with large language models (LLMs). The approach, called BoxLM, draws on the domain knowledge and programming ability of LMs to iteratively propose and refine probabilistic models. The procedure is framed within Box's Loop: an LM acting as both modeler and domain expert writes candidate probabilistic programs, the programs are fitted and evaluated against data, and a critic LM assesses the results and suggests revisions for the next round. Because the LM writes programs directly, the method avoids domain-specific languages and hand-crafted search procedures, enabling more flexible and open-ended model discovery.

The method is evaluated in three settings: searching within a restricted space of models, searching over an open-ended space, and improving expert models under natural-language constraints. BoxLM identifies models on par with those of human experts and extends classic models in interpretable ways. It is particularly useful when modeling constraints are difficult to formalize but easy to express in natural language, such as requiring that a model remain interpretable to ecologists. In experiments, BoxLM outperforms existing methods on tasks such as Gaussian process kernel discovery and improves on classic models of predator-prey dynamics. The system is robust to variations in dataset metadata and adapts to different modeling constraints.

The discovered models balance flexibility against interpretability, highlighting the potential of LMs to automate statistical model discovery and, more broadly, to accelerate and democratize scientific research.
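To make the propose-fit-criticize cycle concrete, the sketch below runs Box's Loop over a restricted Gaussian process kernel space, the first of the three evaluation settings. It is a minimal illustration, not the paper's implementation: the functions propose_candidates and fit_and_score are hypothetical stand-ins, a simple grammar expansion replaces the proposer LM, scikit-learn's GP regressor supplies the fitting step, and the log marginal likelihood stands in for the quantitative part of criticism. In BoxLM both proposal and criticism are produced by language models, and the critic also returns natural-language feedback that conditions the next round.

```python
"""Minimal sketch of Box's Loop over a restricted GP kernel space.

Illustrative only: propose_candidates / fit_and_score are placeholders for
the LM proposer and the model-criticism step described in the paper.
"""
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, RationalQuadratic, ExpSineSquared, WhiteKernel,
)

BASE_KERNELS = [RBF(), RationalQuadratic(), ExpSineSquared()]


def propose_candidates(current_kernel):
    """Stand-in for the LM proposer: extend the current kernel by adding or
    multiplying a base kernel (the restricted search space)."""
    if current_kernel is None:
        return [k + WhiteKernel() for k in BASE_KERNELS]
    candidates = []
    for k in BASE_KERNELS:
        candidates.append(current_kernel + k)
        candidates.append(current_kernel * k)
    return candidates


def fit_and_score(kernel, X, y):
    """Fit the GP and return its log marginal likelihood as the criticism signal."""
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X, y)
    return gp.log_marginal_likelihood_value_


def boxs_loop(X, y, rounds=3):
    """Greedy propose-fit-criticize loop: keep the best-scoring revision each round."""
    best_kernel, best_score = None, -np.inf
    for _ in range(rounds):
        for cand in propose_candidates(best_kernel):
            score = fit_and_score(cand, X, y)
            # In BoxLM a critic LM would also return natural-language feedback
            # (e.g. "residuals show unmodeled periodicity") to guide proposals.
            if score > best_score:
                best_kernel, best_score = cand, score
    return best_kernel, best_score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.linspace(0, 10, 100).reshape(-1, 1)
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)
    kernel, score = boxs_loop(X, y)
    print(kernel, score)
```

The point of the sketch is the structure of the loop rather than the search strategy: replacing the grammar expansion with an LM that reads the data description and the critic's feedback is what removes the need for a hand-crafted search procedure.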