Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules — 5 Dec 2017 | Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik
This paper presents a method for automatic chemical design using a data-driven continuous representation of molecules. The approach involves training a deep neural network on a large dataset of chemical structures to create an encoder, decoder, and predictor. The encoder converts discrete molecular representations into continuous vectors, while the decoder translates these vectors back into molecular representations. The predictor estimates chemical properties from the continuous representation.
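The encoder's output can be sketched with the variational autoencoder's reparameterization trick, which keeps a sampled latent vector differentiable with respect to the encoder parameters. The linear maps, dimensions, and variable names below are toy stand-ins for the trained networks, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """Hypothetical encoder head: map a molecular feature vector to a
    Gaussian over the latent space (mean and log-variance)."""
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps; the randomness lives in eps, so z
    stays differentiable with respect to mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Toy dimensions: a 32-dim feature vector mapped to a 4-dim latent space.
x = rng.standard_normal(32)
W_mu = rng.standard_normal((32, 4)) * 0.1
W_logvar = rng.standard_normal((32, 4)) * 0.1

mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
print(z.shape)  # (4,)
```

A decoder and property predictor would consume `z`; here only the encoding step is shown.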
The continuous representation allows for the automatic generation of novel chemical structures through operations in the latent space, such as decoding random vectors, perturbing known structures, or interpolating between molecules. It also enables gradient-based optimization to efficiently guide the search for optimized compounds. The method was tested on drug-like molecules and molecules with fewer than nine heavy atoms.
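The three latent-space operations act directly on latent vectors. A minimal sketch, using a toy dimensionality and random codes in place of real encoded molecules (the paper's latent spaces have 156 and 196 dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy latent dimensionality

z_a = rng.standard_normal(d)  # hypothetical latent code of a known molecule
z_b = rng.standard_normal(d)  # latent code of a second molecule

# 1. Random generation: decode a draw from the latent prior.
z_random = rng.standard_normal(d)

# 2. Perturbation: small Gaussian noise around a known molecule.
z_perturbed = z_a + 0.1 * rng.standard_normal(d)

# 3. Interpolation: points on the straight line between two molecules.
alphas = np.linspace(0.0, 1.0, 5)
path = np.array([(1 - a) * z_a + a * z_b for a in alphas])

print(path.shape)  # (5, 8)
```

Each resulting vector would then be passed through the decoder to recover a molecular structure.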
The continuous representation offers several advantages. It eliminates the need for hand-specified mutation rules, allows gradient-based optimization for larger jumps in chemical space, and leverages large sets of unlabeled compounds to build an implicit library. The method also enables the use of Bayesian optimization to select compounds likely to be informative about the global optimum.
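Gradient-based optimization in the latent space can be sketched with a toy differentiable objective; the concave quadratic `f` below is a hypothetical stand-in for the trained property predictor, not the paper's network:

```python
import numpy as np

# Hypothetical optimum of the stand-in property surface.
target = np.array([1.0, -2.0, 0.5])

def f(z):
    """Toy differentiable property predictor over latent codes."""
    return -np.sum((z - target) ** 2)

def grad_f(z):
    """Analytic gradient of the toy predictor."""
    return -2.0 * (z - target)

z = np.zeros(3)            # hypothetical starting latent code
for _ in range(200):       # plain gradient ascent in latent space
    z = z + 0.05 * grad_f(z)

print(np.round(z, 3))  # converges to `target`
```

With a real predictor the gradient would come from backpropagation through the network, but the update rule is the same.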
The paper introduces a variational autoencoder (VAE) to ensure that points in the latent space correspond to valid molecules. The VAE was trained on two datasets: QM9 (molecules with fewer than nine heavy atoms) and ZINC (drug-like molecules). The latent space representations for these datasets had 156 and 196 dimensions, respectively.
The results show that the continuous latent space allows for interpolation of molecules by following the shortest Euclidean path between their representations. The VAE was able to generate realistic-looking molecules with properties consistent with the training data. The method was also used for property prediction, where a multi-layer perceptron was trained to predict properties from the latent representation.
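Property prediction from latent codes can be sketched as a one-hidden-layer perceptron trained by gradient descent; the synthetic data, layer sizes, and hyperparameters below are toy assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: latent codes Z and a synthetic scalar property y(Z).
Z = rng.standard_normal((200, 4))
y = np.sin(Z[:, 0]) + 0.5 * Z[:, 1]

# One-hidden-layer perceptron, trained by full-batch gradient descent.
W1 = rng.standard_normal((4, 16)) * 0.3
b1 = np.zeros(16)
w2 = rng.standard_normal(16) * 0.3
b2 = 0.0
lr = 0.05

for _ in range(3000):
    h = np.tanh(Z @ W1 + b1)          # hidden activations
    pred = h @ w2 + b2                # predicted property
    err = pred - y                    # residual
    # Backpropagate the mean squared-error loss.
    gw2 = h.T @ err / len(y)
    gb2 = err.mean()
    gh = np.outer(err, w2) * (1 - h ** 2)
    gW1 = Z.T @ gh / len(y)
    gb1 = gh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; w2 -= lr * gw2; b2 -= lr * gb2

mse = np.mean((np.tanh(Z @ W1 + b1) @ w2 + b2 - y) ** 2)
print(round(float(mse), 4))
```

The trained map from latent code to property is what makes gradient-guided search in the latent space possible.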
The paper also discusses the optimization of molecules in the latent space using a Gaussian process (GP) surrogate model. The objective function was 5 × QED − SAS, where QED is a quantitative estimate of drug-likeness and SAS is a synthetic accessibility score, so the objective favors drug-like molecules that are also easy to synthesize. The GP-guided search outperformed baseline methods in terms of the percentile scores of the molecules it found.
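The GP-guided search can be sketched in miniature: fit a Gaussian process to a few evaluated latent points, then pick the next candidate where the surrogate is promising. The 1-D objective, RBF kernel, and upper-confidence-bound acquisition below are illustrative assumptions standing in for the paper's Bayesian optimization setup:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

# Hypothetical objective over a 1-D latent slice, standing in for
# 5 * QED - SAS evaluated on decoded molecules.
def objective(Z):
    return np.sin(3 * Z[:, 0]) - 0.5 * Z[:, 0] ** 2

Z_train = rng.uniform(-2, 2, size=(8, 1))   # latent points evaluated so far
y_train = objective(Z_train)

# GP posterior mean and variance on a dense grid of candidates.
Z_cand = np.linspace(-2, 2, 200)[:, None]
K = rbf(Z_train, Z_train) + 1e-6 * np.eye(len(Z_train))
K_s = rbf(Z_cand, Z_train)
mu = K_s @ np.linalg.solve(K, y_train)
var = 1.0 - np.einsum('ij,jk,ik->i', K_s, np.linalg.inv(K), K_s)

# Upper-confidence-bound acquisition: sample where mean + std is high.
ucb = mu + np.sqrt(np.maximum(var, 0.0))
z_next = Z_cand[np.argmax(ucb)]
print(z_next)
```

The selected `z_next` would be decoded into a molecule, scored, and added to the training set for the next round.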
The paper concludes that the proposed method offers a new approach for exploring chemical space, enabling efficient and effective molecular design by combining continuous representations with gradient-based optimization. Suggested future directions include graph-based autoencoders and adversarial networks for sequence generation.