Invalid SMILES are beneficial rather than detrimental to chemical language models


April 2024 | Michael A. Skinnider
Chemical language models trained on SMILES (Simplified Molecular-Input Line-Entry System) representations can generate invalid SMILES: strings that do not correspond to any chemical structure. This study provides causal evidence that generating invalid outputs is not harmful but beneficial to chemical language models. The generation of invalid SMILES acts as a self-corrective mechanism that filters low-likelihood samples from the model output. Conversely, enforcing valid outputs introduces structural biases into the generated molecules, impairing distribution learning and limiting generalization to unseen chemical space. These results refute the prevailing assumption that invalid SMILES are a shortcoming of chemical language models and reframe them as a feature, not a bug.

Over the past century, more than 100 million small molecules have been synthesized in the search for new drugs and materials. These efforts have explored only an infinitesimal subset of chemical space, whose size is estimated at more than 10^60 molecules, yet even this limited exploration has yielded numerous molecules that modulate biological processes.

Although the generation of invalid SMILES is widely perceived to be a shortcoming of chemical language models, the study argues that this perception is mistaken. Invalid SMILES are sampled with significantly lower likelihoods than valid SMILES, so discarding them provides an intrinsic mechanism for identifying and removing low-quality samples from the model output.
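As an illustration of this filtering mechanism (a minimal sketch, not the paper's code), the standard way to check validity is to attempt parsing with RDKit: any generated string that cannot be parsed into a molecule is simply dropped. The list of generated strings below is hypothetical.

```python
# Sketch: treating invalid SMILES as a built-in filter on model output.
from rdkit import Chem

# Hypothetical samples from a SMILES language model. The last two are invalid:
# an unclosed ring bond and a pentavalent carbon, respectively.
generated = ["CCO", "c1ccccc1", "CC(=O)NC", "C1CCCC", "CC(C)(C)(C)C"]

valid = []
for smi in generated:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:                      # parse succeeded -> chemically valid
        valid.append(Chem.MolToSmiles(mol))  # keep the canonical form

validity = len(valid) / len(generated)
print(f"{validity:.0%} valid:", valid)
```

Because invalid strings tend to be assigned low likelihoods by the model, this parse-and-discard step acts as the self-corrective filter the study describes.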
Consistent with this mechanism, models trained on SMILES outperform those trained on SELFIES (SELF-referencing Embedded Strings), a textual representation designed to yield 100% valid output. The valency constraints that SELFIES imposes prevent invalid outputs, but they lead to an overrepresentation of aliphatic rings and an underrepresentation of aromatic rings in the generated molecules. These compositional biases are reflected in poor performance on distribution-learning metrics such as the Fréchet ChemNet distance. Removing the valency constraints, and thereby allowing the model to generate invalid outputs, corrects the biases and improves performance. The same advantage carries over to structure elucidation tasks, where SMILES-trained models again outperform SELFIES-trained ones. Taken together, the results indicate that the ability to generate invalid outputs is not a fundamental flaw of the underlying model but a feature that can be leveraged to improve performance.
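The kind of constraint relaxation discussed above can be sketched with the selfies Python package, assuming its get_semantic_constraints/set_semantic_constraints API; this is an illustrative approximation of removing valency enforcement, not the study's exact procedure.

```python
# Sketch: relaxing SELFIES valency constraints so decoding no longer
# silently "corrects" generated strings into valid molecules.
import selfies as sf

# Default per-atom bond limits enforced during decoding, e.g. {'C': 4, 'N': 3, ...}.
constraints = sf.get_semantic_constraints()

# Raise every limit so valency is effectively unenforced; decoded molecules may
# then be chemically invalid, mirroring the unconstrained condition in the study.
relaxed = {atom: 8 for atom in constraints}
sf.set_semantic_constraints(relaxed)

# Decoding still works; with relaxed limits, hypervalent outputs are no longer
# rewritten away and can later be rejected by a SMILES validity check.
print(sf.decoder("[C][=C][C][=C][C][=C][Ring1][=Branch1]"))  # benzene
```

Under this relaxed setting, invalid outputs reappear and can be filtered exactly as with SMILES, which is the condition under which the study reports the compositional biases disappearing.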