Invalid SMILES are beneficial rather than detrimental to chemical language models

29 March 2024 | Michael A. Skinnider
The article challenges the common perception that invalid SMILES strings are a significant drawback of chemical language models. Instead, it provides causal evidence that the ability to generate invalid outputs benefits these models. The study shows that invalid SMILES tend to be low-likelihood samples, so discarding them acts as a built-in filter that removes low-quality samples and improves model performance. Conversely, removing the valency constraints from the SELFIES representation, so that models trained on it can also generate invalid outputs, improves performance and corrects structural biases in how those models explore chemical space. These biases, such as an overrepresentation of aliphatic rings and an underrepresentation of aromatic rings, impair generalization to unseen chemical space.
The article also shows that models capable of generating invalid outputs outperform those that cannot in tasks such as structure elucidation of complex natural products. Overall, the findings suggest that invalid outputs should be seen as a feature rather than a bug, and that efforts to improve chemical language models should focus on enhancing their performance rather than on enforcing the generation of valid molecules.
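The filtering idea summarized above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `is_syntactically_valid` helper is a toy check (balanced parentheses and paired ring-closure digits) standing in for a real parser such as RDKit, and the sampled strings and their log-likelihoods are invented values, not results from the article.

```python
from statistics import mean


def is_syntactically_valid(smiles: str) -> bool:
    """Toy validity check (assumption: stands in for a real SMILES parser).

    Only checks balanced parentheses and paired single-digit ring closures.
    It ignores valence, aromaticity, and bracket atoms such as [13C].
    """
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            # A digit opens a ring bond on first sight and closes it on the next.
            open_rings ^= {ch}
    return bool(smiles) and depth == 0 and not open_rings


# Hypothetical (SMILES, log-likelihood) pairs, as if sampled from a model.
samples = [
    ("c1ccccc1O", -8.2),   # phenol
    ("CC(=O)O", -6.5),     # acetic acid
    ("C1CCCCC1", -7.9),    # cyclohexane
    ("c1ccccc1(", -15.3),  # unbalanced parenthesis
    ("CC(C)C1CC", -14.1),  # ring bond never closed
]

# Split the samples by validity; keeping only `valid` is the filtering step.
valid = [(s, ll) for s, ll in samples if is_syntactically_valid(s)]
invalid = [(s, ll) for s, ll in samples if not is_syntactically_valid(s)]

mean_valid = mean(ll for _, ll in valid)
mean_invalid = mean(ll for _, ll in invalid)

print(f"kept {len(valid)} of {len(samples)} samples")
print(f"mean log-likelihood (valid):   {mean_valid:.2f}")
print(f"mean log-likelihood (invalid): {mean_invalid:.2f}")
```

In this made-up data the discarded strings are also the least likely ones, which is the pattern the article reports for real models; in practice the validity check would be delegated to a cheminformatics toolkit rather than a hand-written helper.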