2024 | Marvin Alberts, Teodoro Laino & Alain C. Vaucher
A transformer model is introduced that predicts molecular structure directly from infrared (IR) spectra. The model is pretrained on 634,585 simulated IR spectra and fine-tuned on 3,453 experimental spectra. It takes both the IR spectrum and the chemical formula as inputs, with the chemical formula serving as a prior that constrains the chemical space. The spectrum is encoded as a sequence of 400 tokens, corresponding to a spectral resolution of approximately 16 cm⁻¹, which is comparable to the typical distance between peaks in an IR spectrum. Evaluated on experimental data from the NIST IR database for compounds with 6–13 heavy atoms, the model achieves a top-1 accuracy of 44.39%, a top-5 accuracy of 66.85%, and a top-10 accuracy of 69.79%. It predicts the correct scaffold with 84.46% top-1 and 93.00% top-10 accuracy, and it predicts functional groups with an average F1 score of 0.856. Performance is influenced by factors such as the number of heavy atoms and the presence of specific functional groups, and it is validated through extensive testing and comparison with previous works. These results represent a significant advance in automated structure elucidation, highlight the potential of machine learning in chemical analysis, and open new possibilities for applying IR spectroscopy in analytical chemistry.
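The top-1, top-5, and top-10 figures quoted above are top-k accuracies: a prediction counts as correct when the true structure appears among the k highest-ranked candidates. A minimal sketch of that metric follows, assuming each structure is compared as an already canonicalized string. The function name `top_k_accuracy`, the exact-match criterion, and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of samples whose target appears among the first k ranked predictions."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("need one ranked prediction list per target")
    if not targets:
        raise ValueError("cannot score an empty evaluation set")
    # A sample is a hit when its target string is found in the top-k slice.
    hits = sum(
        target in preds[:k]
        for preds, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy data (hypothetical, not from the paper): strings stand in for
# canonicalized structures, listed from most to least likely.
predictions = [
    ["CCO", "COC", "CC=O"],            # target ranked first
    ["CC(C)O", "CCCO", "CCOC"],        # target ranked second
    ["c1ccccc1", "C1CCCCC1", "CC#N"],  # target not among the candidates
]
targets = ["CCO", "CCCO", "CCN"]

top1 = top_k_accuracy(predictions, targets, k=1)
top3 = top_k_accuracy(predictions, targets, k=3)
```

In this toy set one of three targets is ranked first and two of three appear within the top three, so top-k accuracy can only stay equal or rise as k grows, which matches the ordering of the reported top-1, top-5, and top-10 values.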