Understanding Reoptimization of MDL Keys for Use in Drug Discovery

The paper discusses the reoptimization of MDL keysets for use in molecular similarity calculations. MDL keysets, originally designed for substructure searching, were reoptimized to improve performance in clustering bioactive compounds. The underlying technology allows the definition of descriptors based on atom, bond, and atomic neighborhood properties, which are then encoded into keysets.

The success measure used in the study was based on the fraction of molecular nearest neighbors that belong to the same activity class as the target compound. The study explored several methods for keyset optimization, including random pruning, surprisal pruning, and surprisal S/N pruning. Random pruning showed little effect on performance for keysets larger than 1000 bits, while surprisal S/N pruning yielded better results. Genetic algorithm optimization was also tested, but no single globally optimal keyset was found. Instead, multiple keysets with similar performance were identified.

Compared with the 166-bit and 960-bit keysets, which scored 0.65 and 0.67 respectively, a surprisal S/N pruned keyset with 208 bits and a genetic algorithm optimized keyset with 548 bits both raised the success measure to 0.71. The best-performing keyset had a success measure of 0.711 with 548 bits. The study concluded that reoptimizing keysets for molecular similarity can significantly improve performance, and that genetic algorithm optimization is a powerful tool for this purpose. However, the lack of a single globally optimal keyset suggests that keyset optimization should be guided by known constraints rather than left to run autonomously.
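
The success measure described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes each compound is represented as the set of keys that are "on", that similarity is the Tanimoto coefficient, and that only the single nearest neighbor of each compound is checked. The function names (`tanimoto`, `success_measure`, `prune`) and the toy data are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two sets of 'on' key indices (assumed metric)."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def success_measure(fingerprints: Sequence[Set[int]], classes: Sequence[str]) -> float:
    """Fraction of compounds whose nearest neighbor has the same activity class."""
    hits = 0
    for i, fp in enumerate(fingerprints):
        best_j: Optional[int] = None
        best_sim = -1.0
        for j, other in enumerate(fingerprints):
            if i == j:
                continue  # a compound is not its own neighbor
            sim = tanimoto(fp, other)
            if sim > best_sim:  # ties keep the earliest compound
                best_j, best_sim = j, sim
        if best_j is not None and classes[best_j] == classes[i]:
            hits += 1
    return hits / len(fingerprints)


def prune(fingerprints: Sequence[Set[int]], kept_keys: Set[int]) -> List[Set[int]]:
    """Restrict every fingerprint to a reduced keyset."""
    return [fp & kept_keys for fp in fingerprints]


# Toy data: five compounds, two hypothetical activity classes.
fps = [{0, 1, 2}, {0, 1, 3}, {4, 5, 6}, {4, 5, 7}, {0, 4, 8}]
labels = ["A", "A", "B", "B", "B"]

print(success_measure(fps, labels))
print(success_measure(prune(fps, {0, 1, 4, 5}), labels))
```

Scoring a candidate keyset this way before and after pruning is the kind of comparison the study reports. A pruning strategy or a genetic algorithm would call a function like `success_measure` as its objective, with the surprisal-based selection of `kept_keys` left out here because the summary does not define it.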

Reoptimization of MDL Keys for Use in Drug Discovery

2002 | Joseph L. Durant, Burton A. Leland, Douglas R. Henry, and James G. Nourse