Received February 27, 1998 | Peter Willett*, John M. Barnard and Geoffrey M. Downs
This paper reviews the use of similarity searching in chemical databases, distinguishing it from substructure searching. It discusses fragment-based measures used for searching chemical structure databases and focuses on two key characteristics of similarity measures: the coefficient that quantifies structural resemblance and the structural representations used to characterize molecules. New types of similarity measures are compared with current approaches, and examples of applications related to similarity searching are provided.
The paper highlights the limitations of substructure searching, such as the requirement for a clear understanding of the query structure and the lack of control over the output size. It introduces fragment-based similarity searching, which involves comparing a target structure with a set of structural descriptors in the database to rank molecules by decreasing similarity. The paper also reviews various similarity and distance coefficients, including Hamming distance, Euclidean distance, Soergel distance, Tanimoto coefficient, Dice coefficient, and Cosine coefficient, and discusses their properties and applications. Finally, it explores different structural representations for similarity searching, such as 2D and 3D fragment descriptors, physicochemical properties, topological indices, and whole molecule comparisons, emphasizing the need for effective and efficient representations.
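For binary fragment descriptors, the coefficients named above have standard closed forms in terms of three counts: a, the number of fragments present in the first molecule; b, the number present in the second; and c, the number common to both. The following is a minimal Python sketch of those forms and of ranking a database by decreasing similarity. It assumes each fingerprint is stored as a set of fragment identifiers, which is an illustrative choice rather than the representation of any particular system, and all function names are hypothetical.

```python
from math import sqrt


def bit_counts(x, y):
    """Return (a, b, c): fragments in x, fragments in y, fragments in both."""
    return len(x), len(y), len(x & y)


def tanimoto(x, y):
    a, b, c = bit_counts(x, y)
    union = a + b - c
    # Convention assumed here: two empty fingerprints count as identical.
    return c / union if union else 1.0


def dice(x, y):
    a, b, c = bit_counts(x, y)
    return 2 * c / (a + b) if (a + b) else 1.0


def cosine(x, y):
    a, b, c = bit_counts(x, y)
    # If either fingerprint is empty the ratio is undefined; fall back to
    # 1.0 when both are empty and 0.0 otherwise (an assumed convention).
    return c / sqrt(a * b) if (a and b) else float(a == b)


def hamming(x, y):
    """Number of fragments present in exactly one of the two molecules."""
    a, b, c = bit_counts(x, y)
    return a + b - 2 * c


def euclidean(x, y):
    """For binary data, the square root of the Hamming distance."""
    return sqrt(hamming(x, y))


def soergel(x, y):
    """For binary data, the complement of the Tanimoto coefficient."""
    return 1.0 - tanimoto(x, y)


def rank_by_similarity(target, database, coefficient=tanimoto):
    """Return (score, name) pairs sorted by decreasing similarity to target.

    `database` is assumed to map molecule names to fingerprint sets.
    """
    scored = [(coefficient(target, fp), name) for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

As a worked check, a target with four fragments and a database molecule with six fragments, two of them shared, give a = 4, b = 6, c = 2, so the Tanimoto coefficient is 2/8 = 0.25, the Dice coefficient is 4/10 = 0.4, and the Soergel distance is 6/8 = 0.75. Passing `dice` or `cosine` as the `coefficient` argument changes the scores used for ranking; the distance functions would need ascending order instead.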