11 February 2009 | Ivan Erill* and Michael C O'Neill
This article reevaluates information theory-based methods for identifying DNA-binding sites in genomes. It highlights that conventional benchmarking using artificial sequences often overestimates the efficiency of these methods. In real genomes, locating binding sites requires additional cues, such as DNA curvature, beyond sequence information. Methods integrating skew information, like Relative Entropy, are ineffective in real genomes because their assumptions may not hold. The evidence suggests that binding sites evolve towards genomic skew and maintain their information content through conservation. The study identifies misconceptions in information theory applications, such as negative entropy, and proposes a revised paradigm for understanding binding site evolution.
The article discusses the application of information theory to binding site recognition, including the Heterology Index (HI) and Relative Entropy (RE). It shows that methods like HI and RE are derived from information theory principles but have limitations. The study evaluates the performance of various methods on both equiprobable and skewed genomic backgrounds. It finds that non-weighted methods often outperform weighted ones in genome-wide searches, suggesting that information in poorly conserved positions is used by proteins to distinguish true binding sites from the genomic background. However, RE-based methods fail in real skewed genomes, as they overestimate the importance of anti-skew positions, leading to high false positive rates.
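To make the contrast between R_sequence and RE concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the standard textbook definitions: R_sequence as the sum over positions of 2 bits minus the positional Shannon entropy (with no small-sample correction), and RE as the Kullback-Leibler divergence of positional frequencies from the background base frequencies. The function names, the toy site collection, and the skewed background values are all hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_frequencies(sites):
    """Per-position base frequencies for equal-length aligned sites."""
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: counts[b] / len(sites) for b in BASES})
    return freqs

def r_sequence(freqs):
    """Information content assuming an equiprobable background:
    sum over positions of (2 bits - Shannon entropy of the position)."""
    total = 0.0
    for col in freqs:
        entropy = -sum(f * math.log2(f) for f in col.values() if f > 0)
        total += 2.0 - entropy
    return total

def relative_entropy(freqs, background):
    """Relative entropy of the positional frequencies with respect
    to the given background base frequencies."""
    return sum(
        f * math.log2(f / background[b])
        for col in freqs
        for b, f in col.items()
        if f > 0
    )

# Hypothetical toy collection: four identical, fully conserved sites.
sites = ["ACGT"] * 4
freqs = position_frequencies(sites)

uniform = {b: 0.25 for b in BASES}
# Hypothetical AT-rich (skewed) background.
skewed = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

print(r_sequence(freqs))
print(relative_entropy(freqs, uniform))
print(relative_entropy(freqs, skewed))
```

Under these assumptions the two measures coincide on an equiprobable background. On the AT-rich background, the conserved C and G positions each contribute more than 2 bits to RE, which illustrates how an RE-based score can weight anti-skew positions more heavily than R_sequence does.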
The study also assesses the performance of search methods on real genomes, finding that RE-based methods perform worse than R_sequence-based ones in some cases. This challenges the assumption that RE is equivalent to R_sequence in skewed genomes. The results suggest that binding sites evolve to maintain their information content through conservation, rather than against genomic skew. The study concludes that information content in binding sites is a compound measure of search and binding affinity requirements, with important implications for understanding binding site evolution. The findings highlight the need for more comprehensive methods that incorporate additional information beyond sequence data.