11 February 2009 | Ivan Erill* and Michael C O'Neill
This article reexamines the effectiveness of information theory-based methods for identifying DNA-binding sites in genomes. The authors use newly available data on transcription factors from different bacterial genomes to assess these methods more thoroughly. They find that conventional benchmarking against artificial sequence data often overestimates search efficiency. Sequence information alone is often insufficient, and other cues like curvature must be considered in real genomes. Methods integrating skew information, such as Relative Entropy (RE), are ineffective in skewed genomes because their assumptions may not hold. The evidence suggests that binding sites tend to evolve towards genomic skew rather than against it, maintaining their information content through increased conservation. The authors identify several misconceptions about information theory and propose a revised paradigm to explain the observed results. They conclude that among information theory-based methods, the most straightforward approaches perform better on average, as heuristic corrections to these methods often fail on real data. The reexamination of information content in binding sites reveals that it is a compound measure of search and binding affinity requirements, with important implications for understanding binding site evolution.
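To make the contrast between plain information content and Relative Entropy (RE) concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a small collection of aligned binding sites of equal length, a uniform-background definition of information content (2 bits maximum per position), and the usual Kullback-Leibler form of RE against genomic base frequencies. The function names, the example sites, and the skewed background values are all hypothetical, and the sketch omits refinements such as small-sample corrections.

```python
import math
from collections import Counter

BASES = "ACGT"

def column_frequencies(sites, pseudocount=0.0):
    """Per-position base frequencies for a list of aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    columns = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        columns.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return columns

def information_content(columns):
    """Sum over positions of (2 - column entropy), in bits.

    Assumption: a uniform background, so each position carries at most 2 bits.
    """
    total = 0.0
    for col in columns:
        entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
        total += 2.0 - entropy
    return total

def relative_entropy(columns, background):
    """Sum over positions of the KL divergence from the background, in bits."""
    total = 0.0
    for col in columns:
        total += sum(p * math.log2(p / background[b])
                     for b, p in col.items() if p > 0)
    return total

if __name__ == "__main__":
    # Hypothetical AT-rich sites and a hypothetical AT-skewed genome background.
    sites = ["TTGACA", "TTGATA", "TTTACA", "TTGACT"]
    uniform = {b: 0.25 for b in BASES}
    skewed = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

    cols = column_frequencies(sites)
    print("information content:", information_content(cols))
    print("RE, uniform background:", relative_entropy(cols, uniform))
    print("RE, skewed background:", relative_entropy(cols, skewed))
```

With a uniform background the two measures coincide. Under a skewed background, RE gives less weight to positions whose preferred base matches the skew, which is the assumption the article questions when binding sites evolve towards the genomic skew rather than against it.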