November 1992 | STEVEN HENIKOFF* AND JORA G. HENIKOFF
This paper presents a method for deriving substitution matrices from blocks of aligned protein segments; the resulting matrices outperform traditional ones such as those based on the Dayhoff model. The authors used about 2000 blocks of aligned sequence segments from more than 500 groups of related proteins to create substitution matrices that improve alignment and homology search results. The method involves constructing a frequency table of amino acid pairs from these blocks, then calculating a logarithm of odds (lod) matrix from the observed and expected pair frequencies. Because the counts come directly from conserved regions of related proteins, the scores reflect substitutions actually observed in those regions rather than rates extrapolated from closely related sequences.
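The counting and lod steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, a block is assumed to be a list of equal-length ungapped strings, scores are assumed to be in half-bit units (2 × log2), and any weighting or clustering of similar sequences within a block is left out.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(blocks):
    """Count unordered residue pairs, column by column, over all blocks."""
    f = Counter()
    for block in blocks:
        for column in zip(*block):  # one column = one aligned position
            for a, b in combinations(column, 2):
                f[tuple(sorted((a, b)))] += 1
    return f


def pair_and_residue_freqs(f):
    """Turn pair counts into observed pair frequencies q and residue frequencies p."""
    total = sum(f.values())
    q = {pair: n / total for pair, n in f.items()}
    p = Counter()
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:  # a mixed pair contributes half to each residue
            p[a] += q_ab / 2
            p[b] += q_ab / 2
    return q, p


def lod_scores(q, p, scale=2.0):
    """Unrounded lod score for each observed pair: scale * log2(observed / expected)."""
    scores = {}
    for (a, b), q_ab in q.items():
        expected = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = scale * math.log2(q_ab / expected)
    return scores


# Toy block: a single column holding nine A residues and one S residue.
toy_block = ["A"] * 9 + ["S"]
counts = pair_counts([toy_block])  # 36 A-A pairs and 9 A-S pairs
q, p = pair_and_residue_freqs(counts)
scores = lod_scores(q, p)
```

Pairs that never occur in the input have no entry in `scores`; a real matrix would need a rule for them, and the final integer matrix would be obtained by rounding.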
The BLOSUM matrices derived by this method are compared with Dayhoff-based PAM matrices. Relative entropy, the average information per aligned residue pair that distinguishes observed pair frequencies from those expected by chance, is used to pair matrices from the two series for comparison, for example BLOSUM 62 with PAM 160. The BLOSUM 62 matrix, in particular, performs well in multiple alignment and homology searches. Compared with PAM 160, it is more tolerant of substitutions involving hydrophobic amino acids, less tolerant of those involving hydrophilic ones, and more tolerant of mismatches for rare amino acids such as cysteine and tryptophan.
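As a rough illustration of the relative entropy mentioned here, the sketch below assumes the usual definition H = Σ q_ij · log2(q_ij / e_ij) over unordered residue pairs, with expected frequencies e_ii = p_i² and e_ij = 2·p_i·p_j. The function name and the toy frequencies are hypothetical and are not taken from the paper.

```python
import math


def relative_entropy(q, p):
    """Relative entropy in bits of observed pair frequencies q against chance.

    q maps unordered pairs (a, b) to observed frequencies summing to 1;
    p maps each residue to its background frequency.
    """
    h = 0.0
    for (a, b), q_ab in q.items():
        expected = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        if q_ab > 0:  # pairs that never occur contribute nothing
            h += q_ab * math.log2(q_ab / expected)
    return h


# Toy two-letter alphabet: identical pairs are over-represented relative to chance.
p_toy = {"A": 0.5, "S": 0.5}
q_conserved = {("A", "A"): 0.4, ("A", "S"): 0.2, ("S", "S"): 0.4}
q_chance = {("A", "A"): 0.25, ("A", "S"): 0.5, ("S", "S"): 0.25}

h_conserved = relative_entropy(q_conserved, p_toy)  # positive: pairs carry information
h_chance = relative_entropy(q_chance, p_toy)        # zero: observed equals expected
```

A larger value means that each aligned pair carries more information for separating related from unrelated sequences, which is why matrices with similar values are the natural ones to compare.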
The BLOSUM matrices were tested against various protein families, including the guanine nucleotide-binding protein-coupled receptors, and showed improved performance in detecting homologous sequences. The BLOSUM matrices also outperformed recent updates of the Dayhoff matrices in these tests. The authors conclude that substitution matrices derived from aligned blocks, which represent highly conserved regions of proteins, are more appropriate for sequence alignment and homology searches than matrices based on extrapolated mutation rates. The BLOSUM series is based on the identity and composition of groups in Prosite and the accuracy of the PROTOMAT system, and is expected to remain stable in the future.