Identifying functional sites in proteins using a multilevel alphabet
The evolutionary process constantly samples the space of possible sequences and structures consistent with the function of a protein. Within a protein sequence, not all residues evolve at the same rate: some are conserved while others tend to vary. Essential positions, which are usually related to protein function or protein stability, are often conserved; conservation can therefore be used to predict essential residues.
By employing a multilevel alphabet (MLA), which combines physicochemical properties with secondary structure information, we show that we can improve the prediction of functional residues in proteins, such as catalytic sites, hot spots and disease-related mutations. The residues that are conserved at the MLA level comprise a distinct population that only partially overlaps with the residues conserved at the amino acid level. Furthermore, using the MEME search algorithm, we show that we can identify protein signatures within subsets of proteins, which encompass common sequence and structural information from large datasets of linear sequences, as well as predict common structural properties of known functional motifs.
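
To make the idea of conservation at the MLA level concrete, the following is a minimal sketch and not the alphabet or scoring scheme used in this work. It assumes a hypothetical six-class physicochemical grouping (`PHYSCHEM_GROUP`), a one-letter secondary structure code per residue (for example H, E, C), an ungapped toy alignment, and Shannon entropy as the conservation measure; all function names are illustrative.

```python
import math
from collections import Counter

# ASSUMPTION: an illustrative physicochemical grouping, not the grouping
# defined by the authors. Each amino acid maps to a one-character class.
PHYSCHEM_GROUP = {
    **dict.fromkeys("AVLIMC", "h"),  # hydrophobic
    **dict.fromkeys("FWY", "a"),     # aromatic
    **dict.fromkeys("STNQ", "p"),    # polar uncharged
    **dict.fromkeys("KRH", "+"),     # positively charged
    **dict.fromkeys("DE", "-"),      # negatively charged
    **dict.fromkeys("GP", "s"),      # special backbone geometry
}


def to_mla(sequence, secondary_structure):
    """Encode each position as a two-character symbol:
    physicochemical class followed by secondary structure state."""
    if len(sequence) != len(secondary_structure):
        raise ValueError("sequence and secondary structure must have equal length")
    return [
        PHYSCHEM_GROUP.get(aa, "x") + ss  # "x" marks an unrecognised residue
        for aa, ss in zip(sequence, secondary_structure)
    ]


def column_entropy(column):
    """Shannon entropy (bits) of the symbols in one alignment column.
    Lower entropy means a more conserved column."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(rows):
    """Per-column entropy for equal-length rows of symbols."""
    return [column_entropy(col) for col in zip(*rows)]


if __name__ == "__main__":
    # Toy, ungapped alignment with made-up secondary structure strings.
    sequences = ["KDLV", "RELI", "KDGV"]
    structures = ["HHEE", "HHEE", "HHEE"]

    aa_profile = conservation_profile(sequences)
    mla_profile = conservation_profile(
        [to_mla(s, ss) for s, ss in zip(sequences, structures)]
    )
    for i, (aa, mla) in enumerate(zip(aa_profile, mla_profile), start=1):
        print(f"position {i}: amino acid entropy={aa:.3f}, MLA entropy={mla:.3f}")
```

In this toy alignment the first position varies between K and R at the amino acid level but is invariant at the MLA level (positively charged, helical), which illustrates how the two levels can flag different sets of conserved positions.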