Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures
Open Access
- 22 June 2012
- journal article
- research article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 13 (1), 1-21
- https://doi.org/10.1186/1471-2105-13-144
Abstract
The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches.
Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures, starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional specialization. At the same time, it facilitates the sequence and structural annotation of functionally important residues. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related, a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day.
This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time-consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.
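
The idea of "pattern residues that most distinguish each protein domain subgroup from other related subgroups" can be illustrated with a toy calculation. The sketch below is not the authors' heuristic/MCMC procedure; it assumes a much simpler stand-in, the per-column Kullback-Leibler divergence between a subgroup (foreground) and the remaining sequences (background) of an alignment. The function names, the pseudocount value, and the toy sequences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(seqs, col, pseudocount=1.0):
    """Smoothed residue frequencies for one alignment column (gaps ignored)."""
    counts = Counter(s[col] for s in seqs if s[col] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def divergence_scores(foreground, background):
    """KL divergence (in bits) of foreground from background, per column.

    Assumption: a simple stand-in score, not the statistical criteria
    used in the paper.
    """
    ncols = len(foreground[0])
    scores = []
    for col in range(ncols):
        p = column_frequencies(foreground, col)
        q = column_frequencies(background, col)
        scores.append(sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS))
    return scores


# Hypothetical toy alignment: rows are aligned sequences of equal length.
foreground = ["GKSTL", "GKSAL", "GKSTL", "GKSSL"]  # the subgroup of interest
background = ["GDATL", "GEASL", "GDAAL", "GEATL"]  # related subgroups

scores = divergence_scores(foreground, background)
# Column indices (0-based) ordered from most to least distinguishing.
ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
```

In this toy input, the columns where the subgroup is conserved but differs from the background (the second and third) receive positive scores, while columns with identical residue compositions in both sets score zero. A real analysis would also need sequence weighting and a significance criterion, which this sketch omits.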