Enhanced protein domain discovery using taxonomy
Open Access
- 1 January 2004
- journal article
- research article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 5 (1), 56
- https://doi.org/10.1186/1471-2105-5-56
Abstract
Background: It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids.
Results: We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques.
Conclusions: Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions, such as domain co-occurrence and protein localisation.
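The abstract does not give the scoring formula, so the following is only a minimal sketch of how a taxonomic prior could be combined with a profile-HMM score. It assumes the HMM score is a log-odds value in bits and adds the log2 ratio of a domain's frequency in the query's taxon to its overall frequency. The function names, the frequency table, and the 25-bit threshold are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical per-taxon domain frequencies (illustrative numbers only).
DOMAIN_FREQ = {
    "overall":   {"TPR": 0.010, "KingdomSpecific": 0.002},
    "Eukaryota": {"TPR": 0.020, "KingdomSpecific": 0.004},
    "Bacteria":  {"TPR": 0.005, "KingdomSpecific": 0.000001},
}


def taxonomy_adjusted_score(hmm_bits, p_domain_given_taxon, p_domain_overall):
    """Add a log2 prior ratio to a profile-HMM bit score (assumed form)."""
    return hmm_bits + math.log2(p_domain_given_taxon / p_domain_overall)


def rescore(hits, taxon, threshold_bits=25.0):
    """Return (domain, adjusted_score) pairs that reach the threshold.

    hits: list of (domain_name, raw_hmm_bit_score) for one protein.
    taxon: key into DOMAIN_FREQ for the protein's source organism.
    """
    accepted = []
    for domain, raw_bits in hits:
        adjusted = taxonomy_adjusted_score(
            raw_bits,
            DOMAIN_FREQ[taxon][domain],
            DOMAIN_FREQ["overall"][domain],
        )
        if adjusted >= threshold_bits:
            accepted.append((domain, adjusted))
    return accepted


if __name__ == "__main__":
    # A borderline hit is scored differently depending on the taxon.
    borderline = [("TPR", 24.5)]
    print(rescore(borderline, "Eukaryota"))
    print(rescore(borderline, "Bacteria"))
```

Under these assumed frequencies, a hit just below the threshold is accepted when the domain is enriched in the source taxon and rejected when it is depleted. This is one way a taxonomic prior could produce new domain instances of the kind the Results describe.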