An efficient algorithm for large-scale detection of protein families

1 April 2002

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 30 (7), 1575-1584
https://doi.org/10.1093/nar/30.7.1575

Abstract

Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs, and to comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome, and the resulting families have been used to annotate a large proportion of human proteins.
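
To make the clustering step described in the abstract more concrete, the following is a minimal sketch of the general MCL process (alternating expansion and inflation of a column-stochastic matrix) applied to a precomputed similarity matrix. It is not the TRIBE-MCL implementation: the function names, the inflation value of 2.0, the self-loop weighting, the convergence tolerance and the toy similarity matrix are all assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out


def expand(m):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise entries to a power, then renormalise the columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9, eps=1e-6):
    """Cluster nodes of a symmetric similarity matrix (hypothetical helper)."""
    n = len(similarity)
    m = [list(map(float, row)) for row in similarity]
    # Assumption: each self-loop gets the largest weight in its column.
    for i in range(n):
        off_diag = [m[k][i] for k in range(n) if k != i]
        m[i][i] = max(off_diag, default=0.0) or 1.0
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if row[j] > eps) for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Toy input (invented): two tight groups of three sequences joined by one
# weak similarity between sequence 2 and sequence 3.
similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
families = mcl(similarity)
print(families)
```

On this toy input the weak link between the two groups is suppressed by inflation, so the sketch should report the two triangles as separate families. In practice the similarity weights would come from a precomputed all-against-all sequence comparison, and a sparse matrix representation would be needed at database scale.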

Cited by 3177 articles