Rapid automatic detection and alignment of repeats in protein sequences

24 August 2000

journal article
research article
Published by Wiley in Proteins-Structure Function and Bioinformatics

Vol. 41 (2), 224-237
https://doi.org/10.1002/1097-0134(20001101)41:2<224::aid-prot70>3.0.co;2-z

Abstract

Many large proteins have evolved by internal duplication, and many internal sequence repeats correspond to functional and structural units. We have developed an automatic algorithm, RADAR, for segmenting a query sequence into repeats. The segmentation procedure has three steps: (i) repeat length is determined by the spacing between suboptimal self‐alignment traces; (ii) repeat borders are optimized to yield a maximal integer number of repeats; and (iii) distant repeats are validated by iterative profile alignment. The method identifies short, composition-biased repeats as well as gapped approximate repeats, and it handles complex repeat architectures involving many different types of repeats in the query sequence. No manual intervention and no prior assumptions about the number and length of repeats are required. Comparison to the Pfam‐A database indicates good coverage, accurate alignments, and reasonable repeat borders. Screening the Swissprot database revealed 3,000 repeats not annotated in existing domain databases. A number of these repeats had been described in the literature, but most were novel. This illustrates how, at a time when curated databases grapple with ever-increasing backlogs, automatic (re)analysis of sequences provides an efficient way to capture this important information. Proteins 2000;41:224–237.
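
The idea behind step (i), reading a repeat length off the spacing of self-similarity signals, can be illustrated with a deliberately simplified sketch. This is not the RADAR implementation: it swaps suboptimal self-alignment traces for plain identity counts along the diagonals of an ungapped self-comparison, and the function names `diagonal_scores` and `estimate_repeat_length` are hypothetical. It assumes an ungapped, near-exact tandem repeat.

```python
def diagonal_scores(seq: str) -> dict:
    """Count identical residues between seq and itself shifted by each offset.

    Assumption: ungapped identity matching stands in for the suboptimal
    self-alignment traces described in the abstract.
    """
    n = len(seq)
    scores = {}
    # Offsets above n // 2 cannot hold two full copies of a repeat.
    for offset in range(1, n // 2 + 1):
        scores[offset] = sum(
            1 for i in range(n - offset) if seq[i] == seq[i + offset]
        )
    return scores


def estimate_repeat_length(seq: str):
    """Return the offset with the most self-matches, or None if seq is too short.

    Ties are broken in favour of the smaller offset, so a period is
    preferred over its multiples.
    """
    scores = diagonal_scores(seq)
    if not scores:
        return None
    return max(scores, key=lambda offset: (scores[offset], -offset))


if __name__ == "__main__":
    # Toy sequence: four copies of a 5-residue unit, one substitution.
    toy = "ABCDEABCDEABXDEABCDE"
    print(estimate_repeat_length(toy))
```

Raw identity counts favour the shortest period because shorter offsets compare more positions. A realistic method would score with a substitution matrix and allow gaps, which this sketch does not attempt.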

Cited by 283 articles