Abstract
To maximize the chances of biological discovery, homology searching must use an up-to-date collection of sequences. However, the available sequence databases are growing rapidly and are partially redundant in content. This leads to increasing strain on CPU resources and decreasing density of first-hand annotation. These problems are addressed by clustering closely similar sequences to yield a covering of sequence space by a representative subset of sequences. No pair of sequences in the representative set has >90% mutual sequence identity. The representative set is derived by an exhaustive search for close similarities in the sequence database, in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters. The algorithm was applied to the union of the Swissprot, Swissnew, Trembl, Tremblnew, Genbank, PIR, Wormpep and PDB databases. The all-against-all comparison required to generate a representative set at 90% sequence identity was accomplished in 2 days of CPU time, and the removal of fragments and close similarities yielded a size reduction of 46%, from 260 000 unique sequences to 140 000 representative sequences. The practical implications are (i) faster homology searches using, for example, Fasta or Blast, and (ii) unified annotation for all sequences clustered around a representative. As tens of thousands of sequence searches are performed daily world-wide, appropriate use of the non-redundant database can lead to major savings in computer resources, without loss of efficacy. A regularly updated non-redundant protein sequence database (nrdb90), a server for homology searches against nrdb90, and a Perl script (nrdb90.pl) implementing the algorithm are available for academic use from http://www.embl-ebi.ac.uk/holm/nrdb90.
Contact: holm@embl-ebi.ac.uk
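The clustering idea described above can be illustrated with a minimal Python sketch. It is not the nrdb90.pl implementation: the function names, the single word length `k`, the `min_shared` cutoff, and the use of `difflib.SequenceMatcher` as a stand-in for a real sequence alignment are all assumptions made for illustration. The sketch processes sequences from longest to shortest, uses a cheap shared-pentapeptide test to skip most explicit comparisons, and keeps a sequence as a new representative only if it is not >90% identical to an existing one.

```python
from difflib import SequenceMatcher


def peptide_words(seq, k):
    """Return the set of all overlapping k-residue words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def identity(a, b):
    """Stand-in for alignment-based identity (an assumption of this sketch):
    matched residues divided by the length of the shorter sequence, so that
    a fragment scores 1.0 against its parent sequence."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def cluster_representatives(seqs, threshold=0.9, k=5, min_shared=1):
    """Greedy clustering: map each representative to its absorbed sequences.

    k and min_shared are hypothetical filter settings, not values taken
    from the paper. Sequences shorter than k have no words and therefore
    always become their own representative in this sketch.
    """
    reps = []       # (representative sequence, its word set)
    clusters = {}   # representative -> list of clustered sequences
    # Longest first, so fragments are absorbed by their full-length parent;
    # ties are broken alphabetically to keep the result deterministic.
    for seq in sorted(set(seqs), key=lambda s: (-len(s), s)):
        words = peptide_words(seq, k)
        for rep, rep_words in reps:
            # Cheap composition filter: skip the expensive comparison
            # when too few k-residue words are shared.
            if len(words & rep_words) < min_shared:
                continue
            if identity(seq, rep) > threshold:
                clusters[rep].append(seq)
                break
        else:
            reps.append((seq, words))
            clusters[seq] = []
    return clusters
```

Calling `cluster_representatives(list_of_sequences)` returns a dictionary whose keys form the representative set. The inner loop here scans every representative, which is quadratic in the worst case; a practical implementation would more likely look up candidate representatives through an index of peptide words.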