CD-HIT: accelerated for clustering the next-generation sequencing data
Open Access
- 11 October 2012
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 28 (23), 3150-3152
- https://doi.org/10.1093/bioinformatics/bts565
Abstract
Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.
Availability: http://cd-hit.org
Contact: liwz@sdsc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.