CD-HIT: accelerated for clustering the next-generation sequencing data

Open Access

11 October 2012

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 28 (23), 3150-3152
https://doi.org/10.1093/bioinformatics/bts565

Abstract

Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.

Availability: http://cd-hit.org

Contact: liwz@sdsc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Cited by 7973 articles