Capturing Whole-Genome Characteristics in Short Sequences Using a Naïve Bayesian Classifier
- 1 August 2001
- journal article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 11 (8), 1404-1409
- https://doi.org/10.1101/gr.186401
Abstract
Bacterial genomes have diverged during evolution, resulting in clear-cut differences in their nucleotide composition, such as their GC content. The analysis of complete sequences of bacterial genomes also reveals the presence of nonrandom sequence variation, manifest in the frequency profile of specific short oligonucleotides. These frequency profiles constitute highly specific genomic signatures. Based on these differences in oligonucleotide frequency between bacterial genomes, we investigated the possibility of predicting the genome of origin for a specific genomic sequence. To this end, we developed a naïve Bayesian classifier and systematically analyzed 28 eubacterial and archaeal genomes. We found that sequences as short as 400 bases could be correctly classified with an accuracy of 85%. We then applied the classifier to the identification of horizontal gene transfer events in whole-genome sequences and demonstrated the validity of our approach by correctly predicting the transfer of both the superoxide dismutase (sodC) and the bioC gene from Haemophilus influenzae to Neisseria meningitidis, correctly identifying both the donor and recipient species. We believe that this classification methodology could be a valuable tool in biodiversity studies.
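The classification idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the oligonucleotide length, the smoothing scheme, or the prior, so the trinucleotide setting (`k=3`), the add-one pseudocount, and the uniform prior over genomes are all assumptions made here. The two "genomes" are made-up toy strings, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers that consist only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def train(genomes, k=3, pseudocount=1.0):
    """Build {genome name: {k-mer: log probability}}.

    Assumption: add-one (Laplace) smoothing so unseen k-mers keep a
    small nonzero probability.
    """
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    model = {}
    for name, seq in genomes.items():
        counts = kmer_counts(seq, k)
        total = sum(counts.values()) + pseudocount * len(all_kmers)
        model[name] = {
            m: math.log((counts[m] + pseudocount) / total) for m in all_kmers
        }
    return model


def classify(model, fragment, k=3):
    """Return the best-scoring genome and all scores.

    Naive Bayes treats k-mers as independent, so the log-likelihood of a
    fragment is the sum of its k-mer log probabilities. A uniform prior
    over genomes is assumed, so it drops out of the comparison.
    """
    counts = kmer_counts(fragment, k)
    scores = {
        name: sum(n * logp[m] for m, n in counts.items())
        for name, logp in model.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (invented for illustration, not real genomes).
genomes = {
    "at_rich": "ATTAATATTA" * 50,
    "gc_rich": "GCCGGCGCCG" * 50,
}
model = train(genomes, k=3)

label_a, _ = classify(model, "ATATTAAT")
label_b, _ = classify(model, "GCGCCGGC")
print(label_a, label_b)
```

Under the same assumptions, a scan for candidate horizontal transfer could slide a window along one genome and flag windows whose best-scoring genome is not the host; the thresholds and window sizes such a scan would need are not given in the abstract.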