Data mining: Efficiency of using sequence databases for polymorphism discovery

Abstract
An open question in research on Single Nucleotide Polymorphisms (SNPs) is, what is the percentage of true SNPs found by in silico pre-screening? To this end, we selected 13 genes, and determined the complete collection of “true” polymorphisms, or polymorphisms experimentally detected, existing in these genes in our laboratory using Denaturing High Performance Liquid Chromatography (DHPLC) and fluorescent sequencing, or in other laboratories using comparable methods. The genes studied by our group were PTGS2, IGFBP1, IGFBP3, and CYP19. GenBank sequence information was then aligned using two methods, and sequence differences termed “candidate” polymorphisms. We then compared the series of SNPs obtained experimentally and in silico and we have found that in silico methods are relatively specific (up to 55% of candidate SNPs found by SNPFinder have been discovered by experimental procedure) but have low sensitivity (not more than 27% of true SNPs are found by in silico methods). Hum Mutat 17:141–150, 2001.