Efficient discovery of single-nucleotide polymorphisms in coding regions of human genes

Abstract
Single-nucleotide polymorphisms in protein-coding regions (cSNPs) are of great interest for their effects on phenotype and their potential for mapping disease genes. We have identified 5,400 novel exonic SNPs from alignments of public EST data to the draft human genome sequence, and approximately 12,000 more novel exonic SNPs from EST cluster alignments. We found 82% of the genomic-aligned SNPs and 63% of the EST-only SNPs to be detectably polymorphic in 20 Finnish DNA samples. 37% of the SNPs mapped to known protein-coding regions, yielding 6,500 distinct, novel cSNPs from the two datasets. These data reveal selection against mutations that alter protein structure, and distinct classes of genes under strongly positive vs. negative pressure from natural selection for amino acid replacement (detected by the K(A)/K(S) ratio). We have examined these cSNPs for compatibility with the amino acid profile at each site and for structural impact on protein core stability.
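
As a rough illustration of the K(A)/K(S) ratio mentioned above, the following minimal Python sketch computes the ratio from per-gene counts of nonsynonymous and synonymous differences and sites. It assumes a simple counting approach with a Jukes-Cantor correction for multiple hits; the abstract does not state which estimator was used, so this should not be read as the authors' method. The function names, the input counts, and the `tolerance` threshold are all hypothetical and chosen only for illustration.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion of differing sites must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return K(A)/K(S) from difference and site counts (assumed inputs)."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    if ks == 0:
        # Ratio is undefined without synonymous change.
        return math.inf if ka > 0 else math.nan
    return ka / ks


def classify(ratio, tolerance=0.1):
    """Label the selective pressure; `tolerance` is an arbitrary illustrative band."""
    if math.isnan(ratio):
        return "undetermined"
    if ratio > 1 + tolerance:
        return "positive"
    if ratio < 1 - tolerance:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # Hypothetical counts for two genes, not taken from the paper.
    for name, counts in {"geneA": (5, 700, 20, 300), "geneB": (30, 700, 5, 300)}.items():
        ratio = ka_ks(*counts)
        print(name, round(ratio, 3), classify(ratio))
```

A ratio well below 1 suggests that amino acid replacements are being removed by selection, while a ratio well above 1 suggests that replacements are favored; values near 1 are consistent with neutrality.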