Entropy-based SNP selection for genetic association studies

Abstract
Because of their abundance, density, and ease of practical use, single-nucleotide polymorphisms (SNPs) have become the major source of information for association gene mapping in humans. Sensible strategies for selecting practically useful SNPs are therefore required. Among the factors influencing the mapping utility of a given set of SNPs are (1) their individual diversity, (2) their haplotype structure in the population of interest, and (3) their physical distribution. We propose a strategy that integrates these aspects into a single mapping utility measure, which is based upon Shannon entropy and which maximizes the amount of information extracted from a genomic region under a Malécot model of linkage disequilibrium (LD) decay. The same utility measure is also used to define a criterion that guides SNP discovery and supports rational decision-making about the continuation or termination of a mapping study. The proposed strategy performs consistently well in a data set comprising 549 German control individuals, genotyped for 136 SNPs from four genomic regions of different LD structure. Adoption of the method in practice is estimated to save up to 30% of genotyping load when compared with equidistant SNP placement or pairwise LD minimization alone.
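
As a rough illustration of the two ingredients named above, the following Python sketch computes the Shannon entropy of a set of allele (or haplotype) frequencies and an exponential LD decay of the Malécot type. It is not the utility measure proposed in this paper, which is not specified in the abstract. The function names, the parameter names (epsilon, M, L), the decay form (1 - L) * M * exp(-epsilon * d) + L, and all numeric values are assumptions chosen for illustration only.

```python
import math


def shannon_entropy(freqs):
    """Shannon entropy in bits of a discrete frequency distribution.

    Zero frequencies are skipped, since p * log2(p) tends to 0 as p -> 0.
    """
    return -sum(p * math.log2(p) for p in freqs if p > 0)


def malecot_ld(distance, epsilon, M=1.0, L=0.0):
    """Expected association at a given distance under an assumed
    Malecot-type decay: (1 - L) * M * exp(-epsilon * distance) + L.

    epsilon: decay rate per unit distance (hypothetical value below)
    M: association at zero distance; L: residual association at large distance
    """
    return (1.0 - L) * M * math.exp(-epsilon * distance) + L


# A biallelic SNP with equal allele frequencies is maximally diverse (1 bit);
# a SNP with a rare minor allele carries less information.
h_common = shannon_entropy([0.5, 0.5])
h_rare = shannon_entropy([0.95, 0.05])

# Under the assumed decay, expected LD falls off with physical distance.
ld_near = malecot_ld(10, epsilon=0.01)
ld_far = malecot_ld(100, epsilon=0.01)
```

In this toy setting, a selection strategy would favor markers with high individual entropy that are not strongly associated with markers already chosen, which is the intuition behind combining diversity, haplotype structure, and physical spacing in one measure.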