Population Structure and Eigenanalysis
Open Access
- 1 January 2006
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Genetics
- Vol. 2 (12), e190
- https://doi.org/10.1371/journal.pgen.0020190
Abstract
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general “phase change” phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like F_ST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.

When analyzing genetic data, one often wishes to determine if the samples are from a population that has structure. Can the samples be regarded as randomly chosen from a homogeneous population, or does the data imply that the population is not genetically homogeneous? Patterson, Price, and Reich show that an old method (principal components) together with modern statistics (Tracy–Widom theory) can be combined to yield a fast and effective answer to this question. The technique is simple and practical on the largest datasets, and can be applied both to genetic markers that are biallelic and to markers that are highly polymorphic such as microsatellites. The theory also allows the authors to estimate the data size needed to detect structure if their samples are in fact from two populations that have a given, but small, level of differentiation.
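The eigenanalysis step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The function names are hypothetical. The two-population simulation (a Balding-Nichols style beta draw around an ancestral allele frequency) is an assumption chosen only to produce test data. The marker normalization and the approximate threshold F_ST ≈ 1/sqrt(nm), for n markers and m individuals split into two equal-sized populations, are recalled from the paper's main text and are not stated in the abstract. The Tracy–Widom p-value itself is omitted, because it requires a table or numerical evaluation of the Tracy–Widom distribution.

```python
import numpy as np


def normalized_genotypes(G):
    """Center each marker (column) and scale by sqrt(p(1-p)).

    G is an (individuals x markers) array of 0/1/2 allele counts.
    p is the allele frequency estimated from the column mean.
    Monomorphic markers are dropped to avoid division by zero.
    """
    G = np.asarray(G, dtype=float)
    mu = G.mean(axis=0)
    p = mu / 2.0
    keep = (p > 0) & (p < 1)
    return (G[:, keep] - mu[keep]) / np.sqrt(p[keep] * (1.0 - p[keep]))


def pca_eigen(G):
    """Eigenvalues (descending) and eigenvectors of M M^T / n."""
    M = normalized_genotypes(G)
    n = M.shape[1]
    X = M @ M.T / n                      # individuals x individuals
    vals, vecs = np.linalg.eigh(X)       # eigh returns ascending order
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]


def detection_threshold(n_markers, m_individuals):
    """Approximate F_ST below which structure is undetectable (assumed form)."""
    return 1.0 / np.sqrt(n_markers * m_individuals)


def simulate_two_populations(m_per_pop, n_markers, fst, rng):
    """Hypothetical test data: two populations drifted from one ancestor."""
    anc = rng.uniform(0.1, 0.9, size=n_markers)
    a = anc * (1.0 - fst) / fst
    b = (1.0 - anc) * (1.0 - fst) / fst
    freqs = rng.beta(a, b, size=(2, n_markers))
    pops = [rng.binomial(2, freqs[k], size=(m_per_pop, n_markers))
            for k in range(2)]
    return np.vstack(pops)


rng = np.random.default_rng(0)
G = simulate_two_populations(m_per_pop=50, n_markers=2000, fst=0.05, rng=rng)
vals, vecs = pca_eigen(G)
print("threshold F_ST:", detection_threshold(2000, 100))
print("two largest eigenvalues:", vals[:2])
```

In this sketch the simulated F_ST of 0.05 sits far above the assumed threshold, so the leading eigenvector should split the two groups cleanly; lowering `fst` toward the threshold is a simple way to see the "phase change" behaviour the abstract describes.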