Correction for hidden confounders in the genetic analysis of gene expression

21 September 2010

Journal article
Published in Proceedings of the National Academy of Sciences

Vol. 107 (38), 16465-16470
https://doi.org/10.1073/pnas.1002425107

Abstract

Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. One route to such an understanding is to determine which parts of our DNA, such as single-nucleotide polymorphisms, affect particular intermediary processes such as gene expression. Naively, such associations can be identified by applying a simple statistical test to every pairing of a genetic variant with a gene transcript. However, a wide variety of confounders lie hidden in the data and, if not properly addressed, lead to both spurious and missed associations. We present a statistical model that jointly corrects for two particular kinds of hidden structure, population structure (e.g., race, family relatedness) and microarray expression artifacts (e.g., batch effects), when these confounders are unknown. Applying our method to both real and synthetic, human and mouse data, we demonstrate the need for such a joint correction of confounders, as well as the disadvantages of other possible approaches based on those in the current literature. In particular, we show that our class of models has maximum power to detect eQTL on synthetic data and the best performance on a bronze standard applied to real data. Lastly, our software and the associations we found with it are available at http://www.microsoft.com/science.
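As a rough illustration of the naive baseline mentioned in the abstract (and not of the authors' joint model), the sketch below tests every variant against every transcript on toy data. The choice of a Pearson-correlation t statistic as the "simple statistical test", the function names `pearson_r` and `naive_scan`, and the simulated data are all assumptions made for illustration; no correction for population structure or batch effects is attempted.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant vector: no association can be estimated
    return sxy / math.sqrt(sxx * syy)


def naive_scan(genotypes, expression):
    """Test every SNP against every transcript (hypothetical helper).

    genotypes:  dict snp name -> minor-allele counts (0/1/2), one per individual
    expression: dict gene name -> expression levels, one per individual
    Returns (snp, gene, r, t) tuples sorted by decreasing |t|.
    """
    results = []
    for snp, g in genotypes.items():
        for gene, e in expression.items():
            n = len(g)
            r = pearson_r(g, e)
            # t statistic for H0: r = 0, with n - 2 degrees of freedom
            t = r * math.sqrt((n - 2) / max(1e-12, 1 - r * r))
            results.append((snp, gene, r, t))
    results.sort(key=lambda row: -abs(row[3]))
    return results


# Toy data (assumed): 200 individuals, 3 SNPs, 2 transcripts.
rng = random.Random(0)
n = 200
genotypes = {
    f"snp{i}": [rng.choice([0, 1, 2]) for _ in range(n)] for i in range(3)
}
expression = {
    # geneA is simulated with a true additive effect of snp0
    "geneA": [1.0 * g + rng.gauss(0, 1) for g in genotypes["snp0"]],
    # geneB has no genetic effect
    "geneB": [rng.gauss(0, 1) for _ in range(n)],
}

hits = naive_scan(genotypes, expression)
for snp, gene, r, t in hits:
    print(f"{snp}\t{gene}\tr={r:+.3f}\tt={t:+.2f}")
```

In this clean simulation the strongest pair is the planted one. On real data, hidden population structure or batch effects can inflate or mask these statistics, which is the problem the abstract says the joint model is meant to address.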

Cited by 133 articles