Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis
Open Access
- 28 September 2007
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Genetics
- Vol. 3 (9), e161-35
- https://doi.org/10.1371/journal.pgen.0030161
Abstract
It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes. This phenomenon is true even for well-designed, randomized studies. We introduce “surrogate variable analysis” (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.
In scientific and medical studies, great care must be taken when collecting data to understand the relationship between two variables, such as a drug and its effect on a disease. In any given study there will be many other variables at play, such as the effects of age and sex on the disease. We show that in studies where the expression levels of thousands of genes are measured at once, these issues become surprisingly critical. Due to the complexity of our genomes, environment, and demographic features, there are many sources of variation when analyzing gene expression levels. In any given study, it is impossible to measure every single variable that may be influencing how our genes are expressed. Despite this, we show that by considering all expression levels simultaneously, one can actually recover the effects of these important missed variables and essentially produce an analysis as if all relevant variables were included. As opposed to traditional studies, the massive amount of data available in this setting is what makes the method, called surrogate variable analysis, possible. We hypothesize that surrogate variable analysis will be useful in many large-scale gene expression studies.
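The idea that an unmeasured factor can be recovered by looking at all genes at once can be illustrated with a small simulation. The sketch below is not the authors' SVA algorithm, which involves further steps not described in this abstract. It is a simplified stand-in that regresses each gene on the measured variable and takes the leading singular vector of the residuals as a single surrogate. The simulated data, the function names `simulate` and `estimate_surrogate`, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np


def simulate(n_genes=1000, n_samples=40, seed=0):
    """Simulate expression data with one measured and one hidden factor (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Measured variable of interest, e.g. a balanced two-class disease label.
    group = np.tile([0.0, 1.0], n_samples // 2)
    # Unmeasured factor, e.g. a batch or environmental effect.
    hidden = rng.standard_normal(n_samples)
    # The first 100 genes respond to the measured variable.
    beta = np.zeros(n_genes)
    beta[:100] = 1.0
    # Genes 50..599 respond to the hidden factor with varying strength.
    gamma = np.zeros(n_genes)
    gamma[50:600] = rng.normal(0.0, 2.0, size=550)
    noise = rng.standard_normal((n_genes, n_samples))
    expr = np.outer(beta, group) + np.outer(gamma, hidden) + noise
    return expr, group, hidden


def estimate_surrogate(expr, group):
    """Return one surrogate variable: the leading singular vector of the residuals."""
    n_samples = expr.shape[1]
    # Design matrix with an intercept and the measured variable.
    design = np.column_stack([np.ones(n_samples), group])
    # Fit every gene at once; coef has shape (2, n_genes).
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    # Residuals (samples x genes) hold whatever the measured variable cannot explain.
    resid = expr.T - design @ coef
    # Variation shared across many genes shows up in the top left singular vector.
    u, _, _ = np.linalg.svd(resid, full_matrices=False)
    return u[:, 0]


if __name__ == "__main__":
    expr, group, hidden = simulate()
    surrogate = estimate_surrogate(expr, group)
    corr = np.corrcoef(surrogate, hidden)[0, 1]
    print(f"|correlation| between surrogate and hidden factor: {abs(corr):.3f}")
```

In a follow-up analysis the estimated surrogate would be added as an extra covariate alongside the variable of interest when testing each gene. Because so many genes share the hidden signal, it stands out clearly from gene-specific noise, which is the sense in which the large amount of data makes the approach possible.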