Asymptotic Conditional Singular Value Decomposition for High-Dimensional Genomic Data
- 16 June 2010
- journal article
- Published by Oxford University Press (OUP) in Biometrics
- Vol. 67 (2), 344-352
- https://doi.org/10.1111/j.1541-0420.2010.01455.x
Abstract
Summary: High‐dimensional data, such as those obtained from a gene expression microarray or second generation sequencing experiment, consist of a large number of dependent features measured on a small number of samples. One of the key problems in genomics is the identification and estimation of factors that associate with many features simultaneously. Identifying the number of factors is also important for unsupervised statistical analyses such as hierarchical clustering. A conditional factor model is the most common model for many types of genomic data, ranging from gene expression, to single nucleotide polymorphisms, to methylation. Here we show that under a conditional factor model for genomic data with a fixed sample size, the right singular vectors are asymptotically consistent for the unobserved latent factors as the number of features diverges. We also propose a consistent estimator of the dimension of the underlying conditional factor model for a finite fixed sample size and an infinite number of features based on a scaled eigen‐decomposition. We propose a practical approach for selection of the number of factors in real data sets, and we illustrate the utility of these results for capturing batch and other unmodeled effects in a microarray experiment using the dependence kernel approach of Leek and Storey (2008, Proceedings of the National Academy of Sciences of the United States of America 105, 18718–18723).
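The central claim of the abstract, that the right singular vectors recover the latent factors when the sample size is fixed and the number of features grows, can be illustrated with a small simulation. The following is a minimal sketch and not the authors' estimator: the function names `simulate_factor_data` and `subspace_agreement`, the Gaussian factors, coefficients and noise, and the use of principal angles as the agreement measure are all assumptions made for illustration.

```python
import numpy as np


def simulate_factor_data(m, n, r, noise_sd=1.0, seed=0):
    """Simulate an m x n data matrix (features x samples) from a factor model.

    Hypothetical setup: X = B @ G + E with Gaussian entries throughout.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((r, n))             # latent factors, one row per factor
    B = rng.standard_normal((m, r))             # feature-specific coefficients
    E = noise_sd * rng.standard_normal((m, n))  # independent noise
    return B @ G + E, G


def subspace_agreement(X, G):
    """Smallest cosine of the principal angles between the span of the
    top-r right singular vectors of X and the row space of G.

    A value near 1 means the two r-dimensional subspaces nearly coincide.
    """
    r = G.shape[0]
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_r = Vt[:r].T                 # n x r, top-r right singular vectors
    Q, _ = np.linalg.qr(G.T)       # n x r, orthonormal basis for the factor space
    cosines = np.linalg.svd(Q.T @ V_r, compute_uv=False)
    return cosines.min()


if __name__ == "__main__":
    # Fixed sample size n, increasing number of features m.
    for m in (50, 500, 20000):
        X, G = simulate_factor_data(m=m, n=20, r=2)
        print(m, subspace_agreement(X, G))
```

Under these assumptions the agreement measure should move toward 1 as `m` increases while `n` stays at 20. The sketch compares subspaces rather than individual vectors because latent factors in a model of this kind are identified only up to an invertible linear transformation.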
This publication has 27 references indexed in Scilit:
- High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 2008
- Genome-Wide Association Analysis Identifies Loci for Type 2 Diabetes and Triglyceride Levels. Science, 2007
- Mapping complex disease loci in whole-genome association studies. Nature, 2004
- Asymptotic distributions of principal components based on robust dispersions. Biometrika, 2003
- Determining the Number of Factors in Approximate Factor Models. Econometrica, 2002
- Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences, 2000
- A Test for the Number of Factors in an Approximate Factor Model. The Journal of Finance, 1993
- Remarks on Parallel Analysis. Multivariate Behavioral Research, 1992
- The Asymptotic Normal Distribution of Estimators in Factor Analysis under General Conditions. The Annals of Statistics, 1988
- Asymptotic Theory for Principal Component Analysis. The Annals of Mathematical Statistics, 1963