Significance analysis of functional categories in gene expression studies: a structured permutation approach

Abstract
In high-throughput genomic and proteomic experiments, investigators monitor expression across a set of experimental conditions. To gain an understanding of broader biological phenomena, researchers have until recently been limited to post hoc analyses of significant gene lists. We describe a general framework, significance analysis of function and expression (SAFE), for conducting valid tests of gene categories ab initio. SAFE is a two-stage, permutation-based method that can be applied to various experimental designs, accounts for the unknown correlation among genes and enables permutation-based estimation of error rates. The utility and flexibility of SAFE is illustrated with a microarray dataset of human lung carcinomas and gene categories based on Gene Ontology and the Protein Family database. Significant gene categories were observed in comparisons of (1) tumor versus normal tissue, (2) multiple tumor subtypes and (3) survival times. Code to implement SAFE in the statistical package R is available from the authors. http://www.bios.unc.edu/~fwright/SAFE.