The effect of GeneChip gene definitions on the microarray study of cancers

Abstract
The Affymetrix GeneChip is a popular microarray platform for genome-wide expression profiling and has been widely used in functional genomics, especially in the classification of cancers. Because genome data are continually updated, much of the genome information with which the chips were designed is now out of date, and it has been reported that many of the genes/transcripts on the chips differ from their original definitions when the probes are mapped to the new genome information. Dai et al. reported that the updated definition can cause as much as a 30–50% discrepancy in the genes selected as differentially expressed in a heart tissue expression profiling dataset. Understanding the nature of this difference is therefore very important for making use of the data. In this work, using a large cancer dataset as an example, we compared two major definitions and investigated their effects on classification, clustering, the discovery of differentially expressed genes and gene-set-based analysis. The results show that the two definitions agree well on clustering and classification, but the genes and gene sets discovered as differentially expressed or enriched can be very different. Discoveries based on the Affymetrix definition cover most of those based on the new definition, but tend to include more false positives. BioEssays 28: 739–746, 2006.
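The comparison of differentially expressed gene lists described above can be illustrated with a small sketch. It is not the procedure used in the paper: the abstract does not state how discrepancy was measured, so the function name `compare_gene_lists`, the two measures it reports, and the gene symbols are all assumptions made for illustration. It also assumes both lists have already been mapped to a common identifier such as gene symbols, since probe-set identifiers generally differ between definitions.

```python
def compare_gene_lists(original, updated):
    """Summarize agreement between two sets of selected gene identifiers.

    Hypothetical helper: the measures below are illustrative assumptions,
    not the statistics reported in the paper.
    """
    original, updated = set(original), set(updated)
    shared = original & updated
    union = original | updated
    return {
        "shared": len(shared),
        "only_original": len(original - updated),
        "only_updated": len(updated - original),
        # Fraction of the updated-definition list also found by the original one
        "coverage_of_updated": len(shared) / len(updated) if updated else 0.0,
        # Fraction of the combined list on which the two definitions disagree
        "discrepancy": 1 - len(shared) / len(union) if union else 0.0,
    }


# Made-up gene lists, for illustration only
affy_hits = {"TP53", "MYC", "EGFR", "BRCA1", "KRAS"}
new_hits = {"TP53", "MYC", "EGFR", "PTEN"}
summary = compare_gene_lists(affy_hits, new_hits)
```

In this toy case the original-definition list recovers three of the four updated-definition genes while adding two genes of its own, which mirrors the pattern of high coverage with extra, possibly spurious, discoveries.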