Most “Dark Matter” Transcripts Are Associated With Known Genes

Open Access

18 May 2010

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Biology

Vol. 8 (5), e1000371
https://doi.org/10.1371/journal.pbio.1000371

Abstract

A series of reports over the last few years have indicated that a much larger portion of the mammalian genome is transcribed than can be accounted for by currently annotated genes, but the quantity and nature of these additional transcripts remain unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess the quantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seq identifies many fewer transcribed regions (“seqfrags”) outside known exons and ncRNAs. Most nonexonic seqfrags are in introns, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority of intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sites identified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads that display characteristics of random sampling from a low-level background or several thousand small transcripts (median length = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions with open chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance are generally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.

Author Summary

The human genome was sequenced a decade ago, but its exact gene composition remains a subject of debate. The number of protein-coding genes is much lower than initially expected, and the number of distinct transcripts is much larger than the number of protein-coding genes. Moreover, the proportion of the genome that is transcribed in any given cell type remains an open question: results from “tiling” microarray analyses suggest that transcription is pervasive and that most of the genome is transcribed, whereas new deep sequencing-based methods suggest that most transcripts originate from known genes. We have addressed this discrepancy by comparing samples from the same tissues using both technologies. Our analyses indicate that RNA sequencing appears more reliable for transcripts with low expression levels, that most transcripts correspond to known genes or are near known genes, and that many transcripts may represent new exons or aberrant products of the transcription process. We also identify several thousand small transcripts that map outside known genes; their sequences are often conserved and are often encoded in regions of open chromatin. We propose that most of these transcripts may be by-products of the activity of enhancers, which associate with promoters as part of their role as long-range gene regulatory sites. Overall, however, we find that most of the genome is not appreciably transcribed.

