The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes

4 June 2009

journal article
Published by Cold Spring Harbor Laboratory in Genome Research

Vol. 19 (7), 1316-1323
https://doi.org/10.1101/gr.080531.108

Abstract

Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, they use different methods, which can result in similar but not identical representations of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID) and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates manual review of protein annotations that are inconsistent between sites, as well as annotations for which new evidence suggests a revision is needed, in order to converge progressively on a complete protein-coding set for the human and mouse reference genomes while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups that are not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically annotated protein-coding regions.
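
The idea of tracking "identical protein annotations" across resources can be illustrated with a minimal sketch. It is not the CCDS pipeline, which also applies quality checks and manual review. It only assumes that each resource supplies coding regions as a chromosome, a strand, and a list of coding-exon intervals, and it reports the structures that match exactly in both sets. The function names, identifiers, and coordinates below are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]                        # (start, end) of one coding exon
CdsKey = Tuple[str, str, Tuple[Interval, ...]]    # (chromosome, strand, exons)


def cds_key(chrom: str, strand: str, exons: List[Interval]) -> CdsKey:
    """Build a hashable key for a coding region; exons are sorted so that
    the order in which a resource lists them does not affect the comparison."""
    return (chrom, strand, tuple(sorted(exons)))


def consensus_cds(
    set_a: Dict[str, CdsKey], set_b: Dict[str, CdsKey]
) -> Dict[CdsKey, Tuple[List[str], List[str]]]:
    """Return coding regions annotated identically in both sets, mapped to
    the source identifiers that carry them in each set."""
    by_key_a: Dict[CdsKey, List[str]] = {}
    for ident, key in set_a.items():
        by_key_a.setdefault(key, []).append(ident)

    by_key_b: Dict[CdsKey, List[str]] = {}
    for ident, key in set_b.items():
        by_key_b.setdefault(key, []).append(ident)

    # Only exact coordinate matches count as consensus in this sketch.
    shared = by_key_a.keys() & by_key_b.keys()
    return {k: (sorted(by_key_a[k]), sorted(by_key_b[k])) for k in shared}


# Hypothetical toy annotations from two resources.
resource_a = {
    "A1": cds_key("chr1", "+", [(100, 200), (300, 400)]),
    "A2": cds_key("chr1", "+", [(100, 200), (300, 450)]),
}
resource_b = {
    "B1": cds_key("chr1", "+", [(300, 400), (100, 200)]),
    "B2": cds_key("chr2", "-", [(10, 50)]),
}

shared = consensus_cds(resource_a, resource_b)
```

Here only the structure shared by A1 and B1 is reported. A2 differs by a single exon boundary, which is the kind of similar but not identical annotation the abstract says the project routes to manual review.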


Cited by 515 articles