The Human Genome Project—An Overview

Abstract
The human genome sequence will underpin human biology and medicine in the next century, providing a single, essential reference to all genetic information. The international program to determine the complete DNA sequence (3,000 million bases) is well underway. As of January 2000, 50% of the sequence is available in the public domain. A comprehensive working draft is expected this year, and the entire sequence is projected to be finished in 2003. DNA sequencing is carried out on mapped, overlapping bacterial clones of 150-200 kb. The working draft comprises assembled unfinished sequence and is released immediately in the public domain. The draft sequence of each clone is then completed, by closing any remaining gaps and resolving any ambiguities, before the entire sequence is checked, annotated, and submitted to the public databases. The sequence of each clone is finished to an accuracy of >99.99%. The availability of a reference sequence of the genome provides the basis for studying the nature of sequence variation, particularly single nucleotide polymorphisms (SNPs), in human populations. SNP typing is a powerful tool for genetic analysis, and will enable us to uncover the association of loci at specific sites in the genome with many disease traits. SNPs occur at a frequency of approximately 1 SNP/kb throughout the genome when the sequence of any two individuals is compared. Programs to detect and map SNPs in the human genome are underway with the aim of establishing a SNP map of the genome during the next two years. The human genome sequence will provide a complete description of all the genes. Annotation of the sequence with the gene structures is achieved by a combination of computational analysis (predictive and homology-based) and experimental confirmation by cDNA sequencing. Detecting homologies between newly defined gene products and proteins of known function helps to postulate biochemical functions for them, which can then be tested. 
Establishing the association of specific genes with disease phenotypes by mutation screening, particularly for monogenic disorders, provides further assistance in defining the functions of some gene products, as well as helping to establish the cause of the disease. As our knowledge of gene sequences and sequence variation in populations increases, we will pinpoint more and more of the genes and proteins that are important in common, complex diseases. A more detailed understanding of the function of the human genome will be achieved as we identify sequences that control gene expression. Given the availability of gene sequences, the expression status of genes in particular tissues can be monitored in parallel. By comparing corresponding genomic sequences in different species (for example, human, mouse, chicken, and zebrafish), regions that have been highly conserved during evolution can be identified, many of which reflect conserved functions such as gene regulation. These approaches promise to greatly accelerate our interpretation of the human genome sequence.
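
The figures quoted in the abstract imply a few orders of magnitude that are worth making explicit. The following is a rough back-of-envelope sketch in Python, not part of the project's own methods. It uses only the numbers stated above (3,000 million bases, clones of 150-200 kb, accuracy of >99.99%, approximately 1 SNP/kb between two individuals). The function names are hypothetical, and the clone count assumes a perfect tiling with no overlap, which understates the real number because the mapped clones overlap.

```python
# Back-of-envelope arithmetic from the figures quoted in the abstract.
# All function names are illustrative; the simplifying assumptions are
# noted in the comments.

GENOME_BASES = 3_000_000_000   # 3,000 million bases
SNPS_PER_KB = 1                # approx. 1 SNP/kb between any two individuals
CLONE_KB_RANGE = (150, 200)    # bacterial clone size range, in kb
BASES_PER_ERROR = 10_000       # >99.99% accuracy: fewer than 1 error per 10,000 bases


def expected_snps(genome_bases=GENOME_BASES, snps_per_kb=SNPS_PER_KB):
    """Approximate number of SNP differences between two individuals."""
    return genome_bases // 1_000 * snps_per_kb


def min_clone_counts(genome_bases=GENOME_BASES, clone_kb_range=CLONE_KB_RANGE):
    """Lower bound on clones needed, assuming zero overlap (an idealization).

    Returns (count using the largest clones, count using the smallest clones).
    """
    smallest_kb, largest_kb = clone_kb_range
    # -(-a // b) is ceiling division on integers.
    return tuple(-(-genome_bases // (kb * 1_000)) for kb in (largest_kb, smallest_kb))


def max_errors(genome_bases=GENOME_BASES, bases_per_error=BASES_PER_ERROR):
    """Upper bound on residual errors genome-wide at the 99.99% threshold."""
    return genome_bases // bases_per_error


if __name__ == "__main__":
    print("Expected SNP differences between two genomes:", expected_snps())
    print("Minimum clones (no overlap), largest/smallest:", min_clone_counts())
    print("Upper bound on errors at 99.99% accuracy:", max_errors())
```

Under these assumptions, two individuals would differ at roughly 3 million sites, at least 15,000 to 20,000 clones would be needed to cover the genome, and a finished sequence meeting the stated threshold would contain fewer than 300,000 errors in total.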