Nucleotide sequences of complementary deoxyribonucleic acids for the proα1 chain of human type I procollagen. Statistical evaluation of structures that are conserved during evolution

Abstract
Nucleotide sequences were determined for two cloned complementary DNAs (cDNAs) encoding over three-quarters of the proα1(I) chain of human type I procollagen. Comparison with previously published amino acid sequences of the α1(I) chain of type I collagen made it possible to examine mutations in the transcribed products of the gene that occurred during the evolution of man, calf, rat, mouse and chick. Comparison of the nucleotide sequences with the corresponding cDNA sequences from chick, and with the cDNA for the human proα2(I) chain, demonstrated that selective pressure over 250 million or more years of evolution acted more strongly on the structure of the proα1(I) chain than on that of the proα2(I) chain. To improve the reliability of the comparison, the nucleotide sequences were examined with a modification of previous procedures for evaluating mutations at replacement sites and silent sites. The corrected divergence at replacement sites was 6 ± 0.8% between the α1(I) chains, whereas it was 15 ± 1.9% between the α2(I) chains. The C-propeptide domain of the proα1(I) chain was also highly conserved, with a corrected divergence at replacement sites of 5 ± 0.9%, a value not distinguishable from that previously found for the C-propeptide of the proα2(I) chain. A large part of the structure of both C-propeptides therefore appears to be under selective pressure. Inspection of the changes in the C-propeptide of the proα1(I) chain suggested a highly conserved region around the carbohydrate attachment site, similar to the highly conserved region of 37 amino acids previously found in the C-propeptide of the proα2(I) chain. Two statistical tests, however, were unable to confirm a nonrandom distribution of changes in the C-propeptide of the proα1(I) chain. The same tests established a nonrandom distribution of nucleotide changes in the C-propeptide of the proα2(I) chain.
The 3'-noncoding region of the cDNA for the proα1(I) chain of human type I procollagen showed no homology with the same region in the chick. Analysis of codon usage for the α1(I) chain indicated the same third-base preference for U and C in codons for Gly, Pro and Ala that was previously noted for the chick α1(I) chain. The corrected divergences at replacement sites for the proα1(I) and proα2(I) genes in man, mouse and chick were used to estimate the time since the two genes diverged from each other. The divergence apparently occurred 950 ± 120 million years ago. Since this date precedes several estimates for the first appearance of metazoa, it is possible that the proα genes duplicated before the first multicellular organisms arose. Alternatively, the assumptions for estimating the time of gene duplication based on the evolutionary clock hypothesis may not be valid for collagen genes.
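The third-base preference mentioned above can be illustrated with a minimal Python sketch. It is not the procedure used in the study. It assumes an in-frame coding sequence and simply tallies which base occupies the third position of Gly (GGN), Pro (CCN) and Ala (GCN) codons. The function names `third_base_usage` and `pyrimidine_fraction` and the toy sequence are hypothetical, chosen for illustration only.

```python
from collections import Counter

# Four-codon families: the first two bases determine the amino acid (RNA alphabet).
FAMILIES = {"Gly": "GG", "Pro": "CC", "Ala": "GC"}


def third_base_usage(coding_seq):
    """Tally third-position bases of Gly, Pro and Ala codons.

    Assumes coding_seq starts in frame; a trailing partial codon is ignored.
    DNA input is accepted and converted to the RNA alphabet (T -> U).
    """
    seq = coding_seq.upper().replace("T", "U")
    usage = {aa: Counter() for aa in FAMILIES}
    usable_length = len(seq) - len(seq) % 3
    for i in range(0, usable_length, 3):
        codon = seq[i:i + 3]
        for aa, prefix in FAMILIES.items():
            if codon.startswith(prefix):
                usage[aa][codon[2]] += 1
    return usage


def pyrimidine_fraction(counts):
    """Fraction of codons ending in U or C; NaN if no codons were counted."""
    total = sum(counts.values())
    return (counts["U"] + counts["C"]) / total if total else float("nan")


# Hypothetical toy sequence, not taken from the procollagen cDNA.
toy = "GGTCCCGCTGGCCCTGCAGGACCG"
for aa, counts in third_base_usage(toy).items():
    print(aa, dict(counts), pyrimidine_fraction(counts))
```

Applied to a real coding sequence, a pyrimidine fraction well above one half for these three amino acids would correspond to the kind of U and C preference described in the abstract.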