Abstract
Based on a formal proof that justifies the search for generative grammars in the study of gene regulation, a linguistic formalization of an exhaustive database of Escherichia coli sigma 70 promoters and their regulatory binding sites has been initiated. The grammar presented here generates all the arrays in the collection, plus those predicted to be consistent with the principles of regulation of sigma 70 promoters. "Systems of regulation," that is, sets of regulatory sites that collaborate in a mechanism of regulation, are represented by means of syntactic categories. A small set of phrase structure rules, restricted by an X-bar principle and by a hierarchical c-command relation, generates a representation of arrays of sites of regulation where the selection of the protein(s) identifying the system(s) of regulation occurs. Based on the features of the proteins, optional duplicated proximal and remote sites are generated by means of transformational rules. Consistency with the data, the predictions that the grammar generates, and important similarities and differences with some aspects of the generative theory of natural language are discussed.