Constructing Multigenome Views of Whole Microbial Genomes

Abstract
We have designed and implemented a system for cross-genome comparison of open reading frames (ORFs) from multiple genomes. The implementation includes a genome profiling system that lets us explore pairwise comparisons at different levels of match similarity and pose biologically motivated queries about the number and identity of ORFs, their function and functional category, their distribution across genomes or biological domains, and statistics on their matches and match families. This analysis required precise definitions of new classification terms and concepts. We define the terms genomic signature, summary signature, biologic domain signature, domain class, match level, match family, and extended match family, and then use them to define concepts such as genomically universal proteins and proteins characteristic of sets of genomes. We initiate an analysis based on automated FASTA (Pearson, 1996) comparison of 22,419 conceptually translated protein sequences from nine microbial genomes.