Finding pathogenicity islands and gene transfer events in genome data

Abstract
Motivation: There is a growing literature on wavelet theory and wavelet methods showing improvements over more classical techniques, especially in the contexts of smoothing and extraction of the fundamental components of signals. G+C patterns occur at different lengths (scales) and, for this reason, G+C plots are usually difficult to interpret. Current methods for genome analysis choose a window size and compute a \({\chi}^{2}\) statistic comparing the average value in each window with that of the whole genome.

Results: Firstly, wavelets are used to smooth G+C profiles in order to locate characteristic patterns in genome sequences. The method we use is based on computing a \({\chi}^{2}\) statistic on the wavelet coefficients of a profile; because the smoothing occurs at a set of different scales, we do not need to choose a fixed window size. Secondly, a wavelet scalogram is used as a measure for sequence profile comparison; this tool is very general and can be applied to other sequence profiles commonly used in genome analysis. We show applications to the analysis of Deinococcus radiodurans chromosome I, of two strains of Helicobacter pylori (26695, J99) and of two strains of Neisseria meningitidis (serogroup B strain MC58 and serogroup A strain Z2491). We report a list of loci whose G+C content differs from that of the nearby regions; the analysis of N. meningitidis serogroup B shows two new large regions with low G+C content that are putative pathogenicity islands.

Availability: Software and numerical results (profiles, scalograms, high and low frequency components) for all the genome sequences analyzed are available upon request from the authors.

Contact: p.lio@zoo.cam.ac.uk
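To make the idea of a multiscale G+C profile and its scalogram concrete, the following is a minimal illustrative sketch, not the authors' software. It assumes a Haar wavelet (the abstract does not say which wavelet family is used), and it defines the scalogram as the energy of the detail coefficients at each scale. The function names `gc_indicator`, `haar_decompose` and `scalogram` are hypothetical, and the \({\chi}^{2}\) test on the coefficients is not implemented here.

```python
import numpy as np

def gc_indicator(seq):
    """Return 1.0 at positions holding G or C, 0.0 elsewhere."""
    return np.array([1.0 if base in "GCgc" else 0.0 for base in seq])

def haar_decompose(signal, levels):
    """Orthonormal Haar transform (assumed wavelet, chosen for simplicity).

    Returns the coarse approximation and a list of detail-coefficient
    arrays ordered from the finest scale to the coarsest. The signal is
    truncated so that its length is a multiple of 2**levels.
    """
    signal = np.asarray(signal, dtype=float)
    usable = len(signal) - len(signal) % (2 ** levels)
    if usable == 0:
        raise ValueError("signal is too short for the requested levels")
    approx = signal[:usable]
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # local differences
        approx = (even + odd) / np.sqrt(2.0)         # local sums
    return approx, details

def scalogram(details):
    """Energy (sum of squared detail coefficients) at each scale."""
    return np.array([np.sum(d ** 2) for d in details])

# Toy example: a sequence whose G+C content alternates every two bases.
profile = gc_indicator("ATGC" * 4)
approx, details = haar_decompose(profile, levels=2)
print(scalogram(details))
```

In this toy sequence the G+C signal changes only every two bases, so the energy is concentrated at the second scale rather than the first. Two profiles could then be compared scale by scale through their scalograms, without fixing a single window size.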