Abstract
Over a hundred families of non-long terminal repeat retrotransposons (non-LTRs) were identified in the newly released Anopheles gambiae genome assembly through a comprehensive, reiterative search that used the conserved reverse transcriptase (RT) domains of known non-LTRs as starting queries. These families, which are defined by at least 20% amino acid sequence divergence in their RT domains, range in copy number from a few to approximately 2,000 and occupy at least 3% of the genome. In addition to comprising an unprecedented number of diverse families, A. gambiae non-LTRs represent 8 of the 15 previously defined clades plus two novel clades, Loner and Outcast, more than have been reported for any other genome. Five families belong to the L1 clade, which previously had no known invertebrate representatives. One unique family, named Sponge, contains only a complete open reading frame (ORF) for the Gag-like protein and appears to have been mobilized by a family of the CR1 clade. Although most families appear to be inactive, as expected, all clades except R4 contain families with characteristics suggesting recent activity. At least 21 families have multiple full-length copies with over 99% nucleotide identity and some or all of the following characteristics: target site duplications (TSDs), intact ORFs, and corresponding expressed sequence tags (ESTs). The remarkable diversity and the maintenance of multiple recently active lineages within different clades indicate a complex evolutionary scenario. A. gambiae non-LTRs have the potential to be developed as tools for population genetic studies and for genetic manipulation of this primary vector of malaria, a devastating disease. The semi-automated reiterative search approach described here may be used with any genome assembly to systematically survey and characterize non-LTRs, as well as other transposable elements that encode a conserved protein.
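The family definition above (at least 20% amino acid divergence between RT domains) can be illustrated with a minimal sketch. It is not the procedure used in this study: it assumes the RT-domain sequences are already aligned to equal length, measures divergence as the fraction of mismatched residues over ungapped columns, and groups sequences by single linkage. The function names `divergence` and `group_families` and the toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def divergence(a: str, b: str) -> float:
    """Fraction of differing residues between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 1.0  # no comparable positions: treat as fully divergent
    return sum(x != y for x, y in columns) / len(columns)


def group_families(seqs: Dict[str, str], threshold: float = 0.20) -> List[Set[str]]:
    """Group sequences whose pairwise divergence is below `threshold`.

    Assumption: single-linkage grouping via union-find, so two sequences
    share a family if a chain of below-threshold pairs connects them.
    """
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if divergence(seqs[a], seqs[b]) < threshold:
            parent[find(a)] = find(b)

    families: Dict[str, Set[str]] = {}
    for name in seqs:
        families.setdefault(find(name), set()).add(name)
    return sorted(families.values(), key=lambda members: sorted(members))


# Toy, made-up aligned fragments (not real RT domains).
toy = {
    "copy_a": "ACDEFGHIKL",
    "copy_b": "ACDEFGHIKM",  # 1 of 10 residues differs from copy_a
    "copy_c": "ACDEFGHWWW",  # 3 of 10 residues differ from copy_a and copy_b
}
families = group_families(toy)
```

With these toy inputs, `copy_a` and `copy_b` fall into one family and `copy_c` forms its own. A real analysis would start from a proper multiple alignment and might use a different distance measure or linkage rule.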