Annotating Large Genomes With Exact Word Matches

Open Access

1 January 2003

journal article
research article
Published by Cold Spring Harbor Laboratory in Genome Research

Vol. 13 (10), 2306-2315
https://doi.org/10.1101/gr.1350803

Abstract

We have developed a tool for rapidly determining the number of exact matches of any word within large, internally repetitive genomes or sets of genomes. Thus, we can readily annotate any sequence, including the entire human genome, with the counts of its constituent words. We create a Burrows-Wheeler transform of the genome, which, together with auxiliary data structures that facilitate counting, can reside in about one gigabyte of RAM. Our original interest was motivated by oligonucleotide probe design, and we describe a general protocol for defining unique hybridization probes. Our method also has applications in the analysis of genome structure and assembly. We demonstrate the identification of chromosome-specific repeats and outline a general procedure for finding undiscovered repeats. We also illustrate the changing contents of the human genome assemblies by comparing the annotations built from different genome freezes.
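To make the counting idea concrete, the following is a minimal sketch of exact word counting over a Burrows-Wheeler transform using the standard backward-search technique. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_index`, `count_word`, `annotate`) are hypothetical, the suffix array is built by naive sorting, and the occurrence tables are stored uncompressed, so it is only suitable for toy inputs and says nothing about how the tool fits a genome into about one gigabyte of RAM.

```python
def build_index(text):
    """Build a BWT and simple counting tables for `text` (toy-scale only)."""
    s = text + "$"  # assumed sentinel; it sorts before A, C, G and T
    # Naive suffix array: acceptable for short strings, far too slow for a genome.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for suffix 0).
    bwt = "".join(s[i - 1] for i in sa)
    alphabet = sorted(set(s))
    # first[c]: number of characters in s that are strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += s.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i] (uncompressed table).
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, len(bwt)


def count_word(index, word):
    """Count exact (possibly overlapping) occurrences of `word` by backward search."""
    first, occ, n = index
    lo, hi = 0, n  # current range of sorted suffixes, half-open
    for c in reversed(word):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


def annotate(index, sequence, k):
    """Annotate `sequence` with the count of each of its constituent k-letter words."""
    return [count_word(index, sequence[i:i + k])
            for i in range(len(sequence) - k + 1)]


index = build_index("ACGTACGTTACG")
print(count_word(index, "ACG"))     # 3
print(annotate(index, "ACGTT", 3))  # [3, 2, 1]
```

In this sketch a word whose count is 1 occurs exactly once in the indexed text, which is the kind of signal a unique-probe search would look for; the abstract's actual probe design protocol is not reproduced here.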
