The whole alignment and nothing but the alignment: the problem of spurious alignment flanks
Open Access
- 16 September 2008
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 36 (18), 5863-5871
- https://doi.org/10.1093/nar/gkn579
Abstract
Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On the one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs/amino acids in length. In the UCSC genome database, for example, spurious flanks probably comprise >18% of the human–fugu genome alignment. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple ‘overalignment’ P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.