ETD RECORD

The limits of progressive multiple sequence alignment

Citation

Sheneman, Lucas James.. (2008). The limits of progressive multiple sequence alignment. Theses and Dissertations Collection, University of Idaho Library Digital Collections. https://www.lib.uidaho.edu/digital/etd/items/etd_339.html

Title:
The limits of progressive multiple sequence alignment
Author:
Sheneman, Lucas James.
Date:
2008
Keywords:
Sequence alignment (Bioinformatics) Genetic algorithms
Program:
Bioinformatics & Computational Biology
Abstract:
Biologists use progressive multiple sequence alignment to identify positional homology in regions of molecular sequences. The popularity of this method is due to the pragmatic trade-off between computational efficiency and accuracy. However, progressive alignment has several inherent limitations. This thesis assesses the underlying causes of these limitations and presents novel methodology for improving existing alignment algorithms.;In progressive sequence alignment, a guide tree is inferred and then sequences are pairwise aligned via dynamic programming in the order dictated by the tree. This results in an alignment representing the reconstructed positional homology of the input sequences. This thesis explores the relationship between guide tree topology and alignment accuracy.;We present two distinct genetic algorithms, both of which optimize a population of guide tree topologies using stochastic crossover and mutation operators. One genetic algorithm, EVALYN, achieves high accuracy scores when measured against published benchmark tests. However, we find that EVALYN's destructive crossover reduces the genetic algorithm to a stochastic hill climb.;Most progressive alignment programs infer guide trees using the Neighbor-Joining algorithm. The O(N{esc}p3{esc}s) time complexity of Neighbor-Joining makes it a bottleneck when aligning large datasets. The Relaxed Neighbor-Joining algorithm relaxes the requirements for joining tree nodes, thereby reducing the typical time complexity to O(N{esc}p2{esc}slogN) with no significant qualitative effects. This thesis describes Clearcut, the reference implementation for the Relaxed Neighbor-Joining algorithm. Results show that Clearcut dramatically outperforms existing Neighbor-Joining implementations.;Due to arithmetic ties encountered during the backtracking phase of dynamic programming, a backtracking path is a directed acyclic graph. Each path in the graph results in a different final alignment. The set of distinct alignments has Gaussian distributed accuracy scores. We introduce an unbiased backtracking algorithm and demonstrate how biased backtracking leads to suboptimal alignments.;We explore the correlation between guide tree quality and alignment quality by defining quality in terms of distance from the true alignment or tree and then systematically degrading guide trees via tree edit operators. There is a statistically significant correlation between guide tree quality and alignment quality, although the measured effect of guide tree selection on alignment quality is small.
Description:
Thesis (Ph. D., Bioinformatics and Computational Biology)--University of Idaho, August 2008.
Major Professor:
James A. Foster.
Defense Date:
August 2008.
Type:
Text
Format Original:
xi, 94 leaves :col. ill. ;29 cm.
Format:
record

Contact us about this record

Rights
Rights:
In Copyright - Educational Use Permitted. For more information, please contact University of Idaho Library Special Collections and Archives Department at libspec@uidaho.edu.
Standardized Rights:
http://rightsstatements.org/vocab/InC-EDU/1.0/